├── .gitignore ├── .travis.yml ├── DESIGN.md ├── LICENSE ├── Makefile ├── README.md ├── TODO ├── doc ├── 10.1.1.44.2782.pdf ├── compare-innodb-vs-hanoi.png ├── design_diagrams.graffle ├── design_diagrams.pdf └── sample_result_mba_20min.png ├── include ├── hanoidb.hrl └── plain_rpc.hrl ├── rebar.config ├── src ├── gb_trees_ext.erl ├── hanoidb.app.src ├── hanoidb.erl ├── hanoidb.hrl ├── hanoidb_app.erl ├── hanoidb_bloom.erl ├── hanoidb_dense_bitmap.erl ├── hanoidb_fold_worker.erl ├── hanoidb_level.erl ├── hanoidb_merger.erl ├── hanoidb_nursery.erl ├── hanoidb_reader.erl ├── hanoidb_sparse_bitmap.erl ├── hanoidb_sup.erl ├── hanoidb_util.erl ├── hanoidb_writer.erl ├── plain_rpc.erl └── vbisect.erl ├── test ├── hanoidb_drv.erl ├── hanoidb_merger_tests.erl ├── hanoidb_tests.erl └── hanoidb_writer_tests.erl └── tools ├── basho_bench_driver_hanoidb.erl └── visualize-hanoi.sh /.gitignore: -------------------------------------------------------------------------------- 1 | ebin 2 | deps 3 | *~ 4 | .eunit 5 | .project 6 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: erlang 2 | otp_release: 3 | - R16B03 4 | - R15B03 5 | - 17.0 6 | - 18.0 7 | 8 | 9 | -------------------------------------------------------------------------------- /DESIGN.md: -------------------------------------------------------------------------------- 1 | # Hanoi's Design 2 | 3 | ### Basics 4 | If there are N records, there are in log2(N) levels (each being a plain B-tree in a file named "A-*level*.data"). The file `A-0.data` has 1 record, `A-1.data` has 2 records, `A-2.data` has 4 records, and so on: `A-n.data` has 2n records. 5 | 6 | In "stable state", each level file is either full (there) or empty (not there); so if there are e.g. 20 records stored, then there are only data in filed `A-2.data` (4 records) and `A-4.data` (16 records). 7 | 8 | OK, I've told you a lie. In practice, it is not practical to create a new file for each insert (injection at level #0), so we allows you to define the "top level" to be a number higher that #0; currently defaulting to #5 (32 records). That means that you take the amortization "hit" for ever 32 inserts. 9 | 10 | ### Lookup 11 | Lookup is quite simple: starting at `A-0.data`, the sought for Key is searched in the B-tree there. If nothing is found, search continues to the next data file. So if there are *N* levels, then *N* disk-based B-tree lookups are performed. Each lookup is "guarded" by a bloom filter to improve the likelihood that disk-based searches are only done when likely to succeed. 12 | 13 | ### Insertion 14 | Insertion works by a mechanism known as B-tree injection. Insertion always starts by constructing a fresh B-tree with 1 element in it, and "injecting" that B-tree into level #0. So you always inject a B-tree of the same size as the size of the level you're injecting it into. 15 | 16 | - If the level being injected into empty (there is no A-*level*.data file), then the injected B-tree becomes the contents for that level (we just rename the file). 17 | - Otherwise, 18 | - The injected tree file is renamed to B-*level*.data; 19 | - The files A-*level*.data and B-*level*.data are merged into a new temporary B-tree (of roughly double size), X-*level*.data. 20 | - The outcome of the merge is then injected into the next level. 21 | 22 | While merging, lookups at level *n* first consults the B-*n*.data file, then the A-*n*.data file. At a given level, there can only be one merge operation active. 23 | 24 | ### Overwrite and Delete 25 | Overwrite is done by simply doing a new insertion. Since search always starts from the top (level #0 ... level#*n*), newer values will be at a lower level, and thus be found before older values. When merging, values stored in the injected tree (that come from a lower-numbered level) have priority over the contained tree. 26 | 27 | Deletes are the same: they are also done by inserting a tombstone (a special value outside the domain of values). When a tombstone is merged at the currently highest numbered level it will be discarded. So tombstones have to bubble "down" to the highest numbered level before it can be truly evicted. 28 | 29 | 30 | ## Merge Logic 31 | The really clever thing about this storage mechanism is that merging is guaranteed to be able to "keep up" with insertion. Bitcask for instance has a similar merging phase, but it is separated from insertion. This means that there can suddenly be a lot of catching up to do. The flip side is that you can then decide to do all merging at off-peak hours, but it is yet another thing that need to be configured. 32 | 33 | With LSM B-Trees; back-pressure is provided by the injection mechanism, which only returns when an injection is complete. Thus, every 2nd insert needs to wait for level #0 to finish the required merging; which - assuming merging has linear I/O complexity - is enough to guarantee that the merge mechanism can keep up at higher-numbered levels. 34 | 35 | A further trouble is that merging does in fact not have completely linear I/O complexity, because reading from a small file that was recently written is faster that reading from a file that was written a long time ago (because of OS-level caching); thus doing a merge at level #*N+1* is sometimes more than twice as slow as doing a merge at level #*N*. Because of this, sustained insert pressure may produce a situation where the system blocks while merging, though it does require an extremely high level of inserts. We're considering ways to alleviate this. 36 | 37 | Merging can be going on concurrently at each level (in preparation for an injection to the next level), which lets you utilize available multi-core capacity to merge. 38 | 39 | 40 | ``` 41 | ABC are data files at a given level 42 | A oldest 43 | C newest 44 | X is being merged into from [A+B] 45 | 46 | 270     76 [AB X|ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 47 |  271     76 [ABCX|ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 48 |  272     77 [A   |AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 49 |  273     77 [AB X|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 50 |  274     77 [ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 51 |  275     78 [A   |ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 52 |  276     78 [AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 53 |  277     79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |    |    |    |    |    |    |    |    |    | 54 |  278     79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|  C |AB  |    |    |    |    |    |    |    |    |    | 55 |  279     79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|  C |AB X|    |    |    |    |    |    |    |    |    | 56 |  280     79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A   |AB X|    |    |    |    |    |    |    |    |    | 57 |  281     79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|  C |AB  |AB X|    |    |    |    |    |    |    |    |    | 58 |  282     80 [ABCX|ABCX|ABCX| BC |AB  |AB  |AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 59 |  283     80 [ABCX|ABCX|ABCX|  C |AB X|AB  |AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 60 |  284     80 [A   |AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 61 |  285     80 [AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 62 |  286     80 [ABCX|AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 63 |  287     80 [A   |ABCX|AB X|AB X|AB X|AB X|AB X|AB X|AB X|    |    |    |    |    |    |    |    |    | 64 | ``` 65 | 66 | 67 | When merge finishes, X is moved to the next level [becomes first open slot, in order of A,B,C], and the files merged (AB in this case) are deleted. If there is a C, then that becomes A of the next size. 68 | When X is closed and clean, it is actually intermittently renamed M so that if there is a crash after a merge finishes, and before it is accepted at the next level then the merge work is not lost, i.e. an M file is also clean/closed properly. Thus, if there are M's that means that the incremental merge was not fast enough. 69 | 70 | ABC files have 2^level KVs in it, regardless of the size of those KVs. XM files have 2^(level+1) approximately ... since tombstone merges might reduce the numbers or repeat PUTs of cause. 71 | 72 | ### File Descriptors 73 | Hanoi needs a lot of file descriptors, currently 6*⌈log2(N)-TOP_LEVEL⌉, with a nursery of size 2TOP_LEVEL, and N Key/Value pairs in the store. Thus, storing 1.000.000 KV's need 72 file descriptors, storing 1.000.000.000 records needs 132 file descriptors, 1.000.000.000.000 records needs 192. 74 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | REBAR= rebar 2 | DIALYZER= dialyzer 3 | 4 | 5 | .PHONY: plt analyze all deps compile get-deps clean 6 | 7 | all: get-deps compile 8 | 9 | deps: get-deps compile 10 | 11 | get-deps: 12 | @$(REBAR) get-deps 13 | 14 | compile: 15 | @$(REBAR) compile 16 | 17 | clean: 18 | @$(REBAR) clean 19 | 20 | test: eunit 21 | 22 | eunit: compile clean-test-btrees 23 | @$(REBAR) eunit skip_deps=true 24 | 25 | eunit_console: 26 | erl -pa .eunit deps/*/ebin 27 | 28 | clean-test-btrees: 29 | rm -fr .eunit/Btree_* .eunit/simple 30 | 31 | plt: compile 32 | $(DIALYZER) --build_plt --output_plt .hanoi.plt \ 33 | -pa deps/snappy/ebin \ 34 | -pa deps/snappy/ebin \ 35 | -pa deps/lz4/ebin \ 36 | -pa deps/ebloom/ebin \ 37 | -pa deps/plain_fsm/ebin \ 38 | deps/plain_fsm/ebin \ 39 | --apps erts kernel stdlib ebloom lz4 snappy 40 | 41 | analyze: compile 42 | $(DIALYZER) --plt .hanoi.plt \ 43 | -pa deps/snappy/ebin \ 44 | -pa deps/lz4/ebin \ 45 | -pa deps/ebloom/ebin \ 46 | -pa deps/plain_fsm/ebin \ 47 | ebin 48 | 49 | analyze-nospec: compile 50 | $(DIALYZER) --plt .hanoi.plt \ 51 | -pa deps/plain_fsm/ebin \ 52 | --no_spec \ 53 | ebin 54 | 55 | repl: 56 | erl -pz deps/*/ebin -pa ebin 57 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # HanoiDB Indexed Key/Value Storage 2 | 3 | [![Build Status](https://travis-ci.org/krestenkrab/hanoidb.svg?branch=master)](https://travis-ci.org/krestenkrab/hanoidb) 4 | 5 | HanoiDB implements an indexed, key/value storage engine. The primary index is 6 | a log-structured merge tree (LSM-BTree) implemented using "doubling sizes" 7 | persistent ordered sets of key/value pairs, similar is some regards to 8 | [LevelDB](http://code.google.com/p/leveldb/). HanoiDB includes a visualizer 9 | which when used to watch a living database resembles the "Towers of Hanoi" 10 | puzzle game, which inspired the name of this database. 11 | 12 | ## Features 13 | - Insert, Delete and Read all have worst case *O*(log2(*N*)) latency. 14 | - Incremental space reclaimation: The cost of evicting stale key/values 15 | is amortized into insertion 16 | - you don't need a separate eviction thread to keep memory use low 17 | - you don't need to schedule merges to happen at off-peak hours 18 | - Operations-friendly "append-only" storage 19 | - allows you to backup live system 20 | - crash-recovery is very fast and the logic is straight forward 21 | - all data subject to CRC32 checksums 22 | - data can be compressed on disk to save space 23 | - Efficient range queries 24 | - Riak secondary indexing 25 | - Fast key and bucket listing 26 | - Uses bloom filters to avoid unnecessary lookups on disk 27 | - Time-based expiry of data 28 | - configure the database to expire data older than n seconds 29 | - specify a lifetime in seconds for any particular key/value pair 30 | - Efficient resource utilization 31 | - doesn't store all keys in memory 32 | - uses a modest number of file descriptors proportional to the number of levels 33 | - I/O is generally balanced between random and sequential 34 | - low CPU overhead 35 | - ~2000 lines of pure Erlang code in src/*.erl 36 | 37 | HanoiDB is developed by Trifork, a Riak expert solutions provider, and Basho 38 | Technologies, makers of Riak. HanoiDB can be used in Riak via the 39 | `riak_kv_tower_backend` repository. 40 | 41 | ### Configuration options 42 | 43 | Put these values in your `app.config` in the `hanoidb` section 44 | 45 | ```erlang 46 | {hanoidb, [ 47 | {data_root, "./data/hanoidb"}, 48 | 49 | %% Enable/disable on-disk compression. 50 | %% 51 | {compress, none | gzip}, 52 | 53 | %% Expire (automatically delete) entries after N seconds. 54 | %% When this value is 0 (zero), entries never expire. 55 | %% 56 | {expiry_secs, 0}, 57 | 58 | %% Sync strategy `none' only syncs every time the 59 | %% nursery runs full, which is currently hard coded 60 | %% to be evert 256 inserts or deletes. 61 | %% 62 | %% Sync strategy `sync' will sync the nursery log 63 | %% for every insert or delete operation. 64 | %% 65 | {sync_strategy, none | sync | {seconds, N}}, 66 | 67 | %% The page size is a minimum page size, when a page fills 68 | %% up to beyond this size, it is written to disk. 69 | %% Compression applies to such units of page size. 70 | %% 71 | {page_size, 8192}, 72 | 73 | %% Read/write buffer sizes apply to merge processes. 74 | %% A merge process has two read buffers and a write 75 | %% buffer, and there is a merge process *per level* in 76 | %% the database. 77 | %% 78 | {write_buffer_size, 524288}, % 512kB 79 | {read_buffer_size, 524288}, % 512kB 80 | 81 | %% The merge strategy is one of `fast' or `predictable'. 82 | %% Both have same log2(N) worst case, but `fast' is 83 | %% sometimes faster; yielding latency fluctuations. 84 | %% 85 | {merge_strategy, fast | predictable}, 86 | 87 | %% "Level0" files has 2^N KVs in it, defaulting to 1024. 88 | %% If the database is to contain very small KVs, this is 89 | %% likely too small, and will result in many unnecessary 90 | %% file operations. (Subsequent levels double in size). 91 | {top_level, 10} % 1024 Key/Values 92 | ]}, 93 | ``` 94 | 95 | 96 | ### Contributors 97 | 98 | - Kresten Krab Thorup @krestenkrab 99 | - Greg Burd @gburd 100 | - Jesper Louis Andersen @jlouis 101 | - Steve Vinoski @vinoski 102 | - Erik Søe Sørensen, @eriksoe 103 | - Yamamoto Takashi @yamt 104 | - Joseph Wayne Norton @norton 105 | -------------------------------------------------------------------------------- /TODO: -------------------------------------------------------------------------------- 1 | * Phase 1: Minimum viable product (in order of priority) 2 | * lager; check for uses of lager:error/2 3 | * configurable TOP_LEVEL size 4 | * test new snappy compression support 5 | * status and statistics 6 | * for each level {#merges, {merge-time-min, max, average}} 7 | * add @doc strings and and -spec's 8 | * check to make sure every error returns with a reason {error, Reason} 9 | 10 | 11 | * Phase 2: Production Ready 12 | * dual-nursery 13 | * cache for read-path 14 | * {cache, bytes(), name} share max(bytes) cache named 'name' via etc 15 | * snapshot entire database (fresh directory w/ hard links to all files) 16 | * persist merge progress (to speed up re-opening a HanoiDB) 17 | * support for future file format changes 18 | * Define a standard struct which is the metadata added at the end of the 19 | file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written 20 | in hanoi_writer:flush_nodes, and read in hanoi_reader:open2. 21 | 22 | * Phase 3: Wish List 23 | * add truncate/1 - quickly truncates a database to 0 items 24 | * count/1 - return number of items currently in tree 25 | * adaptive nursery sizing 26 | * backpressure on fold operations 27 | - The "sync_fold" creates a snapshot (hard link to btree files), which 28 | provides consistent behavior but may use a lot of disk space if there is 29 | a lot of insertion going on. 30 | - The "async_fold" folds a limited number, and remembers the last key 31 | serviced, then picks up from there again. So you could see intermittent 32 | puts in a subsequent batch of results. 33 | * add block-level encryption support 34 | 35 | 36 | ## NOTES: 37 | 38 | 1: make the "first level" have more thatn 2^5 entries (controlled by the constant TOP_LEVEL in hanoi.hrl); this means a new set of files is opened/closed/merged for every 32 insert/updates/deletes. Setting this higher will just make the nursery correspondingly larger, which should be absolutely fine. 39 | 40 | 2: Right now, the streaming btree writer emits a btree page based on number of elements. This could be changed to be based on the size of the node (say, some block-size boudary) and then add padding at the end so that each node read becomes a clean block transfer. Right now, we're probably taking way to many reads. 41 | 42 | 3: Also, there is no caching of read nodes. So every time a btree node is visited it is also read from disk and term_to_binary'ed. But we need a caching system for that to work well (https://github.com/cliffmoon/cherly is difficult to build), it needs to be rebar-ified. 43 | 44 | 4: Also, the format for btree nodes could probably be optimized. Right now it's just binary_to_term of a key/value list as far as I remember. Perhaps we dont have to deserialize the entire thing. 45 | 46 | 5: It might also be good to employ a scheduler (github.com/esl/jobs) for issuing merges; because I think that it can be a problem for the OS if there are too many merges going on at the same time. 47 | -------------------------------------------------------------------------------- /doc/10.1.1.44.2782.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/10.1.1.44.2782.pdf -------------------------------------------------------------------------------- /doc/compare-innodb-vs-hanoi.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/compare-innodb-vs-hanoi.png -------------------------------------------------------------------------------- /doc/design_diagrams.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/design_diagrams.pdf -------------------------------------------------------------------------------- /doc/sample_result_mba_20min.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/sample_result_mba_20min.png -------------------------------------------------------------------------------- /include/hanoidb.hrl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | 26 | %% 27 | %% When doing "async fold", it does "sync fold" in chunks 28 | %% of this many K/V entries. 29 | %% 30 | -define(BTREE_ASYNC_CHUNK_SIZE, 100). 31 | 32 | %% 33 | %% The key_range structure is a bit assymetric, here is why: 34 | %% 35 | %% from_key=<<>> is "less than" any other key, hence we don't need to 36 | %% handle from_key=undefined to support an open-ended start of the 37 | %% interval. For to_key, we cannot (statically) construct a key 38 | %% which is > any possible key, hence we need to allow to_key=undefined 39 | %% as a token of an interval that has no upper limit. 40 | %% 41 | -record(key_range, { from_key = <<>> :: binary(), 42 | from_inclusive = true :: boolean(), 43 | to_key :: binary() | undefined, 44 | to_inclusive = false :: boolean(), 45 | limit :: pos_integer() | undefined }). 46 | -------------------------------------------------------------------------------- /include/plain_rpc.hrl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% plain_rpc: RPC module to accompany plain_fsm 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% This file is provided to you under the Apache License, Version 2.0 (the 9 | %% "License"); you may not use this file except in compliance with the License. 10 | %% You may obtain a copy of the License at 11 | %% 12 | %% http://www.apache.org/licenses/LICENSE-2.0 13 | %% 14 | %% Unless required by applicable law or agreed to in writing, software 15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 17 | %% License for the specific language governing permissions and limitations 18 | %% under the License. 19 | %% 20 | %% ---------------------------------------------------------------------------- 21 | 22 | %% 23 | %% This module really belongs in the plain_fsm distro. 24 | %% 25 | 26 | -define(CALL(From,Msg), {'$call', From, Msg}). 27 | -define(REPLY(Ref,Msg), {'$reply', Ref, Msg}). 28 | -define(CAST(From,Msg), {'$cast', From, Msg}). 29 | 30 | -------------------------------------------------------------------------------- /rebar.config: -------------------------------------------------------------------------------- 1 | {cover_enabled, true}. 2 | 3 | {clean_files, ["*.eunit", "ebin/*.beam"]}. 4 | {eunit_opts, [verbose, {report, {eunit_surefire, [{dir, "."}]}}]}. 5 | 6 | {erl_opts, [%{d,'DEBUG',true}, 7 | {d,'USE_EBLOOM',true}, 8 | {parse_transform, lager_transform}, 9 | fail_on_warning, 10 | warn_unused_vars, 11 | warn_export_all, 12 | warn_shadow_vars, 13 | warn_unused_import, 14 | warn_unused_function, 15 | warn_bif_clash, 16 | warn_unused_record, 17 | warn_deprecated_function, 18 | warn_obsolete_guard, 19 | warn_export_vars, 20 | warn_exported_vars, 21 | warn_untyped_record, 22 | % warn_missing_spec, 23 | % strict_validation, 24 | {platform_define, "^R|17", pre18}, 25 | debug_info]}. 26 | 27 | {xref_checks, [undefined_function_calls]}. 28 | 29 | {deps, [ {sext, ".*", {git, "git://github.com/uwiger/sext", {branch, "master"}}} 30 | , {lager, ".*", {git, "git://github.com/basho/lager", {branch, "master"}}} 31 | , {snappy, "1.*", {git, "git://github.com/fdmanana/snappy-erlang-nif.git", {branch, "master"}}} 32 | , {plain_fsm, "1.*", {git, "git://github.com/gburd/plain_fsm", {branch, "master"}}} 33 | % , {basho_bench, ".*", {git, "git://github.com/basho/basho_bench", {branch, "master"}}} 34 | , {ebloom, ".*", {git, "git://github.com/basho/ebloom", {branch, "master"}}} 35 | , {triq, ".*", {git, "git://github.com/krestenkrab/triq", {branch, "master"}}} 36 | , {lz4, ".*", {git, "git://github.com/krestenkrab/erlang-lz4.git", {branch, "master"}}} 37 | % , {edown, "0.3.*", {git, "git://github.com/uwiger/edown.git", {branch, "master"}}} 38 | % , {asciiedoc, "0.1.*", {git, "git://github.com/norton/asciiedoc.git", {branch, "master"}}} 39 | % , {triq, ".*", {git, "git://github.com/krestenkrab/triq.git", {branch, "master"}}} 40 | % , {proper, ".*", {git, "git://github.com/manopapad/proper.git", {branch, "master"}}} 41 | ]}. 42 | -------------------------------------------------------------------------------- /src/gb_trees_ext.erl: -------------------------------------------------------------------------------- 1 | 2 | -module(gb_trees_ext). 3 | -extends(gb_trees). 4 | -export([fold/3]). 5 | 6 | % author: http://erlang.2086793.n4.nabble.com/gb-trees-fold-td2228614.html 7 | 8 | -spec fold(fun((term(), term(), term()) -> term()), term(), gb_trees:tree()) -> term(). 9 | fold(F, A, {_, T}) 10 | when is_function(F, 3) -> 11 | fold_1(F, A, T). 12 | 13 | fold_1(F, Acc0, {Key, Value, Small, Big}) -> 14 | Acc1 = fold_1(F, Acc0, Small), 15 | Acc = F(Key, Value, Acc1), 16 | fold_1(F, Acc, Big); 17 | fold_1(_, Acc, _) -> 18 | Acc. 19 | -------------------------------------------------------------------------------- /src/hanoidb.app.src: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | {application, hanoidb, 26 | [ 27 | {description, ""}, 28 | {vsn, "1.3.0"}, 29 | {registered, []}, 30 | {applications, [ 31 | kernel, 32 | stdlib, 33 | plain_fsm 34 | ]}, 35 | {mod, {hanoidb_app, []}}, 36 | {env, []} 37 | ]}. 38 | -------------------------------------------------------------------------------- /src/hanoidb.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb). 26 | -author('Kresten Krab Thorup '). 27 | 28 | 29 | -behavior(gen_server). 30 | 31 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2, 32 | terminate/2, code_change/3]). 33 | 34 | -export([open/1, open/2, open/3, open_link/1, open_link/2, open_link/3, 35 | transact/2, close/1, get/2, lookup/2, delete/2, put/3, put/4, 36 | fold/3, fold_range/4, destroy/1]). 37 | 38 | -export([get_opt/2, get_opt/3]). 39 | 40 | -include("hanoidb.hrl"). 41 | -include_lib("kernel/include/file.hrl"). 42 | -include_lib("include/hanoidb.hrl"). 43 | -include_lib("include/plain_rpc.hrl"). 44 | 45 | -record(state, { top :: pid(), 46 | nursery :: #nursery{}, 47 | dir :: string(), 48 | opt :: term(), 49 | max_level :: pos_integer()}). 50 | 51 | %% 0 means never expire 52 | -define(DEFAULT_EXPIRY_SECS, 0). 53 | 54 | -ifdef(DEBUG). 55 | -define(log(Fmt,Args),io:format(user,Fmt,Args)). 56 | -else. 57 | -define(log(Fmt,Args),ok). 58 | -endif. 59 | 60 | 61 | %% PUBLIC API 62 | 63 | -type hanoidb() :: pid(). 64 | -type key_range() :: #key_range{}. 65 | -type config_option() :: {compress, none | gzip | snappy | lz4} 66 | | {page_size, pos_integer()} 67 | | {read_buffer_size, pos_integer()} 68 | | {write_buffer_size, pos_integer()} 69 | | {merge_strategy, fast | predictable } 70 | | {sync_strategy, none | sync | {seconds, pos_integer()}} 71 | | {expiry_secs, non_neg_integer()} 72 | | {spawn_opt, list()} 73 | | {top_level, pos_integer()} 74 | . 75 | 76 | %% @doc 77 | %% Create or open a hanoidb store. Argument `Dir' names a 78 | %% directory in which to keep the data files. By convention, we 79 | %% name hanoidb data directories with extension ".hanoidb". 80 | - spec open(Dir::string()) -> {ok, hanoidb()} | ignore | {error, term()}. 81 | open(Dir) -> 82 | open(Dir, []). 83 | 84 | %% @doc Create or open a hanoidb store. 85 | - spec open(Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}. 86 | open(Dir, Opts) -> 87 | ok = start_app(), 88 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []), 89 | gen_server:start(?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]). 90 | 91 | %% @doc Create or open a hanoidb store with a registered name. 92 | - spec open(Name::{local, Name::atom()} | {global, GlobalName::term()} | {via, ViaName::term()}, 93 | Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}. 94 | open(Name, Dir, Opts) -> 95 | ok = start_app(), 96 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []), 97 | gen_server:start(Name, ?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]). 98 | 99 | %% @doc 100 | %% Create or open a hanoidb store as part of a supervision tree. 101 | %% Argument `Dir' names a directory in which to keep the data files. 102 | %% By convention, we name hanoidb data directories with extension 103 | %% ".hanoidb". 104 | - spec open_link(Dir::string()) -> {ok, hanoidb()} | ignore | {error, term()}. 105 | open_link(Dir) -> 106 | open_link(Dir, []). 107 | 108 | %% @doc Create or open a hanoidb store as part of a supervision tree. 109 | - spec open_link(Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}. 110 | open_link(Dir, Opts) -> 111 | ok = start_app(), 112 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []), 113 | gen_server:start_link(?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]). 114 | 115 | %% @doc Create or open a hanoidb store as part of a supervision tree 116 | %% with a registered name. 117 | - spec open_link(Name::{local, Name::atom()} | {global, GlobalName::term()} | {via, ViaName::term()}, 118 | Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}. 119 | open_link(Name, Dir, Opts) -> 120 | ok = start_app(), 121 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []), 122 | gen_server:start_link(Name, ?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]). 123 | 124 | %% @doc 125 | %% Close a Hanoi data store. 126 | - spec close(Ref::pid()) -> ok. 127 | close(Ref) -> 128 | try 129 | gen_server:call(Ref, close, infinity) 130 | catch 131 | exit:{noproc,_} -> ok; 132 | exit:noproc -> ok; 133 | %% Handle the case where the monitor triggers 134 | exit:{normal, _} -> ok 135 | end. 136 | 137 | -spec destroy(Ref::pid()) -> ok. 138 | destroy(Ref) -> 139 | try 140 | gen_server:call(Ref, destroy, infinity) 141 | catch 142 | exit:{noproc,_} -> ok; 143 | exit:noproc -> ok; 144 | %% Handle the case where the monitor triggers 145 | exit:{normal, _} -> ok 146 | end. 147 | 148 | get(Ref,Key) when is_binary(Key) -> 149 | gen_server:call(Ref, {get, Key}, infinity). 150 | 151 | %% for compatibility with original code 152 | lookup(Ref,Key) when is_binary(Key) -> 153 | gen_server:call(Ref, {get, Key}, infinity). 154 | 155 | -spec delete(hanoidb(), binary()) -> 156 | ok | {error, term()}. 157 | delete(Ref,Key) when is_binary(Key) -> 158 | gen_server:call(Ref, {delete, Key}, infinity). 159 | 160 | -spec put(hanoidb(), binary(), binary()) -> 161 | ok | {error, term()}. 162 | put(Ref,Key,Value) when is_binary(Key), is_binary(Value) -> 163 | gen_server:call(Ref, {put, Key, Value, infinity}, infinity). 164 | 165 | -spec put(hanoidb(), binary(), binary(), integer()) -> 166 | ok | {error, term()}. 167 | put(Ref,Key,Value,infinity) when is_binary(Key), is_binary(Value) -> 168 | gen_server:call(Ref, {put, Key, Value, infinity}, infinity); 169 | put(Ref,Key,Value,Expiry) when is_binary(Key), is_binary(Value) -> 170 | gen_server:call(Ref, {put, Key, Value, Expiry}, infinity). 171 | 172 | -type transact_spec() :: {put, binary(), binary()} | {delete, binary()}. 173 | -spec transact(hanoidb(), [transact_spec()]) -> 174 | ok | {error, term()}. 175 | transact(Ref, TransactionSpec) -> 176 | gen_server:call(Ref, {transact, TransactionSpec}, infinity). 177 | 178 | -type kv_fold_fun() :: fun((binary(),binary(),any())->any()). 179 | 180 | -spec fold(hanoidb(),kv_fold_fun(),any()) -> any(). 181 | fold(Ref,Fun,Acc0) -> 182 | fold_range(Ref,Fun,Acc0,#key_range{from_key= <<>>, to_key=undefined}). 183 | 184 | -spec fold_range(hanoidb(),kv_fold_fun(),any(),key_range()) -> any(). 185 | fold_range(Ref,Fun,Acc0,#key_range{limit=Limit}=Range) -> 186 | RangeType = 187 | if Limit < 10 -> blocking_range; 188 | true -> snapshot_range 189 | end, 190 | {ok, FoldWorkerPID} = hanoidb_fold_worker:start(self()), 191 | MRef = erlang:monitor(process, FoldWorkerPID), 192 | ?log("fold_range begin: self=~p, worker=~p monitor=~p~n", [self(), FoldWorkerPID, MRef]), 193 | ok = gen_server:call(Ref, {RangeType, FoldWorkerPID, Range}, infinity), 194 | Result = receive_fold_range(MRef, FoldWorkerPID, Fun, Acc0, Limit), 195 | ?log("fold_range done: self:~p, result=~p~n", [self(), Result]), 196 | Result. 197 | 198 | receive_fold_range(MRef,PID,_,Acc0, 0) -> 199 | erlang:exit(PID, shutdown), 200 | drain_worker(MRef,PID,Acc0); 201 | 202 | receive_fold_range(MRef,PID,Fun,Acc0, Limit) -> 203 | ?log("receive_fold_range:~p,~P~n", [PID,Acc0,10]), 204 | receive 205 | 206 | %% receive one K/V from fold_worker 207 | ?CALL(From, {fold_result, PID, K,V}) -> 208 | plain_rpc:send_reply(From, ok), 209 | case 210 | try 211 | {ok, Fun(K,V,Acc0)} 212 | catch 213 | Class:Exception -> 214 | % TODO ?log("Exception in hanoidb fold: ~p ~p", [Exception, erlang:get_stacktrace()]), 215 | {'EXIT', Class, Exception, erlang:get_stacktrace()} 216 | end 217 | of 218 | {ok, Acc1} -> 219 | receive_fold_range(MRef, PID, Fun, Acc1, decr(Limit)); 220 | Exit -> 221 | %% kill the fold worker ... 222 | erlang:exit(PID, shutdown), 223 | raise(drain_worker(MRef,PID,Exit)) 224 | end; 225 | 226 | ?CAST(_,{fold_limit, PID, _}) -> 227 | ?log("> fold_limit pid=~p, self=~p~n", [PID, self()]), 228 | erlang:demonitor(MRef, [flush]), 229 | Acc0; 230 | ?CAST(_,{fold_done, PID}) -> 231 | ?log("> fold_done pid=~p, self=~p~n", [PID, self()]), 232 | erlang:demonitor(MRef, [flush]), 233 | Acc0; 234 | {'DOWN', MRef, _, _PID, normal} -> 235 | ?log("> fold worker ~p ENDED~n", [_PID]), 236 | Acc0; 237 | {'DOWN', MRef, _, _PID, Reason} -> 238 | ?log("> fold worker ~p DOWN reason:~p~n", [_PID, Reason]), 239 | error({fold_worker_died, Reason}) 240 | end. 241 | 242 | decr(undefined) -> 243 | undefined; 244 | decr(N) -> 245 | N-1. 246 | 247 | %% 248 | %% Just calls erlang:raise with appropriate arguments 249 | %% 250 | raise({'EXIT', Class, Exception, Trace}) -> 251 | erlang:raise(Class, Exception, Trace). 252 | 253 | 254 | drain_worker(MRef, PID, Value) -> 255 | receive 256 | ?CALL(_From,{fold_result, PID, _, _}) -> 257 | drain_worker(MRef, PID, Value); 258 | {'DOWN', MRef, _, _, _} -> 259 | Value; 260 | ?CAST(_,{fold_limit, PID, _}) -> 261 | erlang:demonitor(MRef, [flush]), 262 | Value; 263 | ?CAST(_,{fold_done, PID}) -> 264 | erlang:demonitor(MRef, [flush]), 265 | Value 266 | after 0 -> 267 | Value 268 | end. 269 | 270 | 271 | init([Dir, Opts0]) -> 272 | %% ensure expory_secs option is set in config 273 | Opts = 274 | case get_opt(expiry_secs, Opts0) of 275 | undefined -> 276 | [{expiry_secs, ?DEFAULT_EXPIRY_SECS}|Opts0]; 277 | N when is_integer(N), N >= 0 -> 278 | [{expiry_secs, N}|Opts0] 279 | end, 280 | hanoidb_util:ensure_expiry(Opts), 281 | 282 | {Top, Nur, Max} = 283 | case file:read_file_info(Dir) of 284 | {ok, #file_info{ type=directory }} -> 285 | {ok, TopLevel, MinLevel, MaxLevel} = open_levels(Dir, Opts), 286 | {ok, Nursery} = hanoidb_nursery:recover(Dir, TopLevel, MinLevel, MaxLevel, Opts), 287 | {TopLevel, Nursery, MaxLevel}; 288 | {error, E} when E =:= enoent -> 289 | ok = file:make_dir(Dir), 290 | MinLevel = get_opt(top_level, Opts0, ?TOP_LEVEL), 291 | {ok, TopLevel} = hanoidb_level:open(Dir, MinLevel, undefined, Opts, self()), 292 | MaxLevel = MinLevel, 293 | {ok, Nursery} = hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Opts), 294 | {TopLevel, Nursery, MaxLevel} 295 | end, 296 | {ok, #state{ top=Top, dir=Dir, nursery=Nur, opt=Opts, max_level=Max }}. 297 | 298 | 299 | open_levels(Dir, Options) -> 300 | {ok, Files} = file:list_dir(Dir), 301 | TopLevel0 = get_opt(top_level, Options, ?TOP_LEVEL), 302 | 303 | %% parse file names and find max level 304 | {MinLevel, MaxLevel} = 305 | lists:foldl(fun(FileName, {MinLevel, MaxLevel}) -> 306 | case parse_level(FileName) of 307 | {ok, Level} -> 308 | {erlang:min(MinLevel, Level), 309 | erlang:max(MaxLevel, Level)}; 310 | _ -> 311 | {MinLevel, MaxLevel} 312 | end 313 | end, 314 | {TopLevel0, TopLevel0}, 315 | Files), 316 | 317 | %% remove old nursery data file 318 | NurseryFileName = filename:join(Dir, "nursery.data"), 319 | _ = file:delete(NurseryFileName), 320 | 321 | %% Do enough incremental merge to be sure we won't deadlock in insert 322 | {TopLevel, MaxMerge} = 323 | lists:foldl(fun(LevelNo, {NextLevel, MergeWork0}) -> 324 | {ok, Level} = hanoidb_level:open(Dir, LevelNo, NextLevel, Options, self()), 325 | MergeWork = MergeWork0 + hanoidb_level:unmerged_count(Level), 326 | {Level, MergeWork} 327 | end, 328 | {undefined, 0}, 329 | lists:seq(MaxLevel, MinLevel, -1)), 330 | WorkPerIter = (MaxLevel - MinLevel + 1) * ?BTREE_SIZE(MinLevel), 331 | % error_logger:info_msg("do_merge ... {~p,~p,~p}~n", [TopLevel, WorkPerIter, MaxMerge]), 332 | do_merge(TopLevel, WorkPerIter, MaxMerge, MinLevel), 333 | {ok, TopLevel, MinLevel, MaxLevel}. 334 | 335 | do_merge(TopLevel, _Inc, N, _MinLevel) when N =< 0 -> 336 | ok = hanoidb_level:await_incremental_merge(TopLevel); 337 | do_merge(TopLevel, Inc, N, MinLevel) -> 338 | ok = hanoidb_level:begin_incremental_merge(TopLevel, ?BTREE_SIZE(MinLevel)), 339 | do_merge(TopLevel, Inc, N-Inc, MinLevel). 340 | 341 | 342 | parse_level(FileName) -> 343 | case re:run(FileName, "^[^\\d]+-(\\d+)\\.data$", [{capture,all_but_first,list}]) of 344 | {match,[StringVal]} -> 345 | {ok, list_to_integer(StringVal)}; 346 | _ -> 347 | nomatch 348 | end. 349 | 350 | 351 | handle_info({bottom_level, N}, #state{ nursery=Nursery, top=TopLevel }=State) 352 | when N > State#state.max_level -> 353 | State2 = State#state{ max_level = N, 354 | nursery= hanoidb_nursery:set_max_level(Nursery, N) }, 355 | 356 | _ = hanoidb_level:set_max_level(TopLevel, N), 357 | 358 | {noreply, State2}; 359 | 360 | handle_info(Info,State) -> 361 | error_logger:error_msg("Unknown info ~p~n", [Info]), 362 | {stop,bad_msg,State}. 363 | 364 | handle_cast(Info,State) -> 365 | error_logger:error_msg("Unknown cast ~p~n", [Info]), 366 | {stop,bad_msg,State}. 367 | 368 | 369 | %% premature delete -> cleanup 370 | terminate(normal, _State) -> 371 | ok; 372 | terminate(_Reason, _State) -> 373 | error_logger:info_msg("got terminate(~p, ~p)~n", [_Reason, _State]), 374 | ok. 375 | 376 | code_change(_OldVsn, State, _Extra) -> 377 | {ok, State}. 378 | 379 | 380 | handle_call({snapshot_range, FoldWorkerPID, Range}, _From, State=#state{ top=TopLevel, nursery=Nursery }) -> 381 | hanoidb_nursery:do_level_fold(Nursery, FoldWorkerPID, Range), 382 | Result = hanoidb_level:snapshot_range(TopLevel, FoldWorkerPID, Range), 383 | {reply, Result, State}; 384 | 385 | handle_call({blocking_range, FoldWorkerPID, Range}, _From, State=#state{ top=TopLevel, nursery=Nursery }) -> 386 | hanoidb_nursery:do_level_fold(Nursery, FoldWorkerPID, Range), 387 | Result = hanoidb_level:blocking_range(TopLevel, FoldWorkerPID, Range), 388 | {reply, Result, State}; 389 | 390 | handle_call({put, Key, Value, Expiry}, _From, State) when is_binary(Key), is_binary(Value) -> 391 | {ok, State2} = do_put(Key, Value, Expiry, State), 392 | {reply, ok, State2}; 393 | 394 | handle_call({transact, TransactionSpec}, _From, State) -> 395 | {ok, State2} = do_transact(TransactionSpec, State), 396 | {reply, ok, State2}; 397 | 398 | handle_call({delete, Key}, _From, State) when is_binary(Key) -> 399 | {ok, State2} = do_put(Key, ?TOMBSTONE, infinity, State), 400 | {reply, ok, State2}; 401 | 402 | handle_call({get, Key}, From, State=#state{ top=Top, nursery=Nursery } ) when is_binary(Key) -> 403 | case hanoidb_nursery:lookup(Key, Nursery) of 404 | {value, ?TOMBSTONE} -> 405 | {reply, not_found, State}; 406 | {value, Value} when is_binary(Value) -> 407 | {reply, {ok, Value}, State}; 408 | none -> 409 | _ = hanoidb_level:lookup(Top, Key, fun(Reply) -> gen_server:reply(From, Reply) end), 410 | {noreply, State} 411 | end; 412 | 413 | handle_call(close, _From, State=#state{ nursery=undefined }) -> 414 | {stop, normal, ok, State}; 415 | 416 | handle_call(close, _From, State=#state{ nursery=Nursery, top=Top, dir=Dir, max_level=MaxLevel, opt=Config }) -> 417 | try 418 | ok = hanoidb_nursery:finish(Nursery, Top), 419 | MinLevel = hanoidb_level:level(Top), 420 | {ok, Nursery2} = hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Config), 421 | ok = hanoidb_level:close(Top), 422 | {stop, normal, ok, State#state{ nursery=Nursery2 }} 423 | catch 424 | E:R -> 425 | error_logger:info_msg("exception from close ~p:~p~n", [E,R]), 426 | {stop, normal, ok, State} 427 | end; 428 | 429 | handle_call(destroy, _From, State=#state{top=Top, nursery=Nursery }) -> 430 | TopLevelNumber = hanoidb_level:level(Top), 431 | ok = hanoidb_nursery:destroy(Nursery), 432 | ok = hanoidb_level:destroy(Top), 433 | {stop, normal, ok, State#state{ top=undefined, nursery=undefined, max_level=TopLevelNumber }}. 434 | 435 | -spec do_put(key(), value(), expiry(), #state{}) -> {ok, #state{}}. 436 | do_put(Key, Value, Expiry, State=#state{ nursery=Nursery, top=Top }) when Nursery =/= undefined -> 437 | {ok, Nursery2} = hanoidb_nursery:add(Key, Value, Expiry, Nursery, Top), 438 | {ok, State#state{nursery=Nursery2}}. 439 | 440 | do_transact([{put, Key, Value}], State) -> 441 | do_put(Key, Value, infinity, State); 442 | do_transact([{delete, Key}], State) -> 443 | do_put(Key, ?TOMBSTONE, infinity, State); 444 | do_transact([], State) -> 445 | {ok, State}; 446 | do_transact(TransactionSpec, State=#state{ nursery=Nursery, top=Top }) -> 447 | {ok, Nursery2} = hanoidb_nursery:transact(TransactionSpec, Nursery, Top), 448 | {ok, State#state{ nursery=Nursery2 }}. 449 | 450 | start_app() -> 451 | ok = ensure_started(syntax_tools), 452 | ok = ensure_started(plain_fsm), 453 | ok = ensure_started(?MODULE). 454 | 455 | ensure_started(Application) -> 456 | case application:start(Application) of 457 | ok -> 458 | ok; 459 | {error, {already_started, _}} -> 460 | ok; 461 | {error, Reason} -> 462 | {error, Reason} 463 | end. 464 | 465 | get_opt(Key, Opts) -> 466 | get_opt(Key, Opts, undefined). 467 | 468 | get_opt(Key, Opts, Default) -> 469 | case proplists:get_value(Key, Opts) of 470 | undefined -> 471 | case application:get_env(?MODULE, Key) of 472 | {ok, Value} -> Value; 473 | undefined -> Default 474 | end; 475 | Value -> 476 | Value 477 | end. 478 | -------------------------------------------------------------------------------- /src/hanoidb.hrl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | 26 | %% smallest levels are 1024 entries 27 | -define(TOP_LEVEL, 10). 28 | -define(BTREE_SIZE(Level), (1 bsl (Level))). 29 | -define(FILE_FORMAT, <<"HAN2">>). 30 | -define(FIRST_BLOCK_POS, byte_size(?FILE_FORMAT)). 31 | 32 | -define(TOMBSTONE, 'deleted'). 33 | 34 | -define(KEY_IN_FROM_RANGE(Key,Range), 35 | ((Range#key_range.from_inclusive andalso 36 | (Range#key_range.from_key =< Key)) 37 | orelse 38 | (Range#key_range.from_key < Key))). 39 | 40 | -define(KEY_IN_TO_RANGE(Key,Range), 41 | ((Range#key_range.to_key == undefined) 42 | orelse 43 | ((Range#key_range.to_inclusive andalso 44 | (Key =< Range#key_range.to_key)) 45 | orelse 46 | (Key < Range#key_range.to_key)))). 47 | 48 | -define(KEY_IN_RANGE(Key,Range), 49 | (?KEY_IN_FROM_RANGE(Key,Range) andalso ?KEY_IN_TO_RANGE(Key,Range))). 50 | 51 | 52 | -ifdef(pre18). 53 | -define(TIMESTAMP, now()). 54 | -else. 55 | -define(TIMESTAMP, erlang:timestamp()). 56 | -endif. 57 | 58 | -record(nursery, { log_file :: file:fd(), 59 | dir :: string(), 60 | cache :: gb_trees:tree(binary(), binary()), 61 | total_size=0 :: integer(), 62 | count=0 :: integer(), 63 | last_sync=?TIMESTAMP :: erlang:timestamp(), 64 | min_level :: integer(), 65 | max_level :: integer(), 66 | config=[] :: [{atom(), term()}], 67 | step=0 :: integer(), 68 | merge_done=0 :: integer()}). 69 | 70 | -type kventry() :: { key(), expvalue() } | [ kventry() ]. 71 | -type key() :: binary(). 72 | -type txspec() :: { delete, key() } | { put, key(), value() }. 73 | -type value() :: ?TOMBSTONE | binary(). 74 | -type expiry() :: infinity | integer(). 75 | -type filepos() :: { non_neg_integer(), non_neg_integer() }. 76 | -type expvalue() :: { value(), expiry() } 77 | | value() 78 | | filepos(). 79 | 80 | -ifdef(USE_EBLOOM). 81 | -define(HANOI_BLOOM_TYPE, ebloom). 82 | -else. 83 | -define(HANOI_BLOOM_TYPE, sbloom). 84 | -endif. 85 | 86 | -define(BLOOM_NEW(Size), hanoidb_util:bloom_new(Size, ?HANOI_BLOOM_TYPE)). 87 | -define(BLOOM_TO_BIN(Bloom), hanoidb_util:bloom_to_bin(Bloom)). 88 | -define(BIN_TO_BLOOM(Bin, Fmt), hanoidb_util:bin_to_bloom(Bin, Fmt)). 89 | -define(BLOOM_INSERT(Bloom, Key), hanoidb_util:bloom_insert(Bloom, Key)). 90 | -define(BLOOM_CONTAINS(Bloom, Key), hanoidb_util:bloom_contains(Bloom, Key)). 91 | 92 | %% tags used in the on-disk representation 93 | -define(TAG_KV_DATA, 16#80). 94 | -define(TAG_DELETED, 16#81). 95 | -define(TAG_POSLEN32, 16#82). 96 | -define(TAG_TRANSACT, 16#83). 97 | -define(TAG_KV_DATA2, 16#84). 98 | -define(TAG_DELETED2, 16#85). 99 | -define(TAG_END, 16#FF). 100 | 101 | 102 | -------------------------------------------------------------------------------- /src/hanoidb_app.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_app). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -behaviour(application). 29 | 30 | %% Application callbacks 31 | -export([start/2, stop/1]). 32 | 33 | %% =================================================================== 34 | %% Application callbacks 35 | %% =================================================================== 36 | 37 | start(_StartType, _StartArgs) -> 38 | hanoidb_sup:start_link(). 39 | 40 | stop(_State) -> 41 | ok. 42 | -------------------------------------------------------------------------------- /src/hanoidb_bloom.erl: -------------------------------------------------------------------------------- 1 | % The contents of this file are subject to the Erlang Public License, Version 2 | %% 1.1, (the "License"); you may not use this file except in compliance with 3 | %% the License. You should have received a copy of the Erlang Public License 4 | %% along with this software. If not, it can be retrieved via the world wide web 5 | %% at http://www.erlang.org/. 6 | %% 7 | %% Software distributed under the License is distributed on an "AS IS" basis, 8 | %% WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for 9 | %% the specific language governing rights and limitations under the License. 10 | 11 | %% Based on: Scalable Bloom Filters 12 | %% Paulo Sérgio Almeida, Carlos Baquero, Nuno Preguiça, David Hutchison 13 | %% Information Processing Letters 14 | %% Volume 101, Issue 6, 31 March 2007, Pages 255-261 15 | %% 16 | %% Provides scalable bloom filters that can grow indefinitely while ensuring a 17 | %% desired maximum false positive probability. Also provides standard 18 | %% partitioned bloom filters with a maximum capacity. Bit arrays are 19 | %% dimensioned as a power of 2 to enable reusing hash values across filters 20 | %% through bit operations. Double hashing is used (no need for enhanced double 21 | %% hashing for partitioned bloom filters). 22 | 23 | %% Modified slightly by Justin Sheehy to make it a single file (incorporated 24 | %% the array-based bitarray internally). 25 | -module(hanoidb_bloom). 26 | -author("Paulo Sergio Almeida "). 27 | 28 | -export([sbf/1, sbf/2, sbf/3, sbf/4, 29 | bloom/1, bloom/2, 30 | member/2, add/2, 31 | size/1, capacity/1, 32 | encode/1, decode/1]). 33 | -import(math, [log/1, pow/2]). 34 | 35 | -ifdef(TEST). 36 | -ifdef(EQC). 37 | -include_lib("eqc/include/eqc.hrl"). 38 | -endif. 39 | -include_lib("eunit/include/eunit.hrl"). 40 | -endif. 41 | 42 | -define(W, 27). 43 | 44 | -ifdef(pre18). 45 | -type bitmask() :: array() | any(). 46 | -else. 47 | -type bitmask() :: arrays:array() | any(). 48 | -endif. 49 | 50 | -record(bloom, { 51 | e :: float(), % error probability 52 | n :: non_neg_integer(), % maximum number of elements 53 | mb :: non_neg_integer(), % 2^mb = m, the size of each slice (bitvector) 54 | size :: non_neg_integer(), % number of elements 55 | a :: [bitmask()] % list of bitvectors 56 | }). 57 | 58 | -record(sbf, { 59 | e :: float(), % error probability 60 | r :: float(), % error probability ratio 61 | s :: non_neg_integer(), % log 2 of growth ratio 62 | size :: non_neg_integer(), % number of elements 63 | b :: [#bloom{}] % list of plain bloom filters 64 | }). 65 | 66 | %% Constructors for (fixed capacity) bloom filters 67 | %% 68 | %% N - capacity 69 | %% E - error probability 70 | bloom(N) -> bloom(N, 0.001). 71 | bloom(N, E) when is_number(N), N > 0, 72 | is_float(E), E > 0, E < 1, 73 | N >= 4/E -> % rule of thumb; due to double hashing 74 | bloom(size, N, E); 75 | bloom(N, E) when is_number(N), N >= 0, 76 | is_float(E), E > 0, E < 1 -> 77 | bloom(bits, 32, E). 78 | 79 | bloom(Mode, N, E) -> 80 | K = case Mode of 81 | size -> 1 + trunc(log2(1/E)); 82 | bits -> 1 83 | end, 84 | P = pow(E, 1 / K), 85 | 86 | Mb = 87 | case Mode of 88 | size -> 89 | 1 + trunc(-log2(1 - pow(1 - P, 1 / N))); 90 | bits -> 91 | N 92 | end, 93 | M = 1 bsl Mb, 94 | D = trunc(log(1-P) / log(1-1/M)), 95 | #bloom{e=E, n=D, mb=Mb, size = 0, 96 | a = [bitmask_new(Mb) || _ <- lists:seq(1, K)]}. 97 | 98 | log2(X) -> log(X) / log(2). 99 | 100 | %% Constructors for scalable bloom filters 101 | %% 102 | %% N - initial capacity before expanding 103 | %% E - error probability 104 | %% S - growth ratio when full (log 2) can be 1, 2 or 3 105 | %% R - tightening ratio of error probability 106 | sbf(N) -> sbf(N, 0.001). 107 | sbf(N, E) -> sbf(N, E, 1). 108 | sbf(N, E, 1) -> sbf(N, E, 1, 0.85); 109 | sbf(N, E, 2) -> sbf(N, E, 2, 0.75); 110 | sbf(N, E, 3) -> sbf(N, E, 3, 0.65). 111 | sbf(N, E, S, R) when is_number(N), N > 0, 112 | is_float(E), E > 0, E < 1, 113 | is_integer(S), S > 0, S < 4, 114 | is_float(R), R > 0, R < 1, 115 | N >= 4/(E*(1-R)) -> % rule of thumb; due to double hashing 116 | #sbf{e=E, s=S, r=R, size=0, b=[bloom(N, E*(1-R))]}. 117 | 118 | %% Returns number of elements 119 | %% 120 | size(#bloom{size=Size}) -> Size; 121 | size(#sbf{size=Size}) -> Size. 122 | 123 | %% Returns capacity 124 | %% 125 | capacity(#bloom{n=N}) -> N; 126 | capacity(#sbf{}) -> infinity. 127 | 128 | %% Test for membership 129 | %% 130 | member(Elem, #bloom{mb=Mb}=B) -> 131 | Hashes = make_hashes(Mb, Elem), 132 | hash_member(Hashes, B); 133 | member(Elem, #sbf{b=[H|_]}=Sbf) -> 134 | Hashes = make_hashes(H#bloom.mb, Elem), 135 | hash_member(Hashes, Sbf). 136 | 137 | hash_member(Hashes, #bloom{mb=Mb, a=A}) -> 138 | Mask = 1 bsl Mb -1, 139 | {I1, I0} = make_indexes(Mask, Hashes), 140 | all_set(Mask, I1, I0, A); 141 | hash_member(Hashes, #sbf{b=B}) -> 142 | lists:any(fun(X) -> hash_member(Hashes, X) end, B). 143 | 144 | make_hashes(Mb, E) when Mb =< 16 -> 145 | erlang:phash2({E}, 1 bsl 32); 146 | make_hashes(Mb, E) when Mb =< 32 -> 147 | {erlang:phash2({E}, 1 bsl 32), erlang:phash2([E], 1 bsl 32)}. 148 | 149 | make_indexes(Mask, {H0, H1}) when Mask > 1 bsl 16 -> masked_pair(Mask, H0, H1); 150 | make_indexes(Mask, {H0, _}) -> make_indexes(Mask, H0); 151 | make_indexes(Mask, H0) -> masked_pair(Mask, H0 bsr 16, H0). 152 | 153 | masked_pair(Mask, X, Y) -> {X band Mask, Y band Mask}. 154 | 155 | all_set(_Mask, _I1, _I, []) -> true; 156 | all_set(Mask, I1, I, [H|T]) -> 157 | bitmask_get(I, H) andalso all_set(Mask, I1, (I+I1) band Mask, T). 158 | 159 | %% Adds element to set 160 | %% 161 | add(Elem, #bloom{mb=Mb} = B) -> 162 | Hashes = make_hashes(Mb, Elem), 163 | hash_add(Hashes, B); 164 | add(Elem, #sbf{size=Size, r=R, s=S, b=[H|T]=Bs}=Sbf) -> 165 | #bloom{mb=Mb, e=E, n=N, size=HSize} = H, 166 | Hashes = make_hashes(Mb, Elem), 167 | case hash_member(Hashes, Sbf) of 168 | true -> Sbf; 169 | false -> 170 | case HSize < N of 171 | true -> Sbf#sbf{size=Size+1, b=[hash_add(Hashes, H)|T]}; 172 | false -> 173 | B = add(Elem, bloom(bits, Mb + S, E * R)), 174 | Sbf#sbf{size=Size+1, b=[B|Bs]} 175 | end 176 | end. 177 | 178 | hash_add(Hashes, #bloom{mb=Mb, a=A, size=Size} = B) -> 179 | Mask = 1 bsl Mb -1, 180 | {I1, I0} = make_indexes(Mask, Hashes), 181 | B#bloom{size=Size+1, a=set_bits(Mask, I1, I0, A, [])}. 182 | 183 | set_bits(_Mask, _I1, _I, [], Acc) -> lists:reverse(Acc); 184 | set_bits(Mask, I1, I, [H|T], Acc) -> 185 | set_bits(Mask, I1, (I+I1) band Mask, T, [bitmask_set(I, H) | Acc]). 186 | 187 | 188 | %%%========== Dispatch to appropriate representation: 189 | bitmask_new(LogN) -> 190 | if LogN >= 20 -> % Use sparse representation. 191 | hanoidb_sparse_bitmap:new(LogN); 192 | true -> % Use dense representation. 193 | hanoidb_dense_bitmap:new(1 bsl LogN) 194 | end. 195 | 196 | bitmask_set(I, BM) -> 197 | case element(1,BM) of 198 | array -> bitarray_set(I, as_array(BM)); 199 | sparse_bitmap -> hanoidb_sparse_bitmap:set(I, BM); 200 | dense_bitmap_ets -> hanoidb_dense_bitmap:set(I, BM); 201 | dense_bitmap -> 202 | %% Surprise - we need to mutate a built representation: 203 | hanoidb_dense_bitmap:set(I, hanoidb_dense_bitmap:unbuild(BM)) 204 | end. 205 | 206 | %%% Convert to external form. 207 | bitmask_build(BM) -> 208 | case element(1,BM) of 209 | array -> BM; 210 | sparse_bitmap -> BM; 211 | dense_bitmap -> BM; 212 | dense_bitmap_ets -> hanoidb_dense_bitmap:build(BM) 213 | end. 214 | 215 | bitmask_get(I, BM) -> 216 | case element(1,BM) of 217 | array -> bitarray_get(I, as_array(BM)); 218 | sparse_bitmap -> hanoidb_sparse_bitmap:member(I, BM); 219 | dense_bitmap_ets -> hanoidb_dense_bitmap:member(I, BM); 220 | dense_bitmap -> hanoidb_dense_bitmap:member(I, BM) 221 | end. 222 | 223 | -ifdef(pre18). 224 | -spec as_array(bitmask()) -> array(). 225 | -else. 226 | -spec as_array(bitmask()) -> arrays:array(). 227 | -endif. 228 | as_array(BM) -> 229 | case array:is_array(BM) of 230 | true -> BM 231 | end. 232 | 233 | %%%========== Bitarray representation - suitable for sparse arrays ========== 234 | bitarray_new(N) -> array:new((N-1) div ?W + 1, {default, 0}). 235 | 236 | -ifdef(pre18). 237 | -spec bitarray_set( non_neg_integer(), array() ) -> array(). 238 | -else. 239 | -spec bitarray_set( non_neg_integer(), arrays:array() ) -> arrays:array(). 240 | -endif. 241 | 242 | bitarray_set(I, A1) -> 243 | A = as_array(A1), 244 | AI = I div ?W, 245 | V = array:get(AI, A), 246 | V1 = V bor (1 bsl (I rem ?W)), 247 | if V =:= V1 -> A; % The bit is already set 248 | true -> array:set(AI, V1, A) 249 | end. 250 | 251 | -ifdef(pre18). 252 | -spec bitarray_get( non_neg_integer(), array() ) -> boolean(). 253 | -else. 254 | -spec bitarray_get( non_neg_integer(), arrays:array() ) -> boolean(). 255 | -endif. 256 | bitarray_get(I, A) -> 257 | AI = I div ?W, 258 | V = array:get(AI, A), 259 | (V band (1 bsl (I rem ?W))) =/= 0. 260 | 261 | %%%^^^^^^^^^^ Bitarray representation - suitable for sparse arrays ^^^^^^^^^^ 262 | 263 | encode(Bloom) -> 264 | zlib:gzip(term_to_binary(bloom_build(Bloom))). 265 | 266 | decode(Bin) -> 267 | binary_to_term(zlib:gunzip(Bin)). 268 | 269 | %%% Convert to external form. 270 | bloom_build(Bloom=#bloom{a=Bitmasks}) -> 271 | Bloom#bloom{a=[bitmask_build(X) || X <- Bitmasks]}; 272 | bloom_build(Sbf=#sbf{b=Blooms}) -> 273 | Sbf#sbf{b=[bloom_build(X) || X <- Blooms]}. 274 | 275 | %% UNIT TESTS 276 | 277 | -ifdef(TEST). 278 | -ifdef(EQC). 279 | 280 | prop_bloom_test_() -> 281 | {timeout, 60, fun() -> ?assert(eqc:quickcheck(prop_bloom())) end}. 282 | 283 | g_keys() -> 284 | non_empty(list(non_empty(binary()))). 285 | 286 | prop_bloom() -> 287 | ?FORALL(Keys, g_keys(), 288 | begin 289 | Bloom = ?MODULE:bloom(Keys), 290 | F = fun(X) -> member(X, Bloom) end, 291 | lists:all(F, Keys) 292 | end). 293 | 294 | -endif. 295 | -endif. 296 | -------------------------------------------------------------------------------- /src/hanoidb_dense_bitmap.erl: -------------------------------------------------------------------------------- 1 | -module(hanoidb_dense_bitmap). 2 | 3 | -export([new/1, set/2, build/1, unbuild/1, member/2]). 4 | -define(BITS_PER_CELL, 32). 5 | 6 | -define(REPR_NAME, dense_bitmap). 7 | 8 | new(N) -> 9 | Tab = ets:new(dense_bitmap, [private, set]), 10 | Width = 1 + (N-1) div ?BITS_PER_CELL, 11 | Value = erlang:make_tuple(Width+1, 0, [{1,?REPR_NAME}]), 12 | ets:insert(Tab, Value), 13 | {dense_bitmap_ets, N, Width, Tab}. 14 | 15 | %% Set a bit. 16 | set(I, {dense_bitmap_ets, _,_, Tab}=DBM) -> 17 | Cell = 2 + I div ?BITS_PER_CELL, 18 | BitInCell = I rem ?BITS_PER_CELL, 19 | Old = ets:lookup_element(Tab, ?REPR_NAME, Cell), 20 | New = Old bor (1 bsl BitInCell), 21 | if New =:= Old -> 22 | ok; % The bit is already set 23 | true -> 24 | ets:update_element(Tab, ?REPR_NAME, {Cell,New}) 25 | end, 26 | DBM. 27 | 28 | build({dense_bitmap_ets, _, _, Tab}) -> 29 | [Row] = ets:lookup(Tab, ?REPR_NAME), 30 | ets:delete(Tab), 31 | Row. 32 | 33 | unbuild(Row) when element(1,Row)==?REPR_NAME -> 34 | Tab = ets:new(dense_bitmap, [private, set]), 35 | ets:insert(Tab, Row), 36 | {dense_bitmap_ets, undefined, undefined, Tab}. 37 | 38 | member(I, Row) when element(1,Row)==?REPR_NAME -> 39 | Cell = 2 + I div ?BITS_PER_CELL, 40 | BitInCell = I rem ?BITS_PER_CELL, 41 | CellValue = element(Cell, Row), 42 | CellValue band (1 bsl BitInCell) =/= 0; 43 | member(I, {dense_bitmap_ets, _,_, Tab}) -> 44 | Cell = 2 + I div ?BITS_PER_CELL, 45 | BitInCell = I rem ?BITS_PER_CELL, 46 | CellValue = ets:lookup_element(Tab, ?REPR_NAME, Cell), 47 | CellValue band (1 bsl BitInCell) =/= 0. 48 | -------------------------------------------------------------------------------- /src/hanoidb_fold_worker.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_fold_worker). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -ifdef(DEBUG). 29 | -define(log(Fmt,Args),io:format(user,Fmt,Args)). 30 | -else. 31 | -define(log(Fmt,Args),ok). 32 | -endif. 33 | 34 | %% 35 | %% This worker is used to merge fold results from individual 36 | %% levels. First, it receives a message 37 | %% 38 | %% {initialize, [LevelWorker, ...]} 39 | %% 40 | %% And then from each LevelWorker, a sequence of 41 | %% 42 | %% {level_result, LevelWorker, Key1, Value} 43 | %% {level_result, LevelWorker, Key2, Value} 44 | %% {level_result, LevelWorker, Key3, Value} 45 | %% {level_result, LevelWorker, Key4, Value} 46 | %% {level_results, LevelWorker, [{Key,Value}...]} %% alternatively 47 | %% ... 48 | %% {level_done, LevelWorker} 49 | %% 50 | %% The order of level workers in the initialize messge is top-down, 51 | %% which is used to select between same-key messages from different 52 | %% levels. 53 | %% 54 | %% This fold_worker process will then send to a designated SendTo target 55 | %% a similar sequence of messages 56 | %% 57 | %% {fold_result, self(), Key1, Value} 58 | %% {fold_result, self(), Key2, Value} 59 | %% {fold_result, self(), Key3, Value} 60 | %% ... 61 | %% {fold_done, self()}. 62 | %% 63 | 64 | -export([start/1]). 65 | -behavior(plain_fsm). 66 | -export([data_vsn/0, code_change/3]). 67 | 68 | -include("hanoidb.hrl"). 69 | -include("plain_rpc.hrl"). 70 | 71 | -record(state, {sendto :: pid(), sendto_ref :: reference()}). 72 | 73 | start(SendTo) -> 74 | F = fun() -> 75 | ?log("fold_worker started ~p~n", [self()]), 76 | process_flag(trap_exit, true), 77 | MRef = erlang:monitor(process, SendTo), 78 | try 79 | initialize(#state{sendto=SendTo, sendto_ref=MRef}, []), 80 | ?log("fold_worker done ~p~n", [self()]) 81 | catch 82 | Class:Ex -> 83 | ?log("fold_worker exception ~p:~p ~p~n", [Class, Ex, erlang:get_stacktrace()]), 84 | error_logger:error_msg("Unexpected: ~p:~p ~p~n", [Class, Ex, erlang:get_stacktrace()]), 85 | exit({bad, Class, Ex, erlang:get_stacktrace()}) 86 | end 87 | end, 88 | PID = plain_fsm:spawn(?MODULE, F), 89 | {ok, PID}. 90 | 91 | initialize(State, PrefixFolders) -> 92 | Parent = plain_fsm:info(parent), 93 | receive 94 | {prefix, [_]=Folders} -> 95 | initialize(State, Folders); 96 | 97 | {initialize, Folders} -> 98 | Queues = [ {PID,queue:new()} || PID <- (PrefixFolders ++ Folders) ], 99 | Initial = [ {PID,undefined} || PID <- (PrefixFolders ++ Folders) ], 100 | fill(State, Initial, Queues, PrefixFolders ++ Folders); 101 | 102 | %% gen_fsm handling 103 | {system, From, Req} -> 104 | plain_fsm:handle_system_msg( 105 | Req, From, State, fun(S1) -> initialize(S1, PrefixFolders) end); 106 | 107 | {'DOWN', MRef, _, _, _} when MRef =:= State#state.sendto_ref -> 108 | ok; 109 | 110 | {'EXIT', Parent, Reason} -> 111 | plain_fsm:parent_EXIT(Reason, State) 112 | end. 113 | 114 | fill(State, Values, Queues, []) -> 115 | emit_next(State, Values, Queues); 116 | 117 | fill(State, Values, Queues, [PID|Rest]=PIDs) -> 118 | % io:format(user, "v=~P, q=~P, pids=~p~n", [Values, 10, Queues, 10, PIDs]), 119 | case lists:keyfind(PID, 1, Queues) of 120 | {PID, Q} -> 121 | case queue:out(Q) of 122 | {empty, Q} -> 123 | fill_from_inbox(State, Values, Queues, [PID], PIDs); 124 | 125 | {{value, Msg}, Q2} -> 126 | Queues2 = lists:keyreplace(PID, 1, Queues, {PID, Q2}), 127 | 128 | case Msg of 129 | done -> 130 | fill(State, lists:keydelete(PID, 1, Values), Queues2, Rest); 131 | {_Key, _Value}=KV -> 132 | fill(State, lists:keyreplace(PID, 1, Values, {PID, KV}), Queues2, Rest) 133 | end 134 | end 135 | end. 136 | 137 | fill_from_inbox(State, Values, Queues, [], PIDs) -> 138 | fill(State, Values, Queues, PIDs); 139 | 140 | fill_from_inbox(State, Values, Queues, [PID|_]=PIDs, SavePIDs) -> 141 | ?log("waiting for ~p~n", [PIDs]), 142 | receive 143 | {level_done, PID} -> 144 | ?log("got {done, ~p}~n", [PID]), 145 | Queues2 = enter(PID, done, Queues), 146 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs); 147 | 148 | {level_limit, PID, Key} -> 149 | ?log("got {limit, ~p}~n", [PID]), 150 | Queues2 = enter(PID, {Key, limit}, Queues), 151 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs); 152 | 153 | {level_result, PID, Key, Value} -> 154 | ?log("got {result, ~p}~n", [PID]), 155 | Queues2 = enter(PID, {Key, Value}, Queues), 156 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs); 157 | 158 | ?CALL(From,{level_results, PID, KVs}) -> 159 | ?log("got {results, ~p}~n", [PID]), 160 | plain_rpc:send_reply(From,ok), 161 | Queues2 = enter_many(PID, KVs, Queues), 162 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs); 163 | 164 | %% gen_fsm handling 165 | {system, From, Req} -> 166 | plain_fsm:handle_system_msg( 167 | Req, From, State, fun(S1) -> fill_from_inbox(S1, Values, Queues, PIDs, SavePIDs) end); 168 | 169 | {'DOWN', MRef, _, _, _} when MRef =:= State#state.sendto_ref -> 170 | ok; 171 | 172 | {'EXIT', Parent, Reason}=Msg -> 173 | case plain_fsm:info(parent) == Parent of 174 | true -> 175 | plain_fsm:parent_EXIT(Reason, State); 176 | false -> 177 | error_logger:info_msg("unhandled EXIT message ~p~n", [Msg]), 178 | fill_from_inbox(State, Values, Queues, PIDs, SavePIDs) 179 | end 180 | 181 | end. 182 | 183 | enter(PID, Msg, Queues) -> 184 | {PID, Q} = lists:keyfind(PID, 1, Queues), 185 | Q2 = queue:in(Msg, Q), 186 | lists:keyreplace(PID, 1, Queues, {PID, Q2}). 187 | 188 | enter_many(PID, Msgs, Queues) -> 189 | {PID, Q} = lists:keyfind(PID, 1, Queues), 190 | Q2 = lists:foldl(fun queue:in/2, Q, Msgs), 191 | lists:keyreplace(PID, 1, Queues, {PID, Q2}). 192 | 193 | emit_next(State, [], _Queues) -> 194 | ?log( "emit_next ~p~n", [[]]), 195 | Msg = {fold_done, self()}, 196 | Target = State#state.sendto, 197 | ?log( "~p ! ~p~n", [Target, Msg]), 198 | _ = plain_rpc:cast(Target, Msg), 199 | end_of_fold(State); 200 | 201 | emit_next(State, [{FirstPID,FirstKV}|Rest]=Values, Queues) -> 202 | ?log( "emit_next ~p~n", [Values]), 203 | case 204 | lists:foldl(fun({P,{K1,_}=KV}, {{K2,_},_}) when K1 < K2 -> 205 | {KV,[P]}; 206 | ({P,{K,_}}, {{K,_}=KV,List}) -> 207 | {KV, [P|List]}; 208 | (_, Found) -> 209 | Found 210 | end, 211 | {FirstKV,[FirstPID]}, 212 | Rest) 213 | of 214 | {{_, ?TOMBSTONE}, FillFrom} -> 215 | fill(State, Values, Queues, FillFrom); 216 | {{Key, limit}, _} -> 217 | ?log( "~p ! ~p~n", [State#state.sendto, {fold_limit, self(), Key}]), 218 | _ = plain_rpc:cast(State#state.sendto, {fold_limit, self(), Key}), 219 | end_of_fold(State); 220 | {{Key, Value}, FillFrom} -> 221 | ?log( "~p ! ~p~n", [State#state.sendto, {fold_result, self(), Key, '...'}]), 222 | plain_rpc:call(State#state.sendto, {fold_result, self(), Key, Value}), 223 | fill(State, Values, Queues, FillFrom) 224 | end. 225 | 226 | end_of_fold(_State) -> 227 | ok. 228 | 229 | data_vsn() -> 230 | 5. 231 | 232 | code_change(_OldVsn, _State, _Extra) -> 233 | {ok, {#state{}, data_vsn()}}. 234 | 235 | 236 | -------------------------------------------------------------------------------- /src/hanoidb_merger.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_merger). 26 | -author('Kresten Krab Thorup '). 27 | -author('Gregory Burd '). 28 | 29 | %% @doc Merging two Indexes 30 | 31 | -export([start/6, merge/6]). 32 | 33 | -include("hanoidb.hrl"). 34 | -include("include/plain_rpc.hrl"). 35 | 36 | %% A merger which is inactive for this long will sleep which means that it will 37 | %% close open files, and compress the current bloom filter. 38 | -define(HIBERNATE_TIMEOUT, 5000). 39 | 40 | %% Most likely, there will be plenty of I/O being generated by concurrent 41 | %% merges, so we default to running the entire merge in one process. 42 | -define(LOCAL_WRITER, true). 43 | 44 | 45 | -spec start(string(), string(), string(), integer(), boolean(), list()) -> pid(). 46 | start(A,B,X, Size, IsLastLevel, Options) -> 47 | Owner = self(), 48 | plain_fsm:spawn_link(?MODULE, fun() -> 49 | try 50 | {ok, OutCount} = hanoidb_merger:merge(A, B, X, 51 | Size, 52 | IsLastLevel, 53 | Options), 54 | 55 | Owner ! ?CAST(self(),{merge_done, OutCount, X}) 56 | catch 57 | C:E -> 58 | %% this semi-bogus code makes sure we always get a stack trace if merging fails 59 | error_logger:error_msg("~p: merge failed ~p:~p ~p -> ~s~n", 60 | [self(), C,E,erlang:get_stacktrace(), X]), 61 | erlang:raise(C,E,erlang:get_stacktrace()) 62 | end 63 | end). 64 | 65 | -spec merge(string(), string(), string(), integer(), boolean(), list()) -> {ok, integer()}. 66 | merge(A,B,C, Size, IsLastLevel, Options) -> 67 | {ok, IXA} = hanoidb_reader:open(A, [sequential|Options]), 68 | {ok, IXB} = hanoidb_reader:open(B, [sequential|Options]), 69 | {ok, Out} = hanoidb_writer:init([C, [{size, Size} | Options]]), 70 | AKVs = 71 | case hanoidb_reader:first_node(IXA) of 72 | {kvlist, AKV} -> AKV; 73 | none -> [] 74 | end, 75 | BKVs = 76 | case hanoidb_reader:first_node(IXB) of 77 | {kvlist, BKV} ->BKV; 78 | none -> [] 79 | end, 80 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {0, none}). 81 | 82 | terminate(Out) -> 83 | {ok, Count, Out1} = hanoidb_writer:handle_call(count, self(), Out), 84 | {stop, normal, ok, _Out2} = hanoidb_writer:handle_call(close, self(), Out1), 85 | {ok, Count}. 86 | 87 | step(S) -> 88 | step(S, 1). 89 | 90 | step({N, From}, Steps) -> 91 | {N-Steps, From}. 92 | 93 | hibernate_scan(Keep) -> 94 | erlang:garbage_collect(), 95 | receive 96 | {step, From, HowMany} -> 97 | {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, N} = erlang:binary_to_term(Keep), 98 | scan(hanoidb_reader:deserialize(IXA), 99 | hanoidb_reader:deserialize(IXB), 100 | hanoidb_writer:deserialize(Out), 101 | IsLastLevel, AKVs, BKVs, {N+HowMany, From}); 102 | 103 | %% gen_fsm handling 104 | {system, From, Req} -> 105 | plain_fsm:handle_system_msg( 106 | Req, From, Keep, fun hibernate_scan/1); 107 | 108 | {'EXIT', Parent, Reason} -> 109 | case plain_fsm:info(parent) of 110 | Parent -> 111 | plain_fsm:parent_EXIT(Reason, Keep) 112 | end 113 | 114 | end. 115 | 116 | 117 | hibernate_scan_only(Keep) -> 118 | erlang:garbage_collect(), 119 | receive 120 | {step, From, HowMany} -> 121 | {IX, OutBin, IsLastLevel, KVs, N} = erlang:binary_to_term(Keep), 122 | scan_only(hanoidb_reader:deserialize(IX), 123 | hanoidb_writer:deserialize(OutBin), 124 | IsLastLevel, KVs, {N+HowMany, From}); 125 | 126 | %% gen_fsm handling 127 | {system, From, Req} -> 128 | plain_fsm:handle_system_msg( 129 | Req, From, Keep, fun hibernate_scan_only/1); 130 | 131 | {'EXIT', Parent, Reason} -> 132 | case plain_fsm:info(parent) of 133 | Parent -> 134 | plain_fsm:parent_EXIT(Reason, Keep) 135 | end 136 | end. 137 | 138 | 139 | receive_scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}) -> 140 | 141 | receive 142 | {step, From, HowMany} -> 143 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N+HowMany, From}); 144 | 145 | %% gen_fsm handling 146 | {system, From, Req} -> 147 | plain_fsm:handle_system_msg( 148 | Req, From, {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}}, 149 | fun({IXA2, IXB2, Out2, IsLastLevel2, AKVs2, BKVs2, {N2, FromPID2}}) -> 150 | receive_scan(IXA2, IXB2, Out2, IsLastLevel2, AKVs2, BKVs2, {N2, FromPID2}) 151 | end); 152 | 153 | {'EXIT', Parent, Reason} -> 154 | case plain_fsm:info(parent) of 155 | Parent -> 156 | plain_fsm:parent_EXIT(Reason, {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}}) 157 | end 158 | 159 | after ?HIBERNATE_TIMEOUT -> 160 | Args = {hanoidb_reader:serialize(IXA), 161 | hanoidb_reader:serialize(IXB), 162 | hanoidb_writer:serialize(Out), IsLastLevel, AKVs, BKVs, N}, 163 | Keep = erlang:term_to_binary(Args, [compressed]), 164 | hibernate_scan(Keep) 165 | end. 166 | 167 | 168 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}) when N < 1, AKVs =/= [], BKVs =/= [] -> 169 | case FromPID of 170 | none -> 171 | ok; 172 | {PID, Ref} -> 173 | PID ! {Ref, step_done} 174 | end, 175 | 176 | receive_scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}); 177 | 178 | scan(IXA, IXB, Out, IsLastLevel, [], BKVs, Step) -> 179 | case hanoidb_reader:next_node(IXA) of 180 | {kvlist, AKVs} -> 181 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, Step); 182 | end_of_data -> 183 | hanoidb_reader:close(IXA), 184 | scan_only(IXB, Out, IsLastLevel, BKVs, Step) 185 | end; 186 | 187 | scan(IXA, IXB, Out, IsLastLevel, AKVs, [], Step) -> 188 | case hanoidb_reader:next_node(IXB) of 189 | {kvlist, BKVs} -> 190 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, Step); 191 | end_of_data -> 192 | hanoidb_reader:close(IXB), 193 | scan_only(IXA, Out, IsLastLevel, AKVs, Step) 194 | end; 195 | 196 | scan(IXA, IXB, Out, IsLastLevel, [{Key1,Value1}|AT]=_AKVs, [{Key2,_Value2}|_IX]=BKVs, Step) 197 | when Key1 < Key2 -> 198 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key1, Value1}, Out), 199 | scan(IXA, IXB, Out3, IsLastLevel, AT, BKVs, step(Step)); 200 | scan(IXA, IXB, Out, IsLastLevel, [{Key1,_Value1}|_AT]=AKVs, [{Key2,Value2}|IX]=_BKVs, Step) 201 | when Key1 > Key2 -> 202 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key2, Value2}, Out), 203 | scan(IXA, IXB, Out3, IsLastLevel, AKVs, IX, step(Step)); 204 | scan(IXA, IXB, Out, IsLastLevel, [{_Key1,_Value1}|AT]=_AKVs, [{Key2,Value2}|IX]=_BKVs, Step) -> 205 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key2, Value2}, Out), 206 | scan(IXA, IXB, Out3, IsLastLevel, AT, IX, step(Step, 2)). 207 | 208 | 209 | receive_scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID}) -> 210 | 211 | 212 | receive 213 | {step, From, HowMany} -> 214 | scan_only(IX, Out, IsLastLevel, KVs, {N+HowMany, From}); 215 | 216 | %% gen_fsm handling 217 | {system, From, Req} -> 218 | plain_fsm:handle_system_msg( 219 | Req, From, {IX, Out, IsLastLevel, KVs, {N, FromPID}}, 220 | fun({IX2, Out2, IsLastLevel2, KVs2, {N2, FromPID2}}) -> 221 | receive_scan_only(IX2, Out2, IsLastLevel2, KVs2, {N2, FromPID2}) 222 | end); 223 | 224 | {'EXIT', Parent, Reason} -> 225 | case plain_fsm:info(parent) of 226 | Parent -> 227 | plain_fsm:parent_EXIT(Reason, {IX, Out, IsLastLevel, KVs, {N, FromPID}}) 228 | end 229 | 230 | after ?HIBERNATE_TIMEOUT -> 231 | Args = {hanoidb_reader:serialize(IX), 232 | hanoidb_writer:serialize(Out), IsLastLevel, KVs, N}, 233 | Keep = erlang:term_to_binary(Args, [compressed]), 234 | hibernate_scan_only(Keep) 235 | end. 236 | 237 | 238 | 239 | scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID}) when N < 1, KVs =/= [] -> 240 | case FromPID of 241 | none -> 242 | ok; 243 | {PID, Ref} -> 244 | PID ! {Ref, step_done} 245 | end, 246 | 247 | receive_scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID}); 248 | 249 | scan_only(IX, Out, IsLastLevel, [], {_, FromPID}=Step) -> 250 | case hanoidb_reader:next_node(IX) of 251 | {kvlist, KVs} -> 252 | scan_only(IX, Out, IsLastLevel, KVs, Step); 253 | end_of_data -> 254 | case FromPID of 255 | none -> 256 | ok; 257 | {PID, Ref} -> 258 | PID ! {Ref, step_done} 259 | end, 260 | hanoidb_reader:close(IX), 261 | terminate(Out) 262 | end; 263 | 264 | scan_only(IX, Out, true, [{_,?TOMBSTONE}|Rest], Step) -> 265 | scan_only(IX, Out, true, Rest, step(Step)); 266 | 267 | scan_only(IX, Out, IsLastLevel, [{Key,Value}|Rest], Step) -> 268 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key, Value}, Out), 269 | scan_only(IX, Out3, IsLastLevel, Rest, step(Step)). 270 | -------------------------------------------------------------------------------- /src/hanoidb_nursery.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_nursery). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -export([new/4, recover/5, finish/2, lookup/2, add/4, add/5]). 29 | -export([do_level_fold/3, set_max_level/2, transact/3, destroy/1]). 30 | 31 | -include("include/hanoidb.hrl"). 32 | -include("hanoidb.hrl"). 33 | -include_lib("kernel/include/file.hrl"). 34 | 35 | -spec new(string(), integer(), integer(), [_]) -> {ok, #nursery{}} | {error, term()}. 36 | 37 | -define(LOGFILENAME(Dir), filename:join(Dir, "nursery.log")). 38 | 39 | %% do incremental merge every this many inserts 40 | %% this value *must* be less than or equal to 41 | %% 2^TOP_LEVEL == ?BTREE_SIZE(?TOP_LEVEL) 42 | -define(INC_MERGE_STEP, ?BTREE_SIZE(MinLevel) div 2). 43 | 44 | new(Directory, MinLevel, MaxLevel, Config) -> 45 | hanoidb_util:ensure_expiry(Config), 46 | 47 | {ok, File} = file:open(?LOGFILENAME(Directory), 48 | [raw, exclusive, write, delayed_write, append]), 49 | {ok, #nursery{ log_file=File, dir=Directory, cache= gb_trees:empty(), 50 | min_level=MinLevel, max_level=MaxLevel, config=Config }}. 51 | 52 | 53 | recover(Directory, TopLevel, MinLevel, MaxLevel, Config) 54 | when MinLevel =< MaxLevel, is_integer(MinLevel), is_integer(MaxLevel) -> 55 | hanoidb_util:ensure_expiry(Config), 56 | case file:read_file_info(?LOGFILENAME(Directory)) of 57 | {ok, _} -> 58 | ok = do_recover(Directory, TopLevel, MinLevel, MaxLevel, Config), 59 | new(Directory, MinLevel, MaxLevel, Config); 60 | {error, enoent} -> 61 | new(Directory, MinLevel, MaxLevel, Config) 62 | end. 63 | 64 | do_recover(Directory, TopLevel, MinLevel, MaxLevel, Config) -> 65 | %% repair the log file; storing it in nursery2 66 | LogFileName = ?LOGFILENAME(Directory), 67 | {ok, Nursery} = read_nursery_from_log(Directory, MinLevel, MaxLevel, Config), 68 | ok = finish(Nursery, TopLevel), 69 | %% assert log file is gone 70 | {error, enoent} = file:read_file_info(LogFileName), 71 | ok. 72 | 73 | fill_cache({Key, Value}, Cache) 74 | when is_binary(Value); Value =:= ?TOMBSTONE -> 75 | gb_trees:enter(Key, Value, Cache); 76 | fill_cache({Key, {Value, _TStamp}=Entry}, Cache) 77 | when is_binary(Value); Value =:= ?TOMBSTONE -> 78 | gb_trees:enter(Key, Entry, Cache); 79 | fill_cache([], Cache) -> 80 | Cache; 81 | fill_cache(Transactions, Cache) 82 | when is_list(Transactions) -> 83 | lists:foldl(fun fill_cache/2, Cache, Transactions). 84 | 85 | read_nursery_from_log(Directory, MinLevel, MaxLevel, Config) -> 86 | {ok, LogBinary} = file:read_file(?LOGFILENAME(Directory)), 87 | Cache = 88 | case hanoidb_util:decode_crc_data(LogBinary, [], []) of 89 | {ok, KVs} -> 90 | fill_cache(KVs, gb_trees:empty()); 91 | {partial, KVs, _ErrorData} -> 92 | error_logger:info_msg("ignoring undecypherable bytes in ~p~n", [?LOGFILENAME(Directory)]), 93 | fill_cache(KVs, gb_trees:empty()) 94 | end, 95 | {ok, #nursery{ dir=Directory, cache=Cache, count=gb_trees:size(Cache), min_level=MinLevel, max_level=MaxLevel, config=Config }}. 96 | 97 | %% @doc Add a Key/Value to the nursery 98 | %% @end 99 | -spec do_add(#nursery{}, binary(), binary()|?TOMBSTONE, non_neg_integer() | infinity, pid()) -> {ok, #nursery{}} | {full, #nursery{}}. 100 | do_add(Nursery, Key, Value, infinity, Top) -> 101 | do_add(Nursery, Key, Value, 0, Top); 102 | do_add(Nursery=#nursery{log_file=File, cache=Cache, total_size=TotalSize, count=Count, config=Config}, Key, Value, KeyExpiryTime, Top) -> 103 | DatabaseExpiryTime = hanoidb:get_opt(expiry_secs, Config), 104 | 105 | {Data, Cache2} = 106 | if (KeyExpiryTime + DatabaseExpiryTime) == 0 -> 107 | %% Both the database expiry and this key's expiry are unset or set to 0 108 | %% (aka infinity) so never automatically expire the value. 109 | { hanoidb_util:crc_encapsulate_kv_entry(Key, Value), 110 | gb_trees:enter(Key, Value, Cache) }; 111 | true -> 112 | Expiry = 113 | if DatabaseExpiryTime == 0 -> 114 | %% It was the database's setting that was 0 so expire this 115 | %% value after KeyExpiryTime seconds elapse. 116 | hanoidb_util:expiry_time(KeyExpiryTime); 117 | true -> 118 | if KeyExpiryTime == 0 -> 119 | hanoidb_util:expiry_time(DatabaseExpiryTime); 120 | true -> 121 | hanoidb_util:expiry_time(min(KeyExpiryTime, DatabaseExpiryTime)) 122 | end 123 | end, 124 | { hanoidb_util:crc_encapsulate_kv_entry(Key, {Value, Expiry}), 125 | gb_trees:enter(Key, {Value, Expiry}, Cache) } 126 | end, 127 | 128 | ok = file:write(File, Data), 129 | Nursery1 = do_sync(File, Nursery), 130 | {ok, Nursery2} = do_inc_merge(Nursery1#nursery{ cache=Cache2, 131 | total_size=TotalSize + erlang:iolist_size(Data), 132 | count=Count + 1 }, 1, Top), 133 | case has_room(Nursery2, 1) of 134 | true -> 135 | {ok, Nursery2}; 136 | false -> 137 | {full, Nursery2} 138 | end. 139 | 140 | do_sync(File, Nursery) -> 141 | LastSync = 142 | case application:get_env(hanoidb, sync_strategy) of 143 | {ok, sync} -> 144 | file:datasync(File), 145 | os:timestamp(); 146 | {ok, {seconds, N}} -> 147 | MicrosSinceLastSync = timer:now_diff(os:timestamp(), Nursery#nursery.last_sync), 148 | if (MicrosSinceLastSync div 1000000) >= N -> 149 | file:datasync(File), 150 | os:timestamp(); 151 | true -> 152 | Nursery#nursery.last_sync 153 | end; 154 | _ -> 155 | Nursery#nursery.last_sync 156 | end, 157 | Nursery#nursery{last_sync = LastSync}. 158 | 159 | 160 | lookup(Key, #nursery{cache=Cache}) -> 161 | case gb_trees:lookup(Key, Cache) of 162 | {value, {Value, TStamp}} -> 163 | case hanoidb_util:has_expired(TStamp) of 164 | true -> 165 | {value, ?TOMBSTONE}; 166 | false -> 167 | {value, Value} 168 | end; 169 | Reply -> 170 | Reply 171 | end. 172 | 173 | %% @doc 174 | %% Finish this nursery (encode it to a btree, and delete the nursery file) 175 | %% @end 176 | -spec finish(Nursery::#nursery{}, TopLevel::pid()) -> ok. 177 | finish(#nursery{ dir=Dir, cache=Cache, log_file=LogFile, merge_done=DoneMerge, 178 | count=Count, config=Config, min_level=MinLevel }, TopLevel) -> 179 | 180 | hanoidb_util:ensure_expiry(Config), 181 | 182 | %% First, close the log file (if it is open) 183 | case LogFile of 184 | undefined -> ok; 185 | _ -> ok = file:close(LogFile) 186 | end, 187 | 188 | case Count of 189 | N when N > 0 -> 190 | %% next, flush cache to a new BTree 191 | BTreeFileName = filename:join(Dir, "nursery.data"), 192 | {ok, BT} = hanoidb_writer:open(BTreeFileName, [{size, ?BTREE_SIZE(MinLevel)}, 193 | {compress, none} | Config]), 194 | try 195 | ok = gb_trees_ext:fold(fun(Key, Value, Acc) -> 196 | ok = hanoidb_writer:add(BT, Key, Value), 197 | Acc 198 | end, ok, Cache) 199 | after 200 | ok = hanoidb_writer:close(BT) 201 | end, 202 | 203 | %% Inject the B-Tree (blocking RPC) 204 | ok = hanoidb_level:inject(TopLevel, BTreeFileName), 205 | 206 | %% Issue some work if this is a top-level inject (blocks until previous such 207 | %% incremental merge is finished). 208 | if DoneMerge >= ?BTREE_SIZE(MinLevel) -> 209 | ok; 210 | true -> 211 | hanoidb_level:begin_incremental_merge(TopLevel, ?BTREE_SIZE(MinLevel) - DoneMerge) 212 | end; 213 | % {ok, _Nursery2} = do_inc_merge(Nursery, Count, TopLevel); 214 | 215 | _ -> 216 | ok 217 | end, 218 | 219 | %% then, delete the log file 220 | LogFileName = filename:join(Dir, "nursery.log"), 221 | file:delete(LogFileName), 222 | ok. 223 | 224 | destroy(#nursery{ dir=Dir, log_file=LogFile }) -> 225 | %% first, close the log file 226 | if LogFile /= undefined -> 227 | ok = file:close(LogFile); 228 | true -> 229 | ok 230 | end, 231 | %% then delete it 232 | LogFileName = filename:join(Dir, "nursery.log"), 233 | file:delete(LogFileName), 234 | ok. 235 | 236 | -spec add(key(), value(), #nursery{}, pid()) -> {ok, #nursery{}}. 237 | add(Key, Value, Nursery, Top) -> 238 | add(Key, Value, infinity, Nursery, Top). 239 | 240 | -spec add(key(), value(), expiry(), #nursery{}, pid()) -> {ok, #nursery{}}. 241 | add(Key, Value, Expiry, Nursery, Top) -> 242 | case do_add(Nursery, Key, Value, Expiry, Top) of 243 | {ok, Nursery0} -> 244 | {ok, Nursery0}; 245 | {full, Nursery0} -> 246 | flush(Nursery0, Top) 247 | end. 248 | 249 | -spec flush(#nursery{}, pid()) -> {ok, #nursery{}}. 250 | flush(Nursery=#nursery{ dir=Dir, min_level=MinLevel, max_level=MaxLevel, config=Config }, Top) -> 251 | ok = finish(Nursery, Top), 252 | {error, enoent} = file:read_file_info(filename:join(Dir, "nursery.log")), 253 | hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Config). 254 | 255 | has_room(#nursery{ count=Count, min_level=MinLevel }, N) -> 256 | (Count + N + 1) < ?BTREE_SIZE(MinLevel). 257 | 258 | ensure_space(Nursery, NeededRoom, Top) -> 259 | case has_room(Nursery, NeededRoom) of 260 | true -> 261 | Nursery; 262 | false -> 263 | {ok, Nursery1} = flush(Nursery, Top), 264 | Nursery1 265 | end. 266 | 267 | transact(Spec, Nursery, Top) -> 268 | transact1(Spec, ensure_space(Nursery, length(Spec), Top), Top). 269 | 270 | transact1(Spec, Nursery1=#nursery{ log_file=File, cache=Cache0, total_size=TotalSize, config=Config }, Top) -> 271 | Expiry = 272 | case hanoidb:get_opt(expiry_secs, Config) of 273 | 0 -> 274 | infinity; 275 | DatabaseExpiryTime -> 276 | hanoidb_util:expiry_time(DatabaseExpiryTime) 277 | end, 278 | 279 | Data = hanoidb_util:crc_encapsulate_transaction(Spec, Expiry), 280 | ok = file:write(File, Data), 281 | 282 | Nursery2 = do_sync(File, Nursery1), 283 | 284 | Cache2 = lists:foldl(fun({put, Key, Value}, Cache) -> 285 | case Expiry of 286 | infinity -> 287 | gb_trees:enter(Key, Value, Cache); 288 | _ -> 289 | gb_trees:enter(Key, {Value, Expiry}, Cache) 290 | end; 291 | ({delete, Key}, Cache) -> 292 | case Expiry of 293 | infinity -> 294 | gb_trees:enter(Key, ?TOMBSTONE, Cache); 295 | _ -> 296 | gb_trees:enter(Key, {?TOMBSTONE, Expiry}, Cache) 297 | end 298 | end, 299 | Cache0, 300 | Spec), 301 | 302 | Count = gb_trees:size(Cache2), 303 | 304 | do_inc_merge(Nursery2#nursery{ cache=Cache2, total_size=TotalSize+erlang:iolist_size(Data), count=Count }, length(Spec), Top). 305 | 306 | do_inc_merge(Nursery=#nursery{ step=Step, merge_done=Done, min_level=MinLevel }, N, TopLevel) -> 307 | if Step+N >= ?INC_MERGE_STEP -> 308 | hanoidb_level:begin_incremental_merge(TopLevel, Step + N), 309 | {ok, Nursery#nursery{ step=0, merge_done=Done + Step + N }}; 310 | true -> 311 | {ok, Nursery#nursery{ step=Step + N }} 312 | end. 313 | 314 | do_level_fold(#nursery{cache=Cache}, FoldWorkerPID, KeyRange) -> 315 | Ref = erlang:make_ref(), 316 | FoldWorkerPID ! {prefix, [Ref]}, 317 | case gb_trees_ext:fold( 318 | fun(_, _, {LastKey, limit}) -> 319 | {LastKey, limit}; 320 | (Key, Value, {LastKey, Count}) -> 321 | case ?KEY_IN_RANGE(Key, KeyRange) andalso (not is_expired(Value)) of 322 | true -> 323 | BinOrTombstone = get_value(Value), 324 | FoldWorkerPID ! {level_result, Ref, Key, BinOrTombstone}, 325 | case BinOrTombstone of 326 | ?TOMBSTONE -> 327 | {Key, Count}; 328 | _ -> 329 | {Key, decrement(Count)} 330 | end; 331 | false -> 332 | {LastKey, Count} 333 | end 334 | end, 335 | {undefined, KeyRange#key_range.limit}, 336 | Cache) 337 | of 338 | {LastKey, limit} when LastKey =/= undefined -> 339 | FoldWorkerPID ! {level_limit, Ref, LastKey}; 340 | _ -> 341 | FoldWorkerPID ! {level_done, Ref} 342 | end, 343 | ok. 344 | 345 | set_max_level(Nursery = #nursery{}, MaxLevel) -> 346 | Nursery#nursery{ max_level = MaxLevel }. 347 | 348 | decrement(undefined) -> 349 | undefined; 350 | decrement(1) -> 351 | limit; 352 | decrement(Number) -> 353 | Number-1. 354 | 355 | %%% 356 | 357 | % TODO this is duplicate code also found in hanoidb_reader 358 | is_expired(?TOMBSTONE) -> 359 | false; 360 | is_expired({_Value, TStamp}) -> 361 | hanoidb_util:has_expired(TStamp); 362 | is_expired(Bin) when is_binary(Bin) -> 363 | false. 364 | 365 | get_value({Value, TStamp}) when is_integer(TStamp); TStamp =:= infinity -> 366 | Value; 367 | get_value(Value) when Value =:= ?TOMBSTONE; is_binary(Value) -> 368 | Value. 369 | 370 | -------------------------------------------------------------------------------- /src/hanoidb_reader.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_reader). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -include_lib("kernel/include/file.hrl"). 29 | -include("include/hanoidb.hrl"). 30 | -include("hanoidb.hrl"). 31 | -include("include/plain_rpc.hrl"). 32 | 33 | -define(ASSERT_WHEN(X), when X). 34 | 35 | -export([open/1, open/2,close/1,lookup/2,fold/3,range_fold/4, destroy/1]). 36 | -export([first_node/1,next_node/1]). 37 | -export([serialize/1, deserialize/1]). 38 | 39 | -record(node, {level :: non_neg_integer(), 40 | members=[] :: list(any()) | binary() }). 41 | 42 | -record(index, {file :: file:io_device(), 43 | root= none :: #node{} | none, 44 | bloom :: term(), 45 | name :: string(), 46 | config=[] :: term() }). 47 | 48 | -type read_file() :: #index{}. 49 | -export_type([read_file/0]). 50 | 51 | -spec open(Name::string()) -> {ok, read_file()} | {error, any()}. 52 | open(Name) -> 53 | open(Name, [random]). 54 | 55 | -type config() :: [sequential | folding | random | {atom(), term()}]. 56 | -spec open(Name::string(), config()) -> {ok, read_file()} | {error, any()}. 57 | open(Name, Config) -> 58 | case proplists:get_bool(sequential, Config) of 59 | true -> 60 | ReadBufferSize = hanoidb:get_opt(read_buffer_size, Config, 512 * 1024), 61 | case file:open(Name, [raw,read,{read_ahead, ReadBufferSize},binary]) of 62 | {ok, File} -> 63 | {ok, #index{file=File, name=Name, config=Config}}; 64 | {error, _}=Err -> 65 | Err 66 | end; 67 | 68 | false -> 69 | {ok, File} = 70 | case proplists:get_bool(folding, Config) of 71 | true -> 72 | ReadBufferSize = hanoidb:get_opt(read_buffer_size, Config, 512 * 1024), 73 | file:open(Name, [read, {read_ahead, ReadBufferSize}, binary]); 74 | false -> 75 | file:open(Name, [read, binary]) 76 | end, 77 | 78 | {ok, FileInfo} = file:read_file_info(Name), 79 | 80 | %% read and validate magic tag 81 | {ok, ?FILE_FORMAT} = file:pread(File, 0, byte_size(?FILE_FORMAT)), 82 | 83 | %% read root position 84 | {ok, <>} = file:pread(File, FileInfo#file_info.size - 8, 8), 85 | {ok, <>} = file:pread(File, FileInfo#file_info.size - 12, 4), 86 | {ok, BloomData} = file:pread(File, (FileInfo#file_info.size - 12 - BloomSize), BloomSize), 87 | {ok, Bloom} = hanoidb_util:bin_to_bloom(BloomData), 88 | 89 | %% read in the root node 90 | Root = 91 | case read_node(File, RootPos) of 92 | {ok, Node} -> 93 | Node; 94 | eof -> 95 | none 96 | end, 97 | 98 | {ok, #index{file=File, root=Root, bloom=Bloom, name=Name, config=Config}} 99 | end. 100 | 101 | destroy(#index{file=File, name=Name}) -> 102 | ok = file:close(File), 103 | file:delete(Name). 104 | 105 | serialize(#index{file=File, bloom=undefined }=Index) -> 106 | {ok, Position} = file:position(File, cur), 107 | ok = file:close(File), 108 | {seq_read_file, Index, Position}. 109 | 110 | deserialize({seq_read_file, Index, Position}) -> 111 | {ok, #index{file=File}=Index2} = open(Index#index.name, Index#index.config), 112 | {ok, Position} = file:position(File, {bof, Position}), 113 | Index2. 114 | 115 | 116 | 117 | 118 | fold(Fun, Acc0, #index{file=File}) -> 119 | {ok, Node} = read_node(File,?FIRST_BLOCK_POS), 120 | fold0(File,fun({K,V},Acc) -> Fun(K,V,Acc) end,Node,Acc0). 121 | 122 | fold0(File,Fun,#node{level=0, members=BinPage},Acc0) when is_binary(BinPage) -> 123 | Acc1 = vbisect:foldl(fun(K, V, Acc2) -> Fun({K, decode_binary_value(V)}, Acc2) end,Acc0,BinPage), 124 | fold1(File,Fun,Acc1); 125 | fold0(File,Fun,#node{level=0, members=List},Acc0) when is_list(List) -> 126 | Acc1 = lists:foldl(Fun,Acc0,List), 127 | fold1(File,Fun,Acc1); 128 | fold0(File,Fun,_InnerNode,Acc0) -> 129 | fold1(File,Fun,Acc0). 130 | 131 | fold1(File,Fun,Acc0) -> 132 | case next_leaf_node(File) of 133 | eof -> 134 | Acc0; 135 | {ok, Node} -> 136 | fold0(File,Fun,Node,Acc0) 137 | end. 138 | 139 | -spec range_fold(fun((binary(),binary(),any()) -> any()), any(), #index{}, #key_range{}) -> 140 | {limit, any(), binary()} | {done, any()}. 141 | range_fold(Fun, Acc0, #index{file=File,root=Root}, Range) -> 142 | case Range#key_range.from_key =< first_key(Root) of 143 | true -> 144 | {ok, _} = file:position(File, ?FIRST_BLOCK_POS), 145 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit); 146 | false -> 147 | case find_leaf_node(File,Range#key_range.from_key,Root,?FIRST_BLOCK_POS) of 148 | {ok, {Pos,_}} -> 149 | {ok, _} = file:position(File, Pos), 150 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit); 151 | {ok, Pos} -> 152 | {ok, _} = file:position(File, Pos), 153 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit); 154 | none -> 155 | {done, Acc0} 156 | end 157 | end. 158 | 159 | first_key(#node{members=Dict}) -> 160 | {_,FirstKey} = fold_until_stop(fun({K,_},_) -> {stop, K} end, none, Dict), 161 | FirstKey. 162 | 163 | fold_until_stop(Fun,Acc,List) when is_list(List) -> 164 | fold_until_stop2(Fun, {continue, Acc}, List); 165 | fold_until_stop(Fun,Acc0,Bin) when is_binary(Bin) -> 166 | vbisect:fold_until_stop(fun({Key,VBin},Acc1) -> 167 | % io:format("-> DOING ~p,~p~n", [Key,Acc1]), 168 | Fun({Key, decode_binary_value(VBin)}, Acc1) 169 | end, 170 | Acc0, 171 | Bin). 172 | 173 | fold_until_stop2(_Fun,{stop,Result},_) -> 174 | {stopped, Result}; 175 | fold_until_stop2(_Fun,{continue, Acc},[]) -> 176 | {ok, Acc}; 177 | fold_until_stop2(Fun,{continue, Acc},[H|T]) -> 178 | fold_until_stop2(Fun,Fun(H,Acc),T). 179 | 180 | % TODO this is duplicate code also found in hanoidb_nursery 181 | is_expired(?TOMBSTONE) -> 182 | false; 183 | is_expired({_Value, TStamp}) -> 184 | hanoidb_util:has_expired(TStamp); 185 | is_expired(Bin) when is_binary(Bin) -> 186 | false. 187 | 188 | get_value({Value, _TStamp}) -> 189 | Value; 190 | get_value(Value) -> 191 | Value. 192 | 193 | range_fold_from_here(Fun, Acc0, File, Range, undefined) -> 194 | % io:format("RANGE_FOLD_FROM_HERE(~p,~p)~n", [Acc0,File]), 195 | case next_leaf_node(File) of 196 | eof -> 197 | {done, Acc0}; 198 | 199 | {ok, #node{members=Members}} -> 200 | case fold_until_stop(fun({Key,_}, Acc) when not ?KEY_IN_TO_RANGE(Key,Range) -> 201 | {stop, {done, Acc}}; 202 | ({Key,Value}, Acc) when ?KEY_IN_FROM_RANGE(Key, Range) -> 203 | case is_expired(Value) of 204 | true -> 205 | {continue, Acc}; 206 | false -> 207 | {continue, Fun(Key, get_value(Value), Acc)} 208 | end; 209 | (_Huh, Acc) -> 210 | % io:format("SKIPPING ~p~n", [_Huh]), 211 | {continue, Acc} 212 | end, 213 | Acc0, 214 | Members) of 215 | {stopped, Result} -> Result; 216 | {ok, Acc1} -> 217 | range_fold_from_here(Fun, Acc1, File, Range, undefined) 218 | end 219 | end; 220 | 221 | range_fold_from_here(Fun, Acc0, File, Range, N0) -> 222 | case next_leaf_node(File) of 223 | eof -> 224 | {done, Acc0}; 225 | 226 | {ok, #node{members=Members}} -> 227 | case fold_until_stop(fun({Key,_}, {0,Acc}) -> 228 | {stop, {limit, Acc, Key}}; 229 | ({Key,_}, {_,Acc}) when not ?KEY_IN_TO_RANGE(Key,Range)-> 230 | {stop, {done, Acc}}; 231 | ({Key,?TOMBSTONE}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) -> 232 | {continue, {N1, Fun(Key, ?TOMBSTONE, Acc)}}; 233 | ({Key,{?TOMBSTONE,TStamp}}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) -> 234 | case hanoidb_util:has_expired(TStamp) of 235 | true -> 236 | {continue, {N1,Acc}}; 237 | false -> 238 | {continue, {N1, Fun(Key, ?TOMBSTONE, Acc)}} 239 | end; 240 | ({Key,Value}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) -> 241 | case is_expired(Value) of 242 | true -> 243 | {continue, {N1,Acc}}; 244 | false -> 245 | {continue, {N1-1, Fun(Key, get_value(Value), Acc)}} 246 | end; 247 | (_, Acc) -> 248 | {continue, Acc} 249 | end, 250 | {N0, Acc0}, 251 | Members) 252 | of 253 | {stopped, Result} -> 254 | Result; 255 | {ok, {N2, Acc1}} -> 256 | range_fold_from_here(Fun, Acc1, File, Range, N2) 257 | end 258 | end. 259 | 260 | find_leaf_node(_File,_FromKey,#node{level=0},Pos) -> 261 | {ok, Pos}; 262 | find_leaf_node(File,FromKey,#node{members=Members,level=N},_) when is_list(Members) -> 263 | case find_start(FromKey, Members) of 264 | {ok, ChildPos} -> 265 | recursive_find(File, FromKey, N, ChildPos); 266 | not_found -> 267 | none 268 | end; 269 | find_leaf_node(File,FromKey,#node{members=Members,level=N},_) when is_binary(Members) -> 270 | case vbisect:find_geq(FromKey,Members) of 271 | {ok, _, <>} -> 272 | % io:format("** FIND_LEAF_NODE(~p,~p) -> {~p,~p}~n", [FromKey, N, Pos,Len]), 273 | recursive_find(File, FromKey, N, {Pos,Len}); 274 | none -> 275 | % io:format("** FIND_LEAF_NODE(~p,~p) -> none~n", [FromKey, N]), 276 | none 277 | end; 278 | find_leaf_node(_,_,none,_) -> 279 | none. 280 | 281 | recursive_find(_File,_FromKey,1,ChildPos) -> 282 | {ok, ChildPos}; 283 | recursive_find(File,FromKey,N,ChildPos) when N>1 -> 284 | case read_node(File,ChildPos) of 285 | {ok, ChildNode} -> 286 | find_leaf_node(File, FromKey,ChildNode,ChildPos); 287 | eof -> 288 | none 289 | end. 290 | 291 | 292 | %% used by the merger, needs list value 293 | first_node(#index{file=File}) -> 294 | case read_node(File, ?FIRST_BLOCK_POS) of 295 | {ok, #node{level=0, members=Members}} -> 296 | {kvlist, decode_member_list(Members)}; 297 | eof-> 298 | none 299 | end. 300 | 301 | %% used by the merger, needs list value 302 | next_node(#index{file=File}=_Index) -> 303 | case next_leaf_node(File) of 304 | {ok, #node{level=0, members=Members}} -> 305 | {kvlist, decode_member_list(Members)}; 306 | eof -> 307 | end_of_data 308 | end. 309 | 310 | decode_member_list(List) when is_list(List) -> 311 | List; 312 | decode_member_list(BinDict) when is_binary(BinDict) -> 313 | vbisect:foldr( fun(Key,Value,Acc) -> 314 | [{Key, decode_binary_value(Value) }|Acc] 315 | end, 316 | [], 317 | BinDict). 318 | 319 | close(#index{file=undefined}) -> 320 | ok; 321 | close(#index{file=File}) -> 322 | file:close(File). 323 | 324 | 325 | lookup(#index{file=File, root=Node, bloom=Bloom}, Key) -> 326 | case ?BLOOM_CONTAINS(Bloom, Key) of 327 | true -> 328 | case lookup_in_node(File, Node, Key) of 329 | not_found -> 330 | not_found; 331 | {ok, {Value, TStamp}} ?ASSERT_WHEN(Value =:= ?TOMBSTONE; is_binary(Value)) -> 332 | case hanoidb_util:has_expired(TStamp) of 333 | true -> not_found; 334 | false -> {ok, Value} 335 | end; 336 | {ok, Value}=Reply ?ASSERT_WHEN(Value =:= ?TOMBSTONE; is_binary(Value)) -> 337 | Reply 338 | end; 339 | false -> 340 | not_found 341 | end. 342 | 343 | lookup_in_node(_File,#node{level=0,members=Members}, Key) -> 344 | find_in_leaf(Key,Members); 345 | 346 | lookup_in_node(File,#node{members=Members},Key) when is_binary(Members) -> 347 | case vbisect:find_geq(Key,Members) of 348 | {ok, _Key, <>} -> 349 | % io:format("FOUND ~p @ ~p~n", [_Key, {Pos,Size}]), 350 | case read_node(File,{Pos,Size}) of 351 | {ok, Node} -> 352 | lookup_in_node(File, Node, Key); 353 | eof -> 354 | not_found 355 | end; 356 | none -> 357 | not_found 358 | end; 359 | 360 | lookup_in_node(File,#node{members=Members},Key) -> 361 | case find_1(Key, Members) of 362 | {ok, {Pos,Size}} -> 363 | %% do this in separate process, to avoid having to 364 | %% garbage collect all the inner node junk 365 | PID = proc_lib:spawn_link(fun() -> 366 | receive 367 | ?CALL(From,read) -> 368 | case read_node(File, {Pos,Size}) of 369 | {ok, Node} -> 370 | Result = lookup_in_node2(File, Node, Key), 371 | plain_rpc:send_reply(From, Result); 372 | eof -> 373 | plain_rpc:send_reply(From, {error, eof}) 374 | end 375 | end 376 | end), 377 | try plain_rpc:call(PID, read) 378 | catch 379 | Class:Ex -> 380 | error_logger:error_msg("crashX: ~p:~p ~p~n", [Class,Ex,erlang:get_stacktrace()]), 381 | not_found 382 | end; 383 | 384 | not_found -> 385 | not_found 386 | end. 387 | 388 | 389 | lookup_in_node2(_File,#node{level=0,members=Members},Key) -> 390 | case lists:keyfind(Key,1,Members) of 391 | false -> 392 | not_found; 393 | {_,Value} -> 394 | {ok, Value} 395 | end; 396 | 397 | lookup_in_node2(File,#node{members=Members},Key) -> 398 | case find_1(Key, Members) of 399 | {ok, {Pos,Size}} -> 400 | case read_node(File, {Pos,Size}) of 401 | {ok, Node} -> 402 | lookup_in_node2(File, Node, Key); 403 | eof -> 404 | {error, eof} 405 | end; 406 | not_found -> 407 | not_found 408 | end. 409 | 410 | 411 | find_1(K, [{K1,V},{K2,_}|_]) when K >= K1, K < K2 -> 412 | {ok, V}; 413 | find_1(K, [{K1,V}]) when K >= K1 -> 414 | {ok, V}; 415 | find_1(K, [_|T]) -> 416 | find_1(K,T); 417 | find_1(_, _) -> 418 | not_found. 419 | 420 | 421 | find_start(K, [{_,V},{K2,_}|_]) when K < K2 -> 422 | {ok, V}; 423 | find_start(_, [{_,{_,_}=V}]) -> 424 | {ok, V}; 425 | find_start(K, KVs) -> 426 | find_1(K, KVs). 427 | 428 | 429 | -spec read_node(file:io_device(), non_neg_integer() | { non_neg_integer(), non_neg_integer() }) -> 430 | {ok, #node{}} | eof. 431 | 432 | read_node(File, {Pos, Size}) -> 433 | % error_logger:info_msg("read_node ~p ~p ~p~n", [File, Pos, Size]), 434 | {ok, <<_:32/unsigned, Level:16/unsigned, Data/binary>>} = file:pread(File, Pos, Size), 435 | hanoidb_util:decode_index_node(Level, Data); 436 | 437 | read_node(File, Pos) -> 438 | % error_logger:info_msg("read_node ~p ~p~n", [File, Pos]), 439 | {ok, Pos} = file:position(File, Pos), 440 | Result = read_node(File), 441 | % error_logger:info_msg("decoded ~p ~p~n", [Pos, Result]), 442 | Result. 443 | 444 | read_node(File) -> 445 | % error_logger:info_msg("read_node ~p~n", [File]), 446 | {ok, <>} = file:read(File, 6), 447 | % error_logger:info_msg("decoded ~p ~p~n", [Len, Level]), 448 | case Len of 449 | 0 -> 450 | eof; 451 | _ -> 452 | {ok, Data} = file:read(File, Len-2), 453 | hanoidb_util:decode_index_node(Level, Data) 454 | end. 455 | 456 | 457 | next_leaf_node(File) -> 458 | case file:read(File, 6) of 459 | eof -> 460 | %% premature end-of-file 461 | eof; 462 | {ok, <<0:32/unsigned, _:16/unsigned>>} -> 463 | eof; 464 | {ok, <>} -> 465 | {ok, Data} = file:read(File, Len-2), 466 | hanoidb_util:decode_index_node(0, Data); 467 | {ok, <>} -> 468 | {ok, _} = file:position(File, {cur,Len-2}), 469 | next_leaf_node(File) 470 | end. 471 | 472 | 473 | find_in_leaf(Key,Bin) when is_binary(Bin) -> 474 | case vbisect:find(Key,Bin) of 475 | {ok, BinValue} -> 476 | {ok, decode_binary_value(BinValue)}; 477 | error -> 478 | not_found 479 | end; 480 | find_in_leaf(Key,List) when is_list(List) -> 481 | case lists:keyfind(Key, 1, List) of 482 | {_, Value} -> 483 | {ok, Value}; 484 | false -> 485 | not_found 486 | end. 487 | 488 | decode_binary_value(<>) -> 489 | Value; 490 | decode_binary_value(<>) -> 491 | {Value, TStamp}; 492 | decode_binary_value(<>) -> 493 | ?TOMBSTONE; 494 | decode_binary_value(<>) -> 495 | {?TOMBSTONE, TStamp}; 496 | decode_binary_value(<>) -> 497 | {Pos, Len}. 498 | -------------------------------------------------------------------------------- /src/hanoidb_sparse_bitmap.erl: -------------------------------------------------------------------------------- 1 | -module(hanoidb_sparse_bitmap). 2 | -export([new/1, set/2, member/2]). 3 | 4 | -define(REPR_NAME, sparse_bitmap). 5 | 6 | new(Bits) when is_integer(Bits), Bits>0 -> 7 | {?REPR_NAME, Bits, []}. 8 | 9 | set(N, {?REPR_NAME, Bits, Tree}) -> 10 | {?REPR_NAME, Bits, set_to_tree(N, 1 bsl (Bits-1), Tree)}. 11 | 12 | set_to_tree(N, HighestBit, Mask) when HighestBit<32 -> 13 | Nbit = 1 bsl N, 14 | case Mask of 15 | []-> Nbit; 16 | _ -> Nbit bor Mask 17 | end; 18 | set_to_tree(N, _HighestBit, []) -> N; 19 | set_to_tree(N, HighestBit, [TLo|THi]) -> 20 | pushdown(N, HighestBit, TLo, THi); 21 | set_to_tree(N, _HighestBit, N) -> N; 22 | set_to_tree(N, HighestBit, M) when is_integer(M) -> 23 | set_to_tree(N, HighestBit, pushdown(M, HighestBit, [], [])). 24 | 25 | pushdown(N, HighestBit, TLo, THi) -> 26 | NHigh = N band HighestBit, 27 | if NHigh =:= 0 -> [set_to_tree(N, HighestBit bsr 1, TLo) | THi]; 28 | true -> [TLo | set_to_tree(N bxor NHigh, HighestBit bsr 1, THi)] 29 | end. 30 | 31 | member(N, {?REPR_NAME, Bits, Tree}) -> 32 | member_in_tree(N, 1 bsl (Bits-1), Tree). 33 | 34 | member_in_tree(_N, _HighestBit, []) -> false; 35 | member_in_tree(N, HighestBit, Mask) when HighestBit<32 -> 36 | Nbit = 1 bsl N, 37 | Nbit band Mask > 0; 38 | member_in_tree(N, _HighestBit, M) when is_integer(M) -> N =:= M; 39 | member_in_tree(N, HighestBit, [TLo|THi]) -> 40 | NHigh = N band HighestBit, 41 | if NHigh =:= 0 -> member_in_tree(N, HighestBit bsr 1, TLo); 42 | true -> member_in_tree(N bxor NHigh, HighestBit bsr 1, THi) 43 | end. 44 | -------------------------------------------------------------------------------- /src/hanoidb_sup.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_sup). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -behaviour(supervisor). 29 | 30 | %% API 31 | -export([start_link/0]). 32 | 33 | %% Supervisor callbacks 34 | -export([init/1]). 35 | 36 | %% Helper macro for declaring children of supervisor 37 | -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}). 38 | 39 | %% =================================================================== 40 | %% API functions 41 | %% =================================================================== 42 | 43 | start_link() -> 44 | supervisor:start_link({local, ?MODULE}, ?MODULE, []). 45 | 46 | %% =================================================================== 47 | %% Supervisor callbacks 48 | %% =================================================================== 49 | 50 | init([]) -> 51 | {ok, { {one_for_one, 5, 10}, []} }. 52 | 53 | -------------------------------------------------------------------------------- /src/hanoidb_util.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_util). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -export([ compress/2 29 | , uncompress/1 30 | , index_file_name/1 31 | , estimate_node_size_increment/3 32 | , encode_index_node/2 33 | , decode_index_node/2 34 | , crc_encapsulate_kv_entry/2 35 | , decode_crc_data/3 36 | , file_exists/1 37 | , crc_encapsulate_transaction/2 38 | , tstamp/0 39 | , expiry_time/1 40 | , has_expired/1 41 | , ensure_expiry/1 42 | 43 | , bloom_type/1 44 | , bloom_new/2 45 | , bloom_to_bin/1 46 | , bin_to_bloom/1 47 | , bin_to_bloom/2 48 | , bloom_insert/2 49 | , bloom_contains/2 50 | ]). 51 | 52 | -include("src/hanoidb.hrl"). 53 | 54 | -define(ERLANG_ENCODED, 131). 55 | -define(CRC_ENCODED, 127). 56 | -define(BISECT_ENCODED, 126). 57 | 58 | 59 | -define(FILE_ENCODING, bisect). 60 | 61 | -compile({inline, [crc_encapsulate/1, crc_encapsulate_kv_entry/2 ]}). 62 | 63 | 64 | -spec index_file_name(string()) -> string(). 65 | index_file_name(Name) -> 66 | Name. 67 | 68 | -spec file_exists(string()) -> boolean(). 69 | file_exists(FileName) -> 70 | case file:read_file_info(FileName) of 71 | {ok, _} -> 72 | true; 73 | {error, enoent} -> 74 | false 75 | end. 76 | 77 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp}) 78 | when is_integer(Value) -> byte_size(Key) + 5 + 4; 79 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp}) 80 | when is_binary(Value) -> byte_size(Key) + 5 + 4 + byte_size(Value); 81 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp}) 82 | when is_atom(Value) -> byte_size(Key) + 8 + 4; 83 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp}) 84 | when is_tuple(Value) -> byte_size(Key) + 13 + 4; 85 | estimate_node_size_increment(_KVList, Key, Value) 86 | when is_integer(Value) -> byte_size(Key) + 5 + 4; 87 | estimate_node_size_increment(_KVList, Key, Value) 88 | when is_binary(Value) -> byte_size(Key) + 5 + 4 + byte_size(Value); 89 | estimate_node_size_increment(_KVList, Key, Value) 90 | when is_atom(Value) -> byte_size(Key) + 8 + 4; 91 | estimate_node_size_increment(_KVList, Key, Value) 92 | when is_tuple(Value) -> byte_size(Key) + 13 + 4. 93 | 94 | -define(NO_COMPRESSION, 0). 95 | -define(SNAPPY_COMPRESSION, 1). 96 | -define(GZIP_COMPRESSION, 2). 97 | -define(LZ4_COMPRESSION, 3). 98 | 99 | use_compressed(UncompressedSize, CompressedSize) when CompressedSize < UncompressedSize -> 100 | true; 101 | use_compressed(_UncompressedSize, _CompressedSize) -> 102 | false. 103 | 104 | compress(snappy, Bin) -> 105 | {ok, CompressedBin} = snappy:compress(Bin), 106 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of 107 | true -> 108 | {?SNAPPY_COMPRESSION, CompressedBin}; 109 | false -> 110 | {?NO_COMPRESSION, Bin} 111 | end; 112 | compress(lz4, Bin) -> 113 | {ok, CompressedBin} = lz4:compress(erlang:iolist_to_binary(Bin)), 114 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of 115 | true -> 116 | {?LZ4_COMPRESSION, CompressedBin}; 117 | false -> 118 | {?NO_COMPRESSION, Bin} 119 | end; 120 | compress(gzip, Bin) -> 121 | CompressedBin = zlib:gzip(Bin), 122 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of 123 | true -> 124 | {?GZIP_COMPRESSION, CompressedBin}; 125 | false -> 126 | {?NO_COMPRESSION, Bin} 127 | end; 128 | compress(none, Bin) -> 129 | {?NO_COMPRESSION, Bin}. 130 | 131 | uncompress(<>) -> 132 | Data; 133 | uncompress(<>) -> 134 | {ok, UncompressedData} = snappy:decompress(Data), 135 | UncompressedData; 136 | uncompress(<>) -> 137 | lz4:uncompress(Data); 138 | uncompress(<>) -> 139 | zlib:gunzip(Data). 140 | 141 | encode_index_node(KVList, Method) -> 142 | TermData = 143 | case ?FILE_ENCODING of 144 | bisect -> 145 | Binary = vbisect:from_orddict(lists:map(fun binary_encode_kv/1, KVList)), 146 | CRC = erlang:crc32(Binary), 147 | [?BISECT_ENCODED, <>, Binary]; 148 | hanoi2 -> 149 | [ ?TAG_END | 150 | lists:map(fun ({Key,Value}) -> 151 | crc_encapsulate_kv_entry(Key, Value) 152 | end, 153 | KVList) ] 154 | end, 155 | {MethodName, OutData} = compress(Method, TermData), 156 | {ok, [MethodName | OutData]}. 157 | 158 | decode_index_node(Level, Data) -> 159 | TermData = uncompress(Data), 160 | case decode_kv_list(TermData) of 161 | {ok, KVList} -> 162 | {ok, {node, Level, KVList}}; 163 | {bisect, Binary} -> 164 | % io:format("[page level=~p~n", [Level]), 165 | % vbisect:foldl(fun(K,V,_) -> io:format(" ~p -> ~p,~n", [K,V]) end, 0, Binary), 166 | % io:format("]~n",[]), 167 | {ok, {node, Level, Binary}} 168 | end. 169 | 170 | 171 | binary_encode_kv({Key, {Value,infinity}}) -> 172 | binary_encode_kv({Key,Value}); 173 | binary_encode_kv({Key, {?TOMBSTONE, TStamp}}) -> 174 | {Key, <>}; 175 | binary_encode_kv({Key, ?TOMBSTONE}) -> 176 | {Key, <>}; 177 | binary_encode_kv({Key, {Value, TStamp}}) when is_binary(Value) -> 178 | {Key, <>}; 179 | binary_encode_kv({Key, Value}) when is_binary(Value)-> 180 | {Key, <>}; 181 | binary_encode_kv({Key, {Pos, Len}}) when Len < 16#ffffffff -> 182 | {Key, <>}. 183 | 184 | 185 | -spec crc_encapsulate_kv_entry(binary(), expvalue()) -> iolist(). 186 | crc_encapsulate_kv_entry(Key, {Value, infinity}) -> 187 | crc_encapsulate_kv_entry(Key, Value); 188 | crc_encapsulate_kv_entry(Key, {?TOMBSTONE, TStamp}) -> % 189 | crc_encapsulate( [?TAG_DELETED2, <> | Key] ); 190 | crc_encapsulate_kv_entry(Key, ?TOMBSTONE) -> 191 | crc_encapsulate( [?TAG_DELETED | Key] ); 192 | crc_encapsulate_kv_entry(Key, {Value, TStamp}) when is_binary(Value) -> 193 | crc_encapsulate( [?TAG_KV_DATA2, <>, Key, Value] ); 194 | crc_encapsulate_kv_entry(Key, Value) when is_binary(Value) -> 195 | crc_encapsulate( [?TAG_KV_DATA, <<(byte_size(Key)):32/unsigned>>, Key, Value] ); 196 | crc_encapsulate_kv_entry(Key, {Pos,Len}) when Len < 16#ffffffff -> 197 | crc_encapsulate( [?TAG_POSLEN32, <>, Key] ). 198 | 199 | -spec crc_encapsulate_transaction( [ txspec() ], expiry() ) -> iolist(). 200 | crc_encapsulate_transaction(TransactionSpec, Expiry) -> 201 | crc_encapsulate([?TAG_TRANSACT | 202 | lists:map(fun({delete, Key}) -> 203 | crc_encapsulate_kv_entry(Key, {?TOMBSTONE, Expiry}); 204 | ({put, Key, Value}) -> 205 | crc_encapsulate_kv_entry(Key, {Value, Expiry}) 206 | end, 207 | TransactionSpec)]). 208 | 209 | -spec crc_encapsulate( iolist() ) -> iolist(). 210 | crc_encapsulate(Blob) -> 211 | CRC = erlang:crc32(Blob), 212 | Size = erlang:iolist_size(Blob), 213 | [<< (Size):32/unsigned, CRC:32/unsigned >>, Blob, ?TAG_END]. 214 | 215 | -spec decode_kv_list( binary() ) -> {ok, [ kventry() ]} | {partial, [kventry()], iolist()}. 216 | decode_kv_list(<>) -> 217 | decode_crc_data(Custom, [], []); 218 | decode_kv_list(<>=TermData) -> 219 | {ok, erlang:term_to_binary(TermData)}; 220 | decode_kv_list(<>) -> 221 | decode_crc_data(Custom, [], []); 222 | decode_kv_list(<>) -> 223 | CRCTest = erlang:crc32( Binary ), 224 | if CRC == CRCTest -> 225 | {bisect, Binary}; 226 | true -> 227 | {bisect, vbisect:from_orddict([])} 228 | end. 229 | 230 | -spec decode_crc_data(binary(), list(), list()) -> {ok, [kventry()]} | {partial, [kventry()], iolist()}. 231 | decode_crc_data(<<>>, [], Acc) -> 232 | {ok, lists:reverse(Acc)}; 233 | decode_crc_data(<<>>, BrokenData, Acc) -> 234 | {partial, lists:reverse(Acc), BrokenData}; 235 | % TODO: we *could* simply return the good parts of the data... 236 | % would that be so wrong? 237 | decode_crc_data(<< BinSize:32/unsigned, CRC:32/unsigned, Bin:BinSize/binary, ?TAG_END, Rest/binary >>, Broken, Acc) -> 238 | CRCTest = erlang:crc32( Bin ), 239 | if CRC == CRCTest -> 240 | decode_crc_data(Rest, Broken, [decode_kv_data(Bin) | Acc]); 241 | true -> 242 | % TODO: chunk is broken, ignore it. Maybe we should tell someone? 243 | decode_crc_data(Rest, [Bin|Broken], Acc) 244 | end; 245 | decode_crc_data(Bad, Broken, Acc) -> 246 | %% If a chunk is broken, try to find the next ?TAG_END and 247 | %% start decoding from there. 248 | {Skipped, MaybeGood} = find_next_value(Bad), 249 | decode_crc_data(MaybeGood, [Skipped|Broken], Acc). 250 | 251 | -spec find_next_value(binary()) -> { binary(), binary() }. 252 | find_next_value(<<>>) -> 253 | {<<>>, <<>>}; 254 | find_next_value(Bin) -> 255 | case binary:match (Bin, <>) of 256 | {Pos, _Len} -> 257 | <> = Bin, 258 | {SkipBin, MaybeGood}; 259 | nomatch -> 260 | {Bin, <<>>} 261 | end. 262 | 263 | -spec decode_kv_data( binary() ) -> kventry(). 264 | decode_kv_data(<>) -> 265 | {Key, Value}; 266 | decode_kv_data(<>) -> 267 | {Key, ?TOMBSTONE}; 268 | decode_kv_data(<>) -> 269 | {Key, {Value, TStamp}}; 270 | decode_kv_data(<>) -> 271 | {Key, {?TOMBSTONE, TStamp}}; 272 | decode_kv_data(<>) -> 273 | {Key, {Pos,Len}}; 274 | decode_kv_data(<>) -> 275 | {ok, TX} = decode_crc_data(Rest, [], []), 276 | TX. 277 | 278 | %% @doc Return number of seconds since 1970 279 | -spec tstamp() -> pos_integer(). 280 | tstamp() -> 281 | {Mega, Sec, _Micro} = os:timestamp(), 282 | (Mega * 1000000) + Sec. 283 | 284 | %% @doc Return time when values expire (i.e. Now + ExpirySecs), or 0. 285 | -spec expiry_time(pos_integer()) -> pos_integer(). 286 | expiry_time(ExpirySecs) when ExpirySecs > 0 -> 287 | tstamp() + ExpirySecs. 288 | 289 | -spec has_expired(pos_integer()) -> true|false. 290 | has_expired(Expiration) when Expiration > 0 -> 291 | Expiration < tstamp(); 292 | has_expired(infinity) -> 293 | false. 294 | 295 | 296 | ensure_expiry(Opts) -> 297 | case hanoidb:get_opt(expiry_secs, Opts) of 298 | undefined -> 299 | try exit(err) 300 | catch 301 | exit:err -> 302 | io:format(user, "~p~n", [erlang:get_stacktrace()]) 303 | end, 304 | exit(expiry_secs_not_set); 305 | N when N >= 0 -> 306 | ok 307 | end. 308 | 309 | bloom_type({ebloom, _}) -> 310 | ebloom; 311 | bloom_type({sbloom, _}) -> 312 | sbloom. 313 | 314 | bloom_new(Size, sbloom) -> 315 | {ok, {sbloom, hanoidb_bloom:bloom(Size, 0.01)}}; 316 | bloom_new(Size, ebloom) -> 317 | {ok, Bloom} = ebloom:new(Size, 0.01, Size), 318 | {ok, {ebloom, Bloom}}. 319 | 320 | bloom_to_bin({sbloom, Bloom}) -> 321 | hanoidb_bloom:encode(Bloom); 322 | bloom_to_bin({ebloom, Bloom}) -> 323 | ebloom:serialize(Bloom). 324 | 325 | bin_to_bloom(GZiped = <<16#1F, 16#8B, _/binary>>) -> 326 | bin_to_bloom(GZiped, sbloom); 327 | bin_to_bloom(TermBin = <<131, _/binary>>) -> 328 | erlang:term_to_binary(TermBin); 329 | bin_to_bloom(Blob) -> 330 | bin_to_bloom(Blob, ebloom). 331 | 332 | bin_to_bloom(Binary, sbloom) -> 333 | {ok, {sbloom, hanoidb_bloom:decode(Binary)}}; 334 | bin_to_bloom(Binary, ebloom) -> 335 | {ok, Bloom} = ebloom:deserialize(Binary), 336 | {ok, {ebloom, Bloom}}. 337 | 338 | bloom_insert({sbloom, Bloom}, Key) -> 339 | {ok, {sbloom, hanoidb_bloom:add(Key, Bloom)}}; 340 | bloom_insert({ebloom, Bloom}, Key) -> 341 | ok = ebloom:insert(Bloom, Key), 342 | {ok, {ebloom, Bloom}}. 343 | 344 | bloom_contains({sbloom, Bloom}, Key) -> 345 | hanoidb_bloom:member(Key, Bloom); 346 | bloom_contains({ebloom, Bloom}, Key) -> 347 | ebloom:contains(Bloom, Key). 348 | 349 | -------------------------------------------------------------------------------- /src/hanoidb_writer.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_writer). 26 | -author('Kresten Krab Thorup '). 27 | 28 | -include("hanoidb.hrl"). 29 | 30 | %% 31 | %% Streaming btree writer. Accepts only monotonically increasing keys for put. 32 | %% 33 | 34 | -define(NODE_SIZE, 8*1024). 35 | 36 | -behavior(gen_server). 37 | 38 | %% gen_server callbacks 39 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2, 40 | terminate/2, code_change/3, serialize/1, deserialize/1]). 41 | 42 | -export([open/1, open/2, add/3, count/1, close/1]). 43 | 44 | -record(node, {level :: integer(), 45 | members=[] :: [ {key(), expvalue()} ], 46 | size=0 :: integer()}). 47 | 48 | -record(state, {index_file :: file:io_device() | undefined, 49 | index_file_pos :: integer(), 50 | 51 | last_node_pos :: pos_integer(), 52 | last_node_size :: pos_integer(), 53 | 54 | nodes = [] :: list(#node{}), 55 | 56 | name :: string(), 57 | 58 | bloom :: {ebloom, term()} | {sbloom, term()}, 59 | block_size = ?NODE_SIZE :: integer(), 60 | compress = none :: none | snappy | gzip | lz4, 61 | opts = [] :: list(any()), 62 | 63 | value_count = 0 :: integer(), 64 | tombstone_count = 0 :: integer() 65 | }). 66 | 67 | 68 | %%% PUBLIC API 69 | 70 | open(Name,Options) -> 71 | hanoidb_util:ensure_expiry(Options), 72 | gen_server:start_link(?MODULE, [Name, Options], []). 73 | 74 | open(Name) -> 75 | gen_server:start_link(?MODULE, [Name,[{expiry_secs,0}]], []). 76 | 77 | add(Ref, Key, Value) -> 78 | gen_server:cast(Ref, {add, Key, Value}). 79 | 80 | %% @doc Return number of KVs added to this writer so far 81 | count(Ref) -> 82 | gen_server:call(Ref, count, infinity). 83 | 84 | %% @doc Close the btree index file 85 | close(Ref) -> 86 | gen_server:call(Ref, close, infinity). 87 | 88 | %%% 89 | 90 | init([Name, Options]) -> 91 | hanoidb_util:ensure_expiry(Options), 92 | Size = proplists:get_value(size, Options, 2048), 93 | 94 | case do_open(Name, Options, [exclusive]) of 95 | {ok, IdxFile} -> 96 | ok = file:write(IdxFile, ?FILE_FORMAT), 97 | {ok, Bloom} = ?BLOOM_NEW(Size), 98 | BlockSize = hanoidb:get_opt(block_size, Options, ?NODE_SIZE), 99 | {ok, #state{ name=Name, 100 | index_file_pos=?FIRST_BLOCK_POS, index_file=IdxFile, 101 | bloom = Bloom, 102 | block_size = BlockSize, 103 | compress = hanoidb:get_opt(compress, Options, none), 104 | opts = Options 105 | }}; 106 | {error, _}=Error -> 107 | error_logger:error_msg("hanoidb_writer cannot open ~p: ~p~n", [Name, Error]), 108 | {stop, Error} 109 | end. 110 | 111 | 112 | handle_cast({add, Key, {?TOMBSTONE, TStamp}}, State) 113 | when is_binary(Key) -> 114 | NewState = 115 | case hanoidb_util:has_expired(TStamp) of 116 | true -> 117 | State; 118 | false -> 119 | {ok, State2} = append_node(0, Key, {?TOMBSTONE, TStamp}, State), 120 | State2 121 | end, 122 | {noreply, NewState}; 123 | handle_cast({add, Key, ?TOMBSTONE}, State) 124 | when is_binary(Key) -> 125 | {ok, NewState} = append_node(0, Key, ?TOMBSTONE, State), 126 | {noreply, NewState}; 127 | handle_cast({add, Key, {Value, TStamp}}, State) 128 | when is_binary(Key), is_binary(Value) -> 129 | NewState = 130 | case hanoidb_util:has_expired(TStamp) of 131 | true -> 132 | State; 133 | false -> 134 | {ok, State2} = append_node(0, Key, {Value, TStamp}, State), 135 | State2 136 | end, 137 | {noreply, NewState}; 138 | handle_cast({add, Key, Value}, State) 139 | when is_binary(Key), is_binary(Value) -> 140 | {ok, State2} = append_node(0, Key, Value, State), 141 | {noreply, State2}. 142 | 143 | handle_call(count, _From, State = #state{ value_count=VC, tombstone_count=TC }) -> 144 | {ok, VC+TC, State}; 145 | handle_call(close, _From, State) -> 146 | {ok, State2} = archive_nodes(State), 147 | {stop, normal, ok, State2}. 148 | 149 | handle_info(Info, State) -> 150 | error_logger:error_msg("Unknown info ~p~n", [Info]), 151 | {stop, bad_msg, State}. 152 | 153 | terminate(normal,_State) -> 154 | ok; 155 | terminate(_Reason, State) -> 156 | %% premature delete -> cleanup 157 | _ignore = file:close(State#state.index_file), 158 | file:delete(hanoidb_util:index_file_name(State#state.name)). 159 | 160 | code_change(_OldVsn, State, _Extra) -> 161 | {ok, State}. 162 | 163 | 164 | %% INTERNAL FUNCTIONS 165 | serialize(#state{ bloom=Bloom, index_file=File, index_file_pos=Position }=State) -> 166 | case file:position(File, {eof, 0}) of 167 | {ok, Position} -> 168 | ok; 169 | {ok, WrongPosition} -> 170 | exit({bad_position, Position, WrongPosition}) 171 | end, 172 | ok = file:close(File), 173 | erlang:term_to_binary( { State#state{ index_file=undefined, bloom=undefined }, ?BLOOM_TO_BIN(Bloom), hanoidb_util:bloom_type(Bloom) } ). 174 | 175 | deserialize(Binary) -> 176 | {State, Bin, Type} = erlang:binary_to_term(Binary), 177 | {ok, Bloom} = ?BIN_TO_BLOOM(Bin, Type), 178 | {ok, IdxFile} = do_open(State#state.name, State#state.opts, []), 179 | State#state{ bloom=Bloom, index_file=IdxFile }. 180 | 181 | 182 | do_open(Name, Options, OpenOpts) -> 183 | WriteBufferSize = hanoidb:get_opt(write_buffer_size, Options, 512 * 1024), 184 | file:open(hanoidb_util:index_file_name(Name), 185 | [raw, append, {delayed_write, WriteBufferSize, 2000} | OpenOpts]). 186 | 187 | 188 | %% @doc flush pending nodes and write trailer 189 | archive_nodes(#state{ nodes=[], last_node_pos=LastNodePos, last_node_size=_LastNodeSize, bloom=Bloom, index_file=IdxFile }=State) -> 190 | 191 | BloomBin = ?BLOOM_TO_BIN(Bloom), 192 | true = is_binary(BloomBin), 193 | BloomSize = byte_size(BloomBin), 194 | RootPos = 195 | case LastNodePos of 196 | undefined -> 197 | %% store contains no entries 198 | ok = file:write(IdxFile, <<0:32/unsigned, 0:16/unsigned>>), 199 | ?FIRST_BLOCK_POS; 200 | _ -> 201 | LastNodePos 202 | end, 203 | Trailer = [ << 0:32/unsigned>> , BloomBin, << BloomSize:32/unsigned, RootPos:64/unsigned >> ], 204 | 205 | ok = file:write(IdxFile, Trailer), 206 | ok = file:datasync(IdxFile), 207 | ok = file:close(IdxFile), 208 | {ok, State#state{ index_file=undefined, index_file_pos=undefined, bloom=undefined }}; 209 | 210 | archive_nodes(State=#state{ nodes=[#node{level=N, members=[{_,{Pos,_Len}}]}], last_node_pos=Pos }) 211 | when N > 0 -> 212 | %% Ignore this node, its stack consists of one node with one {pos,len} member 213 | archive_nodes(State#state{ nodes=[] }); 214 | 215 | archive_nodes(State) -> 216 | {ok, State2} = flush_node_buffer(State), 217 | archive_nodes(State2). 218 | 219 | 220 | append_node(Level, Key, Value, State=#state{ nodes=[] }) -> 221 | append_node(Level, Key, Value, State#state{ nodes=[ #node{ level=Level } ] }); 222 | append_node(Level, Key, Value, State=#state{ nodes=[ #node{level=Level2 } |_]=Stack }) 223 | when Level < Level2 -> 224 | append_node(Level, Key, Value, State#state{ nodes=[ #node{ level=(Level2 - 1) } | Stack] }); 225 | append_node(Level, Key, Value, #state{ nodes=[ #node{level=Level, members=List, size=NodeSize}=CurrNode | RestNodes ], value_count=VC, tombstone_count=TC, bloom=Bloom }=State) 226 | when Bloom /= undefined -> 227 | %% The top-of-stack node is at the level we wish to insert at. 228 | 229 | %% Assert that keys are increasing: 230 | case List of 231 | [] -> 232 | ok; 233 | [{PrevKey,_}|_] -> 234 | if 235 | (Key >= PrevKey) -> ok; 236 | true -> 237 | error_logger:error_msg("keys not ascending ~p < ~p~n", [PrevKey, Key]), 238 | exit({badarg, Key}) 239 | end 240 | end, 241 | NewSize = NodeSize + hanoidb_util:estimate_node_size_increment(List, Key, Value), 242 | 243 | {ok,Bloom2} = case Level of 244 | 0 -> 245 | ?BLOOM_INSERT(Bloom, Key); 246 | _ -> 247 | {ok,Bloom} 248 | end, 249 | 250 | {TC1, VC1} = 251 | case Level of 252 | 0 -> 253 | case Value of 254 | ?TOMBSTONE -> 255 | {TC+1, VC}; 256 | {?TOMBSTONE, _} -> %% Matched when this Value can expire 257 | {TC+1, VC}; 258 | _ -> 259 | {TC, VC+1} 260 | end; 261 | _ -> 262 | {TC, VC} 263 | end, 264 | 265 | NodeMembers = [{Key, Value} | List], 266 | State2 = State#state{ nodes=[CurrNode#node{members=NodeMembers, size=NewSize} | RestNodes], 267 | value_count=VC1, tombstone_count=TC1, bloom=Bloom2 }, 268 | 269 | case NewSize >= State#state.block_size of 270 | true -> 271 | flush_node_buffer(State2); 272 | false -> 273 | {ok, State2} 274 | end. 275 | 276 | flush_node_buffer(#state{nodes=[#node{ level=Level, members=NodeMembers }|RestNodes], compress=Compress, index_file_pos=NodePos } = State) -> 277 | 278 | OrderedMembers = lists:reverse(NodeMembers), 279 | {ok, BlockData} = hanoidb_util:encode_index_node(OrderedMembers, Compress), 280 | 281 | BlockSize = erlang:iolist_size(BlockData), 282 | Data = [ <<(BlockSize+2):32/unsigned, Level:16/unsigned>> | BlockData ], 283 | DataSize = BlockSize + 6, 284 | 285 | ok = file:write(State#state.index_file, Data), 286 | 287 | {FirstKey, _} = hd(OrderedMembers), 288 | append_node(Level + 1, FirstKey, {NodePos, DataSize}, 289 | State#state{ nodes = RestNodes, 290 | index_file_pos = NodePos + DataSize, 291 | last_node_pos = NodePos, 292 | last_node_size = DataSize }). 293 | -------------------------------------------------------------------------------- /src/plain_rpc.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% plain_rpc: RPC module to accompany plain_fsm 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% This file is provided to you under the Apache License, Version 2.0 (the 9 | %% "License"); you may not use this file except in compliance with the License. 10 | %% You may obtain a copy of the License at 11 | %% 12 | %% http://www.apache.org/licenses/LICENSE-2.0 13 | %% 14 | %% Unless required by applicable law or agreed to in writing, software 15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 17 | %% License for the specific language governing permissions and limitations 18 | %% under the License. 19 | %% 20 | %% ---------------------------------------------------------------------------- 21 | 22 | -module(plain_rpc). 23 | -author('Kresten Krab Thorup '). 24 | 25 | -export([send_call/2, receive_reply/1, send_reply/2, call/2, call/3, cast/2]). 26 | 27 | -include("include/plain_rpc.hrl"). 28 | 29 | 30 | send_call(PID, Request) -> 31 | Ref = erlang:monitor(process, PID), 32 | PID ! ?CALL({self(), Ref}, Request), 33 | Ref. 34 | 35 | cast(PID, Msg) -> 36 | PID ! ?CAST(self(), Msg). 37 | 38 | receive_reply(MRef) -> 39 | receive 40 | ?REPLY(MRef, Reply) -> 41 | erlang:demonitor(MRef, [flush]), 42 | Reply; 43 | {'DOWN', MRef, _, _, Reason} -> 44 | exit(Reason) 45 | end. 46 | 47 | send_reply({PID,Ref}, Reply) -> 48 | _ = erlang:send(PID, ?REPLY(Ref, Reply)), 49 | ok. 50 | 51 | call(PID,Request) -> 52 | call(PID, Request, infinity). 53 | 54 | call(PID,Request,Timeout) -> 55 | MRef = erlang:monitor(process, PID), 56 | PID ! ?CALL({self(), MRef}, Request), 57 | receive 58 | ?REPLY(MRef, Reply) -> 59 | erlang:demonitor(MRef, [flush]), 60 | Reply; 61 | {'DOWN', MRef, _, _, Reason} -> 62 | exit(Reason) 63 | after Timeout -> 64 | erlang:demonitor(MRef, [flush]), 65 | exit({rpc_timeout, Request}) 66 | end. 67 | 68 | 69 | -------------------------------------------------------------------------------- /src/vbisect.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2014 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% This file is provided to you under the Apache License, Version 2.0 (the 9 | %% "License"); you may not use this file except in compliance with the License. 10 | %% You may obtain a copy of the License at 11 | %% 12 | %% http://www.apache.org/licenses/LICENSE-2.0 13 | %% 14 | %% Unless required by applicable law or agreed to in writing, software 15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 17 | %% License for the specific language governing permissions and limitations 18 | %% under the License. 19 | %% 20 | %% ---------------------------------------------------------------------------- 21 | 22 | 23 | -module(vbisect). 24 | 25 | -export([from_orddict/1, 26 | from_gb_tree/1, 27 | to_gb_tree/1, 28 | first_key/1, 29 | find/2, find_geq/2, 30 | foldl/3, foldr/3, fold_until_stop/3, 31 | to_orddict/1, 32 | merge/3]). 33 | 34 | -define(MAGIC, "vbis"). 35 | -type key() :: binary(). 36 | -type value() :: binary(). 37 | -type bindict() :: binary(). 38 | 39 | -ifdef(TEST). 40 | -include_lib("eunit/include/eunit.hrl"). 41 | -endif. 42 | 43 | -spec from_gb_tree(gb_trees:tree()) -> bindict(). 44 | from_gb_tree({Count,Node}) when Count =< 16#ffffffff -> 45 | {_BinSize,IOList} = encode_gb_node(Node), 46 | erlang:iolist_to_binary([ <> | IOList ]). 47 | 48 | encode_gb_node({Key, Value, Smaller, Bigger}) when is_binary(Key), is_binary(Value) -> 49 | {BinSizeSmaller, IOSmaller} = encode_gb_node(Smaller), 50 | {BinSizeBigger, IOBigger} = encode_gb_node(Bigger), 51 | 52 | KeySize = byte_size(Key), 53 | ValueSize = byte_size(Value), 54 | { 2 + KeySize 55 | + 4 + ValueSize 56 | + 4 + BinSizeSmaller 57 | + BinSizeBigger, 58 | 59 | [ << KeySize:16, Key/binary, 60 | BinSizeSmaller:32 >>, IOSmaller, 61 | << ValueSize:32, Value/binary >> | IOBigger ] }; 62 | 63 | encode_gb_node(nil) -> 64 | { 0, [] }. 65 | 66 | to_gb_tree(<>) -> 67 | { Count, to_gb_node(Nodes) }. 68 | 69 | to_gb_node( <<>> ) -> 70 | nil; 71 | 72 | to_gb_node( << KeySize:16, Key:KeySize/binary, 73 | BinSizeSmaller:32, Smaller:BinSizeSmaller/binary, 74 | ValueSize:32, Value:ValueSize/binary, 75 | Bigger/binary >> ) -> 76 | {Key, Value, 77 | to_gb_node(Smaller), 78 | to_gb_node(Bigger)}. 79 | 80 | -spec find(Key::key(), Dict::bindict()) -> 81 | { ok, value() } | error. 82 | find(Key, <>) -> 83 | find_node(byte_size(Key), Key, Binary). 84 | 85 | find_node(KeySize, Key, <> = Bin) -> 89 | if 90 | Key < HereKey -> 91 | Skip = 6 + HereKeySize, 92 | << _:Skip/binary, Smaller:BinSizeSmaller/binary, _/binary>> = Bin, 93 | find_node(KeySize, Key, Smaller); 94 | HereKey < Key -> 95 | Skip = 10 + HereKeySize + BinSizeSmaller + ValueSize, 96 | << _:Skip/binary, Bigger/binary>> = Bin, 97 | find_node(KeySize, Key, Bigger); 98 | true -> 99 | {ok, Value} 100 | end; 101 | 102 | find_node(_, _, <<>>) -> 103 | error. 104 | 105 | to_orddict(BinDict) -> 106 | foldr(fun(Key,Value,Acc) -> 107 | [{Key,Value}|Acc] 108 | end, 109 | [], 110 | BinDict). 111 | 112 | merge(Fun, BinDict1, BinDict2) -> 113 | OD1 = to_orddict(BinDict1), 114 | OD2 = to_orddict(BinDict2), 115 | OD3 = orddict:merge(Fun, OD1, OD2), 116 | from_orddict(OD3). 117 | 118 | -spec first_key( bindict() ) -> binary() | none. 119 | first_key(BinDict) -> 120 | {_, Key} = fold_until_stop(fun({K,_},_) -> {stop, K} end, none, BinDict), 121 | Key. 122 | 123 | %% @doc Find largest {K,V} where K is smaller than or equal to key. 124 | %% This is good for an inner node where key is the smallest key 125 | %% in the child node. 126 | 127 | -spec find_geq(Key::binary(), Binary::binary()) -> 128 | none | {ok, Key::key(), Value::value()}. 129 | 130 | find_geq(Key, <>) -> 131 | find_geq_node(byte_size(Key), Key, Binary, none). 132 | 133 | find_geq_node(_, _, <<>>, Else) -> 134 | Else; 135 | 136 | find_geq_node(KeySize, Key, <> = Bin, Else) -> 140 | if 141 | Key < HereKey -> 142 | Skip = 6 + HereKeySize, 143 | << _:Skip/binary, Smaller:BinSizeSmaller/binary, _/binary>> = Bin, 144 | find_geq_node(KeySize, Key, Smaller, Else); 145 | HereKey < Key -> 146 | Skip = 10 + HereKeySize + BinSizeSmaller + ValueSize, 147 | << _:Skip/binary, Bigger/binary>> = Bin, 148 | find_geq_node(KeySize, Key, Bigger, {ok, HereKey, Value}); 149 | true -> 150 | {ok, HereKey, Value} 151 | end. 152 | 153 | -spec foldl(fun((Key::key(), Value::value(), Acc::term()) -> term()), term(), bindict()) -> 154 | term(). 155 | foldl(Fun, Acc, <>) -> 156 | foldl_node(Fun, Acc, Binary). 157 | 158 | foldl_node(_Fun, Acc, <<>>) -> 159 | Acc; 160 | 161 | foldl_node(Fun, Acc, <>) -> 165 | Acc1 = foldl_node(Fun, Acc, Smaller), 166 | Acc2 = Fun(Key, Value, Acc1), 167 | foldl_node(Fun, Acc2, Bigger). 168 | 169 | 170 | -spec fold_until_stop(function(), term(), bindict()) -> {stopped, term()} | {ok, term()}. 171 | 172 | fold_until_stop(Fun, Acc, <>) -> 173 | fold_until_stop2(Fun, {continue, Acc}, Bin). 174 | 175 | fold_until_stop2(_Fun,{stop,Result},_) -> 176 | {stopped, Result}; 177 | fold_until_stop2(_Fun,{continue, Acc},<<>>) -> 178 | {ok, Acc}; 179 | fold_until_stop2(Fun,{continue, Acc}, <>) -> 183 | 184 | case fold_until_stop2(Fun, {continue, Acc}, Smaller) of 185 | {stopped, Result} -> 186 | {stopped, Result}; 187 | {ok, Acc1} -> 188 | ContinueOrStopAcc = Fun({Key,Value}, Acc1), 189 | fold_until_stop2(Fun, ContinueOrStopAcc, Bigger) 190 | end. 191 | 192 | 193 | -spec foldr(fun((Key::key(), Value::value(), Acc::term()) -> term()), term(), bindict()) -> 194 | term(). 195 | foldr(Fun, Acc, <>) -> 196 | foldr_node(Fun, Acc, Binary). 197 | 198 | foldr_node(_Fun, Acc, <<>>) -> 199 | Acc; 200 | 201 | foldr_node(Fun, Acc, <>) -> 205 | Acc1 = foldr_node(Fun, Acc, Bigger), 206 | Acc2 = Fun(Key, Value, Acc1), 207 | foldr_node(Fun, Acc2, Smaller). 208 | 209 | 210 | from_orddict(OrdDict) -> 211 | from_gb_tree(gb_trees:from_orddict(OrdDict)). 212 | 213 | -ifdef(TEST). 214 | 215 | speed_test_() -> 216 | {timeout, 600, 217 | fun() -> 218 | Start = 100000000000000, 219 | N = 100000, 220 | Keys = lists:seq(Start, Start+N), 221 | KeyValuePairs = lists:map(fun (I) -> {<>, <<255:8/integer>>} end, 222 | Keys), 223 | 224 | %% Will mostly be unique, if N is bigger than 10000 225 | ReadKeys = [<<(lists:nth(random:uniform(N), Keys)):64/integer>> || _ <- lists:seq(1, 1000)], 226 | B = from_orddict(KeyValuePairs), 227 | time_reads(B, N, ReadKeys) 228 | end}. 229 | 230 | 231 | geq_test() -> 232 | B = from_orddict([{<<2>>,<<2>>},{<<4>>,<<4>>},{<<6>>,<<6>>},{<<122>>,<<122>>}]), 233 | none = find_geq(<<1>>, B), 234 | {ok, <<2>>, <<2>>} = find_geq(<<2>>, B), 235 | {ok, <<2>>, <<2>>} = find_geq(<<3>>, B), 236 | {ok, <<4>>, <<4>>} = find_geq(<<5>>, B), 237 | {ok, <<6>>, <<6>>} = find_geq(<<100>>, B), 238 | {ok, <<122>>, <<122>>} = find_geq(<<150>>, B), 239 | true. 240 | 241 | 242 | time_reads(B, Size, ReadKeys) -> 243 | Parent = self(), 244 | spawn( 245 | fun() -> 246 | Runs = 20, 247 | Timings = 248 | lists:map( 249 | fun (_) -> 250 | StartTime = now(), 251 | find_many(B, ReadKeys), 252 | timer:now_diff(now(), StartTime) 253 | end, lists:seq(1, Runs)), 254 | 255 | Rps = 1000000 / ((lists:sum(Timings) / length(Timings)) / 1000), 256 | error_logger:info_msg("Average over ~p runs, ~p keys in dict~n" 257 | "Average fetch ~p keys: ~p us, max: ~p us~n" 258 | "Average fetch 1 key: ~p us~n" 259 | "Theoretical sequential RPS: ~w~n", 260 | [Runs, Size, length(ReadKeys), 261 | lists:sum(Timings) / length(Timings), 262 | lists:max(Timings), 263 | (lists:sum(Timings) / length(Timings)) / length(ReadKeys), 264 | trunc(Rps)]), 265 | 266 | Parent ! done 267 | end), 268 | receive done -> ok after 1000 -> ok end. 269 | 270 | -spec find_many(bindict(), [key()]) -> non_neg_integer(). 271 | find_many(B, Keys) -> 272 | lists:foldl(fun (K, N) -> 273 | case find(K, B) of 274 | {ok, _} -> N+1; 275 | error -> N 276 | end 277 | end, 278 | 0, Keys). 279 | 280 | -endif. 281 | -------------------------------------------------------------------------------- /test/hanoidb_drv.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | %% @doc Drive a set of LSM BTrees 26 | -module(hanoidb_drv). 27 | 28 | -behaviour(gen_server). 29 | 30 | %% API 31 | -export([start_link/0]). 32 | 33 | -export([ 34 | delete_exist/2, 35 | get_exist/2, 36 | get_fail/2, 37 | open/1, close/1, 38 | put/3, 39 | fold_range/4, 40 | stop/0]). 41 | 42 | %% gen_server callbacks 43 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2, 44 | terminate/2, code_change/3]). 45 | 46 | -define(SERVER, ?MODULE). 47 | 48 | -record(state, { btrees = dict:new() % Map from a name to its tree 49 | }). 50 | 51 | %%%=================================================================== 52 | 53 | start_link() -> 54 | gen_server:start_link({local, ?SERVER}, ?MODULE, [], []). 55 | 56 | call(X) -> 57 | gen_server:call(?SERVER, X, infinity). 58 | 59 | get_exist(N, K) -> 60 | call({get, N, K}). 61 | 62 | get_fail(N, K) -> 63 | call({get, N, K}). 64 | 65 | delete_exist(N, K) -> 66 | call({delete_exist, N, K}). 67 | 68 | open(N) -> 69 | call({open, N}). 70 | 71 | close(N) -> 72 | call({close, N}). 73 | 74 | put(N, K, V) -> 75 | call({put, N, K, V}). 76 | 77 | fold_range(T, Fun, Acc0, Range) -> 78 | call({fold_range, T, Fun, Acc0, Range}). 79 | 80 | stop() -> 81 | call(stop). 82 | 83 | %%%=================================================================== 84 | 85 | init([]) -> 86 | {ok, #state{}}. 87 | 88 | handle_call({open, N}, _, #state { btrees = D} = State) -> 89 | case hanoidb:open(N) of 90 | {ok, Tree} -> 91 | {reply, ok, State#state { btrees = dict:store(N, Tree, D)}}; 92 | Otherwise -> 93 | {reply, {error, Otherwise}, State} 94 | end; 95 | handle_call({close, N}, _, #state { btrees = D} = State) -> 96 | Tree = dict:fetch(N, D), 97 | case hanoidb:close(Tree) of 98 | ok -> 99 | {reply, ok, State#state { btrees = dict:erase(N, D)}}; 100 | Otherwise -> 101 | {reply, {error, Otherwise}, State} 102 | end; 103 | handle_call({fold_range, Name, Fun, Acc0, Range}, 104 | _From, 105 | #state { btrees = D } = State) -> 106 | Tree = dict:fetch(Name, D), 107 | Result = hanoidb:fold_range(Tree, Fun, Acc0, Range), 108 | {reply, Result, State}; 109 | handle_call({put, N, K, V}, _, #state { btrees = D} = State) -> 110 | Tree = dict:fetch(N, D), 111 | case hanoidb:put(Tree, K, V) of 112 | ok -> 113 | {reply, ok, State}; 114 | Other -> 115 | {reply, {error, Other}, State} 116 | end; 117 | handle_call({delete_exist, N, K}, _, #state { btrees = D} = State) -> 118 | Tree = dict:fetch(N, D), 119 | Reply = hanoidb:delete(Tree, K), 120 | {reply, Reply, State}; 121 | handle_call({get, N, K}, _, #state { btrees = D} = State) -> 122 | Tree = dict:fetch(N, D), 123 | Reply = hanoidb:get(Tree, K), 124 | {reply, Reply, State}; 125 | handle_call(stop, _, #state{ btrees = D } = State ) -> 126 | [ hanoidb:close(Tree) || {_,Tree} <- dict:to_list(D) ], 127 | {stop, normal, ok, State}; 128 | handle_call(_Request, _From, State) -> 129 | Reply = ok, 130 | {reply, Reply, State}. 131 | 132 | handle_cast(_Msg, State) -> 133 | {noreply, State}. 134 | 135 | handle_info(_Info, State) -> 136 | {noreply, State}. 137 | 138 | terminate(_Reason, _State) -> 139 | ok. 140 | 141 | code_change(_OldVsn, State, _Extra) -> 142 | {ok, State}. 143 | 144 | -------------------------------------------------------------------------------- /test/hanoidb_merger_tests.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_merger_tests). 26 | 27 | -ifdef(TEST). 28 | -include_lib("eunit/include/eunit.hrl"). 29 | -endif. 30 | 31 | -compile(export_all). 32 | 33 | merge_test() -> 34 | 35 | file:delete("test1"), 36 | file:delete("test2"), 37 | file:delete("test3"), 38 | 39 | {ok, BT1} = hanoidb_writer:open("test1", [{expiry_secs, 0}]), 40 | lists:foldl(fun(N,_) -> 41 | ok = hanoidb_writer:add(BT1, <>, <<"data",N:128>>) 42 | end, 43 | ok, 44 | lists:seq(1,10000,2)), 45 | ok = hanoidb_writer:close(BT1), 46 | 47 | 48 | {ok, BT2} = hanoidb_writer:open("test2", [{expiry_secs, 0}]), 49 | lists:foldl(fun(N,_) -> 50 | ok = hanoidb_writer:add(BT2, <>, <<"data",N:128>>) 51 | end, 52 | ok, 53 | lists:seq(2,5001,1)), 54 | ok = hanoidb_writer:close(BT2), 55 | 56 | 57 | self() ! {step, {self(), none}, 2000000000}, 58 | {Time,{ok,Count}} = timer:tc(hanoidb_merger, merge, ["test1", "test2", "test3", 10000, true, [{expiry_secs, 0}]]), 59 | 60 | % error_logger:info_msg("time to merge: ~p/sec (time=~p, count=~p)~n", [1000000/(Time/Count), Time/1000000, Count]), 61 | 62 | ok. 63 | 64 | -------------------------------------------------------------------------------- /test/hanoidb_tests.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_tests). 26 | 27 | -include("include/hanoidb.hrl"). 28 | -include("src/hanoidb.hrl"). 29 | 30 | -ifdef(TEST). 31 | -ifdef(TRIQ). 32 | -include_lib("triq/include/triq.hrl"). 33 | -include_lib("triq/include/triq_statem.hrl"). 34 | -else. 35 | -include_lib("proper/include/proper.hrl"). 36 | -endif. 37 | -include_lib("eunit/include/eunit.hrl"). 38 | -endif. 39 | 40 | -ifdef(PROPER). 41 | -behaviour(proper_statem). 42 | -endif. 43 | 44 | -compile(export_all). 45 | 46 | -export([command/1, initial_state/0, 47 | next_state/3, postcondition/3, 48 | precondition/2]). 49 | 50 | -ifdef(pre18). 51 | -define(OTP_DICT, dict()). 52 | -else. 53 | -define(OTP_DICT, dict:dict()). 54 | -endif. 55 | 56 | -record(tree, { elements = dict:new() :: ?OTP_DICT }). 57 | -record(state, { open = dict:new() :: ?OTP_DICT, 58 | closed = dict:new() :: ?OTP_DICT}). 59 | -define(SERVER, hanoidb_drv). 60 | 61 | full_test_() -> 62 | {setup, spawn, fun () -> ok end, fun (_) -> ok end, 63 | [ 64 | ?_test(test_tree_simple_1()), 65 | ?_test(test_tree_simple_2()), 66 | ?_test(test_tree_simple_4()), 67 | ?_test(test_tree_simple_5()) 68 | ]}. 69 | 70 | longer_tree_test_() -> 71 | {setup, 72 | spawn, 73 | fun () -> ok end, 74 | fun (_) -> ok end, 75 | [ 76 | {timeout, 300, ?_test(test_tree())} 77 | ]}. 78 | 79 | longer_qc_test_() -> 80 | {setup, 81 | spawn, 82 | fun () -> ok end, 83 | fun (_) -> ok end, 84 | [ 85 | {timeout, 120, ?_test(test_qc())} 86 | ]}. 87 | 88 | -ifdef(TRIQ). 89 | test_qc() -> 90 | [?assertEqual(true, triq:module(?MODULE))]. 91 | -else. 92 | qc_opts() -> [{numtests, 800}]. 93 | test_qc() -> 94 | [?assertEqual([], proper:module(?MODULE, qc_opts()))]. 95 | -endif. 96 | 97 | %% Generators 98 | %% ---------------------------------------------------------------------- 99 | 100 | -define(NUM_TREES, 10). 101 | 102 | %% Generate a name for a btree 103 | g_btree_name() -> 104 | ?LET(I, choose(1,?NUM_TREES), 105 | btree_name(I)). 106 | 107 | %% Generate a key for the Tree 108 | g_key() -> 109 | binary(). 110 | 111 | %% Generate a value for the Tree 112 | g_value() -> 113 | binary(). 114 | 115 | g_fail_key() -> 116 | ?LET(T, choose(1,999999999999), 117 | term_to_binary(T)). 118 | 119 | g_open_tree(Open) -> 120 | oneof(dict:fetch_keys(Open)). 121 | 122 | %% Pick a name of a non-empty Btree 123 | g_non_empty_btree(Open) -> 124 | ?LET(TreesWithKeys, dict:filter(fun(_K, #tree { elements = D}) -> 125 | dict:size(D) > 0 126 | end, 127 | Open), 128 | oneof(dict:fetch_keys(TreesWithKeys))). 129 | 130 | g_existing_key(Name, Open) -> 131 | #tree { elements = Elems } = dict:fetch(Name, Open), 132 | oneof(dict:fetch_keys(Elems)). 133 | 134 | g_non_existing_key(Name, Open) -> 135 | ?SUCHTHAT(Key, g_fail_key(), 136 | begin 137 | #tree { elements = D } = dict:fetch(Name, Open), 138 | not dict:is_key(Key, D) 139 | end). 140 | 141 | g_fold_operation() -> 142 | oneof([{fun (K, V, Acc) -> [{K, V} | Acc] end, []}]). 143 | 144 | btree_name(I) -> 145 | "Btree_" ++ integer_to_list(I). 146 | 147 | %% Statem test 148 | %% ---------------------------------------------------------------------- 149 | initial_state() -> 150 | ClosedBTrees = lists:foldl(fun(N, Closed) -> 151 | dict:store(btree_name(N), 152 | #tree { }, 153 | Closed) 154 | end, 155 | dict:new(), 156 | lists:seq(1,?NUM_TREES)), 157 | #state { closed=ClosedBTrees }. 158 | 159 | 160 | command(#state { open = Open, closed = Closed } = S) -> 161 | frequency( 162 | [ {20, {call, ?SERVER, open, [oneof(dict:fetch_keys(Closed))]}} 163 | || closed_dicts(S)] 164 | ++ [ {20, {call, ?SERVER, close, [oneof(dict:fetch_keys(Open))]}} 165 | || open_dicts(S)] 166 | ++ [ {2000, {call, ?SERVER, put, cmd_put_args(S)}} 167 | || open_dicts(S)] 168 | ++ [ {1500, {call, ?SERVER, get_fail, cmd_get_fail_args(S)}} 169 | || open_dicts(S)] 170 | ++ [ {1500, {call, ?SERVER, get_exist, cmd_get_args(S)}} 171 | || open_dicts(S), open_dicts_with_keys(S)] 172 | ++ [ {500, {call, ?SERVER, delete_exist, cmd_delete_args(S)}} 173 | || open_dicts(S), open_dicts_with_keys(S)] 174 | ++ [ {125, {call, ?SERVER, fold_range, cmd_sync_fold_range_args(S)}} 175 | || open_dicts(S), open_dicts_with_keys(S)] 176 | ). 177 | 178 | %% Precondition (abstract) 179 | precondition(S, {call, ?SERVER, fold_range, [_Tree, _F, _A0, Range]}) -> 180 | is_valid_range(Range) andalso open_dicts(S) andalso open_dicts_with_keys(S); 181 | precondition(S, {call, ?SERVER, delete_exist, [_Name, _K]}) -> 182 | open_dicts(S) andalso open_dicts_with_keys(S); 183 | precondition(S, {call, ?SERVER, get_fail, [_Name, _K]}) -> 184 | open_dicts(S); 185 | precondition(S, {call, ?SERVER, get_exist, [_Name, _K]}) -> 186 | open_dicts(S) andalso open_dicts_with_keys(S); 187 | precondition(#state { open = Open }, {call, ?SERVER, put, [Name, _K, _V]}) -> 188 | dict:is_key(Name, Open); 189 | precondition(#state { open = Open, closed = Closed }, 190 | {call, ?SERVER, open, [Name]}) -> 191 | (not (dict:is_key(Name, Open))) and (dict:is_key(Name, Closed)); 192 | precondition(#state { open = Open, closed = Closed }, 193 | {call, ?SERVER, close, [Name]}) -> 194 | (dict:is_key(Name, Open)) and (not dict:is_key(Name, Closed)). 195 | 196 | is_valid_range(#key_range{ from_key=FromKey, from_inclusive=FromIncl, 197 | to_key=ToKey, to_inclusive=ToIncl, 198 | limit=Limit }) 199 | when 200 | (Limit == undefined) orelse (Limit > 0), 201 | is_binary(FromKey), 202 | (ToKey == undefined) orelse is_binary(ToKey), 203 | FromKey =< ToKey, 204 | is_boolean(FromIncl), 205 | is_boolean(ToIncl) 206 | -> 207 | if (FromKey == ToKey) -> 208 | (FromIncl == true) and (ToIncl == true); 209 | true -> 210 | true 211 | end; 212 | is_valid_range(_) -> 213 | false. 214 | 215 | 216 | %% Next state manipulation (abstract / concrete) 217 | next_state(S, _Res, {call, ?SERVER, fold_range, [_Tree, _F, _A0, _Range]}) -> 218 | S; 219 | next_state(S, _Res, {call, ?SERVER, get_fail, [_Name, _Key]}) -> 220 | S; 221 | next_state(S, _Res, {call, ?SERVER, get_exist, [_Name, _Key]}) -> 222 | S; 223 | next_state(#state { open = Open} = S, _Res, 224 | {call, ?SERVER, delete_exist, [Name, Key]}) -> 225 | S#state { open = dict:update(Name, 226 | fun(#tree { elements = Dict}) -> 227 | #tree { elements = 228 | dict:erase(Key, Dict)} 229 | end, 230 | Open)}; 231 | next_state(#state { open = Open} = S, _Res, 232 | {call, ?SERVER, put, [Name, Key, Value]}) -> 233 | S#state { open = dict:update( 234 | Name, 235 | fun(#tree { elements = Dict}) -> 236 | #tree { elements = 237 | dict:store(Key, Value, Dict) } 238 | end, 239 | Open)}; 240 | next_state(#state { open = Open, closed=Closed} = S, 241 | _Res, {call, ?SERVER, open, [Name]}) -> 242 | S#state { open = dict:store(Name, dict:fetch(Name, Closed) , Open), 243 | closed = dict:erase(Name, Closed) }; 244 | next_state(#state { open = Open, closed=Closed} = S, _Res, 245 | {call, ?SERVER, close, [Name]}) -> 246 | S#state { closed = dict:store(Name, dict:fetch(Name, Open) , Closed), 247 | open = dict:erase(Name, Open) }. 248 | 249 | %% Postcondition check (concrete) 250 | postcondition(#state { open = Open}, 251 | {call, ?SERVER, fold_range, [Tree, F, A0, Range]}, Result) -> 252 | #tree { elements = TDict } = dict:fetch(Tree, Open), 253 | DictResult = lists:sort(dict_range_query(TDict, F, A0, Range)), 254 | CallResult = lists:sort(Result), 255 | DictResult == CallResult; 256 | postcondition(_S, 257 | {call, ?SERVER, get_fail, [_Name, _Key]}, not_found) -> 258 | true; 259 | postcondition(#state { open = Open }, 260 | {call, ?SERVER, get_exist, [Name, Key]}, {ok, Value}) -> 261 | #tree { elements = Elems } = dict:fetch(Name, Open), 262 | dict:fetch(Key, Elems) == Value; 263 | postcondition(_S, {call, ?SERVER, delete_exist, [_Name, _Key]}, ok) -> 264 | true; 265 | postcondition(_S, {call, ?SERVER, put, [_Name, _Key, _Value]}, ok) -> 266 | true; 267 | postcondition(_S, {call, ?SERVER, open, [_Name]}, ok) -> 268 | true; 269 | postcondition(_S, {call, ?SERVER, close, [_Name]}, ok) -> 270 | true; 271 | postcondition(_State, _Call, _Result) -> 272 | % error_logger:error_report([{not_matching_any_postcondition, _State, _Call, _Result}]), 273 | false. 274 | 275 | 276 | %% Main property. Running a random set of commands is in agreement 277 | %% with a dict. 278 | prop_dict_agree() -> 279 | ?FORALL(Cmds, commands(?MODULE), 280 | ?TRAPEXIT( 281 | begin 282 | hanoidb_drv:start_link(), 283 | {History,State,Result} = run_commands(?MODULE, Cmds), 284 | hanoidb_drv:stop(), 285 | cleanup_test_trees(State), 286 | ?WHENFAIL(io:format("History: ~w\nState: ~w\nResult: ~w\n", 287 | [History,State,Result]), 288 | Result =:= ok) 289 | end)). 290 | 291 | %% UNIT TESTS 292 | %% ---------------------------------------------------------------------- 293 | test_tree_simple_1() -> 294 | {ok, Tree} = hanoidb:open("simple"), 295 | ok = hanoidb:put(Tree, <<>>, <<"data", 77:128>>), 296 | {ok, <<"data", 77:128>>} = hanoidb:get(Tree, <<>>), 297 | ok = hanoidb:close(Tree). 298 | 299 | test_tree_simple_2() -> 300 | {ok, Tree} = hanoidb:open("simple"), 301 | ok = hanoidb:put(Tree, <<"ã">>, <<"µ">>), 302 | {ok, <<"µ">>} = hanoidb:get(Tree, <<"ã">>), 303 | ok = hanoidb:delete(Tree, <<"ã">>), 304 | not_found = hanoidb:get(Tree, <<"ã">>), 305 | ok = hanoidb:close(Tree). 306 | 307 | test_tree_simple_4() -> 308 | Key = <<56,11,62,42,35,163,16,100,9,224,8,228,130,94,198,2,126,117,243, 309 | 1,122,175,79,159,212,177,30,153,71,91,85,233,41,199,190,58,3, 310 | 173,220,9>>, 311 | Value = <<212,167,12,6,105,152,17,80,243>>, 312 | {ok, Tree} = hanoidb:open("simple"), 313 | ok = hanoidb:put(Tree, Key, Value), 314 | ?assertEqual({ok, Value}, hanoidb:get(Tree, Key)), 315 | ok = hanoidb:close(Tree). 316 | 317 | test_tree_simple_5() -> 318 | {ok, Tree} = hanoidb:open("simple"), 319 | ok = hanoidb:put(Tree, <<"foo">>, <<"bar">>, 2), 320 | {ok, <<"bar">>} = hanoidb:get(Tree, <<"foo">>), 321 | ok = timer:sleep(3000), 322 | not_found = hanoidb:get(Tree, <<"foo">>), 323 | ok = hanoidb:close(Tree). 324 | 325 | test_tree() -> 326 | {ok, Tree} = hanoidb:open("simple2"), 327 | lists:foldl(fun(N,_) -> 328 | ok = hanoidb:put(Tree, <>, <<"data",N:128>>) 329 | end, 330 | ok, 331 | lists:seq(2,10000,1)), 332 | % io:format(user, "INSERT DONE 1~n", []), 333 | 334 | lists:foldl(fun(N,_) -> 335 | ok = hanoidb:put(Tree, <>, <<"data",N:128>>) 336 | end, 337 | ok, 338 | lists:seq(4000,6000,1)), 339 | % io:format(user, "INSERT DONE 2~n", []), 340 | 341 | hanoidb:delete(Tree, <<1500:128>>), 342 | % io:format(user, "DELETE DONE 3~n", []), 343 | 344 | {Time1,{ok,Count1}} = timer:tc(?MODULE, run_fold, [Tree,1000,2000,9]), 345 | % error_logger:info_msg("time to fold: ~p/sec (time=~p, count=~p)~n", [1000000/(Time1/Count1), Time1/1000000, Count1]), 346 | 347 | {Time2,{ok,Count2}} = timer:tc(?MODULE, run_fold, [Tree,1000,2000,1000]), 348 | % error_logger:info_msg("time to fold: ~p/sec (time=~p, count=~p)~n", [1000000/(Time2/Count2), Time2/1000000, Count2]), 349 | ok = hanoidb:close(Tree). 350 | 351 | run_fold(Tree,From,To,Limit) -> 352 | F = fun(<>, _Value, {N, C}) -> 353 | {N + 1, C + 1}; 354 | (<<1501:128>>, _Value, {1500, C}) -> 355 | {1502, C + 1} 356 | end, 357 | {_, Count} = hanoidb:fold_range(Tree, F, 358 | {From, 0}, 359 | #key_range{from_key= <>, to_key= <<(To+1):128>>, limit=Limit}), 360 | {ok, Count}. 361 | 362 | 363 | %% Command processing 364 | %% ---------------------------------------------------------------------- 365 | cmd_close_args(#state { open = Open }) -> 366 | oneof(dict:fetch_keys(Open)). 367 | 368 | cmd_put_args(#state { open = Open }) -> 369 | ?LET({Name, Key, Value}, 370 | {oneof(dict:fetch_keys(Open)), g_key(), g_value()}, 371 | [Name, Key, Value]). 372 | 373 | 374 | cmd_get_fail_args(#state { open = Open}) -> 375 | ?LET(Name, g_open_tree(Open), 376 | ?LET(Key, g_non_existing_key(Name, Open), 377 | [Name, Key])). 378 | 379 | cmd_get_args(#state { open = Open}) -> 380 | ?LET(Name, g_non_empty_btree(Open), 381 | ?LET(Key, g_existing_key(Name, Open), 382 | [Name, Key])). 383 | 384 | cmd_delete_args(#state { open = Open}) -> 385 | ?LET(Name, g_non_empty_btree(Open), 386 | ?LET(Key, g_existing_key(Name, Open), 387 | [Name, Key])). 388 | 389 | cmd_sync_range_args(#state { open = Open }) -> 390 | ?LET(Tree, g_non_empty_btree(Open), 391 | ?LET({K1, K2}, {g_existing_key(Tree, Open), 392 | g_existing_key(Tree, Open)}, 393 | [Tree, #key_range{from_key=K1, to_key=K2}])). 394 | 395 | cmd_sync_fold_range_args(State) -> 396 | ?LET([Tree, Range], cmd_sync_range_args(State), 397 | ?LET({F, Acc0}, g_fold_operation(), 398 | [Tree, F, Acc0, Range])). 399 | 400 | %% Context management 401 | %% ---------------------------------------------------------------------- 402 | cleanup_test_trees(#state { open = Open, closed = Closed }) -> 403 | [cleanup_tree(N) || N <- dict:fetch_keys(Open)], 404 | [cleanup_tree(N) || N <- dict:fetch_keys(Closed)]. 405 | 406 | cleanup_tree(Tree) -> 407 | case file:list_dir(Tree) of 408 | {error, enoent} -> 409 | ok; 410 | {ok, FileNames} -> 411 | [ok = file:delete(filename:join([Tree, Fname])) 412 | || Fname <- FileNames], 413 | file:del_dir(Tree) 414 | end. 415 | 416 | %% Various Helper routines 417 | %% ---------------------------------------------------------------------- 418 | 419 | open_dicts_with_keys(#state { open = Open}) -> 420 | lists:any(fun({_, #tree { elements = D}}) -> 421 | dict:size(D) > 0 422 | end, 423 | dict:to_list(Open)). 424 | 425 | open_dicts(#state { open = Open}) -> 426 | dict:size(Open) > 0. 427 | 428 | closed_dicts(#state { closed = Closed}) -> 429 | dict:size(Closed) > 0. 430 | 431 | dict_range_query(Dict, Fun, Acc0, Range) -> 432 | KVs = dict_range_query(Dict, Range), 433 | lists:foldl(fun({K, V}, Acc) -> 434 | Fun(K, V, Acc) 435 | end, 436 | Acc0, 437 | KVs). 438 | 439 | dict_range_query(Dict, Range) -> 440 | [{K, V} || {K, V} <- dict:to_list(Dict), 441 | ?KEY_IN_RANGE(K, Range)]. 442 | 443 | -------------------------------------------------------------------------------- /test/hanoidb_writer_tests.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(hanoidb_writer_tests). 26 | 27 | -ifdef(TEST). 28 | -ifdef(TEST). 29 | -ifdef(TRIQ). 30 | -include_lib("triq/include/triq.hrl"). 31 | -include_lib("triq/include/triq_statem.hrl"). 32 | -else. 33 | -include_lib("proper/include/proper.hrl"). 34 | -endif. 35 | -include_lib("eunit/include/eunit.hrl"). 36 | -endif. 37 | 38 | -ifdef(PROPER). 39 | -behaviour(proper_statem). 40 | -endif. 41 | -endif. 42 | 43 | -include("include/hanoidb.hrl"). 44 | 45 | -compile(export_all). 46 | 47 | simple_test() -> 48 | 49 | file:delete("testdata"), 50 | {ok, BT} = hanoidb_writer:open("testdata"), 51 | ok = hanoidb_writer:add(BT, <<"A">>, <<"Avalue">>), 52 | ok = hanoidb_writer:add(BT, <<"B">>, <<"Bvalue">>), 53 | ok = hanoidb_writer:close(BT), 54 | 55 | {ok, IN} = hanoidb_reader:open("testdata"), 56 | {ok, <<"Avalue">>} = hanoidb_reader:lookup(IN, <<"A">>), 57 | ok = hanoidb_reader:close(IN), 58 | 59 | ok = file:delete("testdata"). 60 | 61 | 62 | simple1_test() -> 63 | 64 | file:delete("testdata"), 65 | {ok, BT} = hanoidb_writer:open("testdata", [{block_size, 102},{expiry_secs, 0}]), 66 | 67 | Max = 102, 68 | Seq = lists:seq(0, Max), 69 | 70 | {Time1,_} = timer:tc( 71 | fun() -> 72 | lists:foreach( 73 | fun(Int) -> 74 | ok = hanoidb_writer:add(BT, <>, <<"valuevalue/", Int:128>>) 75 | end, 76 | Seq), 77 | ok = hanoidb_writer:close(BT) 78 | end, 79 | []), 80 | 81 | error_logger:info_msg("time to insert: ~p/sec~n", [1000000/(Time1/Max)]), 82 | 83 | {ok, IN} = hanoidb_reader:open("testdata", [{expiry_secs,0}]), 84 | Middle = Max div 2, 85 | io:format("LOOKING UP ~p~n", [<>]), 86 | {ok, <<"valuevalue/", Middle:128>>} = hanoidb_reader:lookup(IN, <>), 87 | 88 | 89 | {Time2,Count} = timer:tc( 90 | fun() -> hanoidb_reader:fold(fun(_Key, <<"valuevalue/", N:128>>, N) -> 91 | N+1 92 | end, 93 | 0, 94 | IN) 95 | end, 96 | []), 97 | 98 | io:format("time to scan: ~p/sec~n", [1000000/(Time2 div Max)]), 99 | 100 | Max = Count-1, 101 | 102 | {Time3,{done,Count2}} = timer:tc( 103 | fun() -> hanoidb_reader:range_fold(fun(_Key, <<"valuevalue/", N:128>>, N) -> 104 | % io:format("[~p]~n", N), 105 | N+1 106 | end, 107 | 0, 108 | IN, 109 | #key_range{ from_key= <<>>, to_key=undefined }) 110 | end, 111 | []), 112 | 113 | 114 | 115 | %error_logger:info_msg("time to range_fold: ~p/sec~n", [1000000/(Time3 div Max)]), 116 | 117 | io:format("count2=~p~n", [Count2]), 118 | 119 | Max = Count2-1, 120 | 121 | ok = hanoidb_reader:close(IN). 122 | 123 | -------------------------------------------------------------------------------- /tools/basho_bench_driver_hanoidb.erl: -------------------------------------------------------------------------------- 1 | %% ---------------------------------------------------------------------------- 2 | %% 3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage 4 | %% 5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 6 | %% http://trifork.com/ info@trifork.com 7 | %% 8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 9 | %% http://basho.com/ info@basho.com 10 | %% 11 | %% This file is provided to you under the Apache License, Version 2.0 (the 12 | %% "License"); you may not use this file except in compliance with the License. 13 | %% You may obtain a copy of the License at 14 | %% 15 | %% http://www.apache.org/licenses/LICENSE-2.0 16 | %% 17 | %% Unless required by applicable law or agreed to in writing, software 18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 20 | %% License for the specific language governing permissions and limitations 21 | %% under the License. 22 | %% 23 | %% ---------------------------------------------------------------------------- 24 | 25 | -module(basho_bench_driver_hanoidb). 26 | 27 | -record(state, { tree, 28 | filename, 29 | flags, 30 | sync_interval, 31 | last_sync }). 32 | 33 | -export([new/1, 34 | run/4]). 35 | 36 | -include("hanoidb.hrl"). 37 | -include_lib("basho_bench/include/basho_bench.hrl"). 38 | 39 | -record(key_range, { from_key = <<>> :: binary(), 40 | from_inclusive = true :: boolean(), 41 | to_key :: binary() | undefined, 42 | to_inclusive = false :: boolean(), 43 | limit :: pos_integer() | undefined }). 44 | 45 | %% ==================================================================== 46 | %% API 47 | %% ==================================================================== 48 | 49 | new(_Id) -> 50 | %% Make sure bitcask is available 51 | case code:which(hanoidb) of 52 | non_existing -> 53 | ?FAIL_MSG("~s requires hanoidb to be available on code path.\n", 54 | [?MODULE]); 55 | _ -> 56 | ok 57 | end, 58 | 59 | %% Get the target directory 60 | Dir = basho_bench_config:get(hanoidb_dir, "."), 61 | Filename = filename:join(Dir, "test.hanoidb"), 62 | Config = basho_bench_config:get(hanoidb_flags, []), 63 | 64 | %% Look for sync interval config 65 | case basho_bench_config:get(hanoidb_sync_interval, infinity) of 66 | Value when is_integer(Value) -> 67 | SyncInterval = Value; 68 | infinity -> 69 | SyncInterval = infinity 70 | end, 71 | 72 | %% Get any bitcask flags 73 | case hanoidb:open(Filename, Config) of 74 | {error, Reason} -> 75 | ?FAIL_MSG("Failed to open hanoidb in ~s: ~p\n", [Filename, Reason]); 76 | {ok, FBTree} -> 77 | {ok, #state { tree = FBTree, 78 | filename = Filename, 79 | sync_interval = SyncInterval, 80 | last_sync = os:timestamp() }} 81 | end. 82 | 83 | run(get, KeyGen, _ValueGen, State) -> 84 | case hanoidb:lookup(State#state.tree, KeyGen()) of 85 | {ok, _Value} -> 86 | {ok, State}; 87 | not_found -> 88 | {ok, State}; 89 | {error, Reason} -> 90 | {error, Reason} 91 | end; 92 | run(put, KeyGen, ValueGen, State) -> 93 | case hanoidb:put(State#state.tree, KeyGen(), ValueGen()) of 94 | ok -> 95 | {ok, State}; 96 | {error, Reason} -> 97 | {error, Reason} 98 | end; 99 | run(delete, KeyGen, _ValueGen, State) -> 100 | case hanoidb:delete(State#state.tree, KeyGen()) of 101 | ok -> 102 | {ok, State}; 103 | {error, Reason} -> 104 | {error, Reason} 105 | end; 106 | 107 | run(fold_100, KeyGen, _ValueGen, State) -> 108 | [From,To] = lists:usort([KeyGen(), KeyGen()]), 109 | case hanoidb:sync_fold_range(State#state.tree, 110 | fun(_Key,_Value,Count) -> 111 | Count+1 112 | end, 113 | 0, 114 | #key_range{ from_key=From, 115 | to_key=To, 116 | limit=100 }) of 117 | Count when Count >= 0; Count =< 100 -> 118 | {ok,State}; 119 | Count -> 120 | {error, {bad_fold_count, Count}} 121 | end. 122 | -------------------------------------------------------------------------------- /tools/visualize-hanoi.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ## ---------------------------------------------------------------------------- 4 | ## 5 | ## hanoi: LSM-trees (Log-Structured Merge Trees) Indexed Storage 6 | ## 7 | ## Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved. 8 | ## http://trifork.com/ info@trifork.com 9 | ## 10 | ## Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved. 11 | ## http://basho.com/ info@basho.com 12 | ## 13 | ## This file is provided to you under the Apache License, Version 2.0 (the 14 | ## "License"); you may not use this file except in compliance with the License. 15 | ## You may obtain a copy of the License at 16 | ## 17 | ## http://www.apache.org/licenses/LICENSE-2.0 18 | ## 19 | ## Unless required by applicable law or agreed to in writing, software 20 | ## distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 21 | ## WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 22 | ## License for the specific language governing permissions and limitations 23 | ## under the License. 24 | ## 25 | ## ---------------------------------------------------------------------------- 26 | 27 | function periodic() { 28 | t=0 29 | while sleep 1 ; do 30 | let "t=t+1" 31 | printf "%5d [" "$t" 32 | 33 | for ((i=0; i<35; i++)) ; do 34 | if ! [ -f "A-$i.data" ] ; then 35 | echo -n " " 36 | elif ! [ -f "B-$i.data" ] ; then 37 | echo -n "-" 38 | elif ! [ -f "C-$i.data" ] ; then 39 | echo -n "#" 40 | elif ! [ -f "X-$i.data" ] ; then 41 | echo -n "=" 42 | else 43 | echo -n "*" 44 | fi 45 | done 46 | echo 47 | done 48 | } 49 | 50 | merge_diff() { 51 | SA=`ls -l A-${ID}.data 2> /dev/null | awk '{print $5}'` 52 | SB=`ls -l B-${ID}.data 2> /dev/null | awk '{print $5}'` 53 | SX=`ls -l X-${ID}.data 2> /dev/null | awk '{print $5}'` 54 | if [ \( -n "$SA" \) -a \( -n "$SB" \) -a \( -n "$SX" \) ]; then 55 | export RES=`expr ${SX}0 / \( $SA + $SB \)` 56 | else 57 | export RES="?" 58 | fi 59 | } 60 | 61 | function dynamic() { 62 | local old s t start now 63 | t=0 64 | start=`date +%s` 65 | while true ; do 66 | s="" 67 | for ((i=8; i<22; i++)) ; do 68 | if [ -f "C-$i.data" ] ; then 69 | s="${s}C" 70 | else 71 | s="$s " 72 | fi 73 | if [ -f "B-$i.data" ] ; then 74 | s="${s}B" 75 | else 76 | s="$s " 77 | fi 78 | if [ -f "A-$i.data" ] ; then 79 | s="${s}A" 80 | else 81 | s="$s " 82 | fi 83 | if [ -f "X-$i.data" ] ; then 84 | export ID="$i" 85 | merge_diff 86 | s="${s}$RES" 87 | elif [ -f "M-$i.data" ] ; then 88 | s="${s}M" 89 | else 90 | s="$s " 91 | fi 92 | s="$s|" 93 | done 94 | 95 | if [[ "$s" != "$old" ]] ; then 96 | let "t=t+1" 97 | now=`date +%s` 98 | let "now=now-start" 99 | free=`df -m . 2> /dev/null | tail -1 | awk '{print $4}'` 100 | used=`du -m 2> /dev/null | awk '{print $1}' ` 101 | printf "%5d %6d [%s\n" "$t" "$now" "$s ${used}MB (${free}MB free)" 102 | old="$s" 103 | else 104 | # Sleep a little bit: 105 | sleep 1 106 | fi 107 | done 108 | } 109 | 110 | dynamic 111 | --------------------------------------------------------------------------------