├── .gitignore
├── .travis.yml
├── DESIGN.md
├── LICENSE
├── Makefile
├── README.md
├── TODO
├── doc
│   ├── 10.1.1.44.2782.pdf
│   ├── compare-innodb-vs-hanoi.png
│   ├── design_diagrams.graffle
│   ├── design_diagrams.pdf
│   └── sample_result_mba_20min.png
├── include
│   ├── hanoidb.hrl
│   └── plain_rpc.hrl
├── rebar.config
├── src
│   ├── gb_trees_ext.erl
│   ├── hanoidb.app.src
│   ├── hanoidb.erl
│   ├── hanoidb.hrl
│   ├── hanoidb_app.erl
│   ├── hanoidb_bloom.erl
│   ├── hanoidb_dense_bitmap.erl
│   ├── hanoidb_fold_worker.erl
│   ├── hanoidb_level.erl
│   ├── hanoidb_merger.erl
│   ├── hanoidb_nursery.erl
│   ├── hanoidb_reader.erl
│   ├── hanoidb_sparse_bitmap.erl
│   ├── hanoidb_sup.erl
│   ├── hanoidb_util.erl
│   ├── hanoidb_writer.erl
│   ├── plain_rpc.erl
│   └── vbisect.erl
├── test
│   ├── hanoidb_drv.erl
│   ├── hanoidb_merger_tests.erl
│   ├── hanoidb_tests.erl
│   └── hanoidb_writer_tests.erl
└── tools
    ├── basho_bench_driver_hanoidb.erl
    └── visualize-hanoi.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | ebin
2 | deps
3 | *~
4 | .eunit
5 | .project
6 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: erlang
2 | otp_release:
3 | - R16B03
4 | - R15B03
5 | - 17.0
6 | - 18.0
7 |
8 |
9 |
--------------------------------------------------------------------------------
/DESIGN.md:
--------------------------------------------------------------------------------
1 | # Hanoi's Design
2 |
3 | ### Basics
4 | If there are N records, there are log2(N) levels (each being a plain B-tree in a file named "A-*level*.data"). The file `A-0.data` has 1 record, `A-1.data` has 2 records, `A-2.data` has 4 records, and so on: `A-n.data` has 2^*n* records.
5 |
6 | In "stable state", each level file is either full (there) or empty (not there); so if there are e.g. 20 records stored, then there are only data in filed `A-2.data` (4 records) and `A-4.data` (16 records).
7 |
8 | OK, I've told you a lie. In practice, it is not practical to create a new file for each insert (injection at level #0), so we allow you to define the "top level" to be a number higher than #0; currently defaulting to #5 (32 records). That means that you take the amortization "hit" once for every 32 inserts.
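
A minimal sketch of this sizing rule; it mirrors the `?BTREE_SIZE(Level)` macro in `src/hanoidb.hrl`:

```erlang
%% Level N holds 2^N records, regardless of how large each record is.
btree_size(Level) -> 1 bsl Level.
%% btree_size(5) =:= 32, so with top level #5 the amortization
%% "hit" is taken once per 32 inserts.
```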
9 |
10 | ### Lookup
11 | Lookup is quite simple: starting at `A-0.data`, the sought-for Key is searched for in the B-tree there. If nothing is found, the search continues to the next data file. So if there are *N* levels, then *N* disk-based B-tree lookups are performed. Each lookup is "guarded" by a bloom filter to improve the likelihood that disk-based searches are only done when likely to succeed.
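
A sketch of that cascade, assuming hypothetical `bloom_contains/2` and `btree_lookup/2` helpers (the real code lives in `src/hanoidb_level.erl` and `src/hanoidb_reader.erl`):

```erlang
%% Search level files in order, newest (lowest-numbered) first.
lookup(_Key, []) ->
    not_found;
lookup(Key, [Level | Rest]) ->
    case bloom_contains(Level, Key) of
        false -> lookup(Key, Rest);              % skip the disk read entirely
        true  ->
            case btree_lookup(Level, Key) of     % disk-based B-tree search
                {ok, Value} -> {ok, Value};
                not_found   -> lookup(Key, Rest) % bloom false positive
            end
    end.
```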
12 |
13 | ### Insertion
14 | Insertion works by a mechanism known as B-tree injection. Insertion always starts by constructing a fresh B-tree with 1 element in it and "injecting" that B-tree into level #0. So you always inject a B-tree of the same size as the level you're injecting it into.
15 |
16 | - If the level being injected into is empty (there is no A-*level*.data file), then the injected B-tree becomes the contents of that level (we just rename the file).
17 | - Otherwise,
18 | - The injected tree file is renamed to B-*level*.data;
19 | - The files A-*level*.data and B-*level*.data are merged into a new temporary B-tree (of roughly double size), X-*level*.data.
20 | - The outcome of the merge is then injected into the next level.
21 |
22 | While merging, lookups at level *n* first consult the B-*n*.data file, then the A-*n*.data file. At a given level, there can only be one merge operation active. A sketch of the injection step is shown below.
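
The sketch, with hypothetical `merge/3` and `level_file/2` helpers (the real state machine lives in `src/hanoidb_level.erl`):

```erlang
inject(Level, TreeFile) ->
    A = level_file("A", Level),
    case filelib:is_file(A) of
        false ->
            %% empty level: the injected tree simply becomes its contents
            ok = file:rename(TreeFile, A);
        true ->
            %% occupied level: rename to B, merge A+B into X, inject X upwards
            %% (deletion of A and B after the merge is omitted here)
            B = level_file("B", Level),
            X = level_file("X", Level),
            ok = file:rename(TreeFile, B),
            ok = merge(A, B, X),
            inject(Level + 1, X)
    end.

level_file(Prefix, Level) ->
    Prefix ++ "-" ++ integer_to_list(Level) ++ ".data".
```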
23 |
24 | ### Overwrite and Delete
25 | Overwrite is done by simply doing a new insertion. Since search always starts from the top (level #0 ... level #*n*), newer values will be at a lower-numbered level, and thus be found before older values. When merging, values stored in the injected tree (which come from a lower-numbered level) take priority over those in the existing tree.
26 |
27 | Deletes are handled the same way: they too are done by inserting a tombstone (a special value outside the domain of values). When a tombstone is merged at the currently highest-numbered level, it is discarded. So tombstones have to bubble "down" to the highest-numbered level before they can be truly evicted.
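
A sketch of this precedence rule at merge time; `merge_same_key/3` is illustrative, while `deleted` is the actual `?TOMBSTONE` value from `src/hanoidb.hrl`:

```erlang
%% Given two entries with the same key, the injected (newer) value wins;
%% a tombstone is only dropped when merging at the highest-numbered level.
merge_same_key({Key, NewValue}, {_Key, _OldValue}, IsLastLevel) ->
    case {NewValue, IsLastLevel} of
        {deleted, true} -> [];               % evict the tombstone for good
        _               -> [{Key, NewValue}] % keep the newest value
    end.
```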
28 |
29 |
30 | ## Merge Logic
31 | The really clever thing about this storage mechanism is that merging is guaranteed to be able to "keep up" with insertion. Bitcask, for instance, has a similar merging phase, but it is separated from insertion. This means that there can suddenly be a lot of catching up to do. The flip side is that you can then decide to do all merging at off-peak hours, but it is yet another thing that needs to be configured.
32 |
33 | With LSM B-Trees, back-pressure is provided by the injection mechanism, which only returns when an injection is complete. Thus, every 2nd insert needs to wait for level #0 to finish the required merging, which - assuming merging has linear I/O complexity - is enough to guarantee that the merge mechanism can keep up at higher-numbered levels.
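
A sketch of how the merge-work budget is computed; this mirrors `WorkPerIter` in `open_levels/2` of `src/hanoidb.erl`:

```erlang
%% Enough merge work per iteration to cover every level once, so the
%% incremental merges funded by injections cannot fall behind.
work_per_iteration(MinLevel, MaxLevel) ->
    (MaxLevel - MinLevel + 1) * (1 bsl MinLevel).
```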
34 |
35 | A further trouble is that merging does in fact not have completely linear I/O complexity, because reading from a small file that was recently written is faster than reading from a file that was written a long time ago (because of OS-level caching); thus doing a merge at level #*N+1* is sometimes more than twice as slow as doing a merge at level #*N*. Because of this, sustained insert pressure may produce a situation where the system blocks while merging, though it does require an extremely high rate of inserts. We're considering ways to alleviate this.
36 |
37 | Merging can proceed concurrently at each level (in preparation for an injection to the next level), which lets you utilize available multi-core capacity for merging.
38 |
39 |
40 | ```
41 | ABC are data files at a given level
42 | A oldest
43 | C newest
44 | X is being merged into from [A+B]
45 |
46 | 270 76 [AB X|ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
47 | 271 76 [ABCX|ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
48 | 272 77 [A |AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
49 | 273 77 [AB X|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
50 | 274 77 [ABCX|AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
51 | 275 78 [A |ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
52 | 276 78 [AB X|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
53 | 277 79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A | | | | | | | | | |
54 | 278 79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX| C |AB | | | | | | | | | |
55 | 279 79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX| C |AB X| | | | | | | | | |
56 | 280 79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|ABCX|A |AB X| | | | | | | | | |
57 | 281 79 [ABCX|ABCX|ABCX|ABCX|ABCX|ABCX| C |AB |AB X| | | | | | | | | |
58 | 282 80 [ABCX|ABCX|ABCX| BC |AB |AB |AB X|AB X|AB X| | | | | | | | | |
59 | 283 80 [ABCX|ABCX|ABCX| C |AB X|AB |AB X|AB X|AB X| | | | | | | | | |
60 | 284 80 [A |AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X| | | | | | | | | |
61 | 285 80 [AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X| | | | | | | | | |
62 | 286 80 [ABCX|AB X|AB X|AB X|AB X|AB X|AB X|AB X|AB X| | | | | | | | | |
63 | 287 80 [A |ABCX|AB X|AB X|AB X|AB X|AB X|AB X|AB X| | | | | | | | | |
64 | ```
65 |
66 |
67 | When a merge finishes, X is moved to the next level [it becomes the first open slot, in order of A,B,C], and the files that were merged (A and B in this case) are deleted. If there is a C, then that becomes the A of the next round at that level.
68 | When X is closed and clean, it is actually renamed to M in the interim, so that if there is a crash after a merge finishes, but before it is accepted at the next level, the merge work is not lost; i.e., an M file is also clean/closed properly. Thus, if there are M files around, it means that the incremental merge was not fast enough.
69 |
70 | A, B and C files hold 2^level KVs each, regardless of the size of those KVs. X and M files hold approximately 2^(level+1), since merged tombstones or repeated PUTs may of course reduce the count.
71 |
72 | ### File Descriptors
73 | Hanoi needs a lot of file descriptors, currently 6 * ⌈log2(N) - TOP_LEVEL⌉, with a nursery of size 2^TOP_LEVEL and N Key/Value pairs in the store. Thus, storing 1.000.000 KVs needs 72 file descriptors, storing 1.000.000.000 records needs 132 file descriptors, and 1.000.000.000.000 records needs 192.
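
A sketch evaluating that formula (note that the figures quoted above work out for TOP_LEVEL = 8):

```erlang
%% 6 * ceil(log2(N) - TOP_LEVEL) file descriptors for N stored pairs.
fd_count(N, TopLevel) ->
    6 * ceil_int(math:log(N) / math:log(2) - TopLevel).

ceil_int(X) when X == trunc(X) -> trunc(X);
ceil_int(X)                    -> trunc(X) + 1.
%% fd_count(1000000, 8) =:= 72,  fd_count(1000000000, 8) =:= 132.
```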
74 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Apache License
3 | Version 2.0, January 2004
4 | http://www.apache.org/licenses/
5 |
6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7 |
8 | 1. Definitions.
9 |
10 | "License" shall mean the terms and conditions for use, reproduction,
11 | and distribution as defined by Sections 1 through 9 of this document.
12 |
13 | "Licensor" shall mean the copyright owner or entity authorized by
14 | the copyright owner that is granting the License.
15 |
16 | "Legal Entity" shall mean the union of the acting entity and all
17 | other entities that control, are controlled by, or are under common
18 | control with that entity. For the purposes of this definition,
19 | "control" means (i) the power, direct or indirect, to cause the
20 | direction or management of such entity, whether by contract or
21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
22 | outstanding shares, or (iii) beneficial ownership of such entity.
23 |
24 | "You" (or "Your") shall mean an individual or Legal Entity
25 | exercising permissions granted by this License.
26 |
27 | "Source" form shall mean the preferred form for making modifications,
28 | including but not limited to software source code, documentation
29 | source, and configuration files.
30 |
31 | "Object" form shall mean any form resulting from mechanical
32 | transformation or translation of a Source form, including but
33 | not limited to compiled object code, generated documentation,
34 | and conversions to other media types.
35 |
36 | "Work" shall mean the work of authorship, whether in Source or
37 | Object form, made available under the License, as indicated by a
38 | copyright notice that is included in or attached to the work
39 | (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object
42 | form, that is based on (or derived from) the Work and for which the
43 | editorial revisions, annotations, elaborations, or other modifications
44 | represent, as a whole, an original work of authorship. For the purposes
45 | of this License, Derivative Works shall not include works that remain
46 | separable from, or merely link (or bind by name) to the interfaces of,
47 | the Work and Derivative Works thereof.
48 |
49 | "Contribution" shall mean any work of authorship, including
50 | the original version of the Work and any modifications or additions
51 | to that Work or Derivative Works thereof, that is intentionally
52 | submitted to Licensor for inclusion in the Work by the copyright owner
53 | or by an individual or Legal Entity authorized to submit on behalf of
54 | the copyright owner. For the purposes of this definition, "submitted"
55 | means any form of electronic, verbal, or written communication sent
56 | to the Licensor or its representatives, including but not limited to
57 | communication on electronic mailing lists, source code control systems,
58 | and issue tracking systems that are managed by, or on behalf of, the
59 | Licensor for the purpose of discussing and improving the Work, but
60 | excluding communication that is conspicuously marked or otherwise
61 | designated in writing by the copyright owner as "Not a Contribution."
62 |
63 | "Contributor" shall mean Licensor and any individual or Legal Entity
64 | on behalf of whom a Contribution has been received by Licensor and
65 | subsequently incorporated within the Work.
66 |
67 | 2. Grant of Copyright License. Subject to the terms and conditions of
68 | this License, each Contributor hereby grants to You a perpetual,
69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70 | copyright license to reproduce, prepare Derivative Works of,
71 | publicly display, publicly perform, sublicense, and distribute the
72 | Work and such Derivative Works in Source or Object form.
73 |
74 | 3. Grant of Patent License. Subject to the terms and conditions of
75 | this License, each Contributor hereby grants to You a perpetual,
76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77 | (except as stated in this section) patent license to make, have made,
78 | use, offer to sell, sell, import, and otherwise transfer the Work,
79 | where such license applies only to those patent claims licensable
80 | by such Contributor that are necessarily infringed by their
81 | Contribution(s) alone or by combination of their Contribution(s)
82 | with the Work to which such Contribution(s) was submitted. If You
83 | institute patent litigation against any entity (including a
84 | cross-claim or counterclaim in a lawsuit) alleging that the Work
85 | or a Contribution incorporated within the Work constitutes direct
86 | or contributory patent infringement, then any patent licenses
87 | granted to You under this License for that Work shall terminate
88 | as of the date such litigation is filed.
89 |
90 | 4. Redistribution. You may reproduce and distribute copies of the
91 | Work or Derivative Works thereof in any medium, with or without
92 | modifications, and in Source or Object form, provided that You
93 | meet the following conditions:
94 |
95 | (a) You must give any other recipients of the Work or
96 | Derivative Works a copy of this License; and
97 |
98 | (b) You must cause any modified files to carry prominent notices
99 | stating that You changed the files; and
100 |
101 | (c) You must retain, in the Source form of any Derivative Works
102 | that You distribute, all copyright, patent, trademark, and
103 | attribution notices from the Source form of the Work,
104 | excluding those notices that do not pertain to any part of
105 | the Derivative Works; and
106 |
107 | (d) If the Work includes a "NOTICE" text file as part of its
108 | distribution, then any Derivative Works that You distribute must
109 | include a readable copy of the attribution notices contained
110 | within such NOTICE file, excluding those notices that do not
111 | pertain to any part of the Derivative Works, in at least one
112 | of the following places: within a NOTICE text file distributed
113 | as part of the Derivative Works; within the Source form or
114 | documentation, if provided along with the Derivative Works; or,
115 | within a display generated by the Derivative Works, if and
116 | wherever such third-party notices normally appear. The contents
117 | of the NOTICE file are for informational purposes only and
118 | do not modify the License. You may add Your own attribution
119 | notices within Derivative Works that You distribute, alongside
120 | or as an addendum to the NOTICE text from the Work, provided
121 | that such additional attribution notices cannot be construed
122 | as modifying the License.
123 |
124 | You may add Your own copyright statement to Your modifications and
125 | may provide additional or different license terms and conditions
126 | for use, reproduction, or distribution of Your modifications, or
127 | for any such Derivative Works as a whole, provided Your use,
128 | reproduction, and distribution of the Work otherwise complies with
129 | the conditions stated in this License.
130 |
131 | 5. Submission of Contributions. Unless You explicitly state otherwise,
132 | any Contribution intentionally submitted for inclusion in the Work
133 | by You to the Licensor shall be under the terms and conditions of
134 | this License, without any additional terms or conditions.
135 | Notwithstanding the above, nothing herein shall supersede or modify
136 | the terms of any separate license agreement you may have executed
137 | with Licensor regarding such Contributions.
138 |
139 | 6. Trademarks. This License does not grant permission to use the trade
140 | names, trademarks, service marks, or product names of the Licensor,
141 | except as required for reasonable and customary use in describing the
142 | origin of the Work and reproducing the content of the NOTICE file.
143 |
144 | 7. Disclaimer of Warranty. Unless required by applicable law or
145 | agreed to in writing, Licensor provides the Work (and each
146 | Contributor provides its Contributions) on an "AS IS" BASIS,
147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 | implied, including, without limitation, any warranties or conditions
149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 | PARTICULAR PURPOSE. You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 |
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 |
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 |
177 | END OF TERMS AND CONDITIONS
178 |
179 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | REBAR= rebar
2 | DIALYZER= dialyzer
3 |
4 |
5 | .PHONY: plt analyze all deps compile get-deps clean
6 |
7 | all: get-deps compile
8 |
9 | deps: get-deps compile
10 |
11 | get-deps:
12 | @$(REBAR) get-deps
13 |
14 | compile:
15 | @$(REBAR) compile
16 |
17 | clean:
18 | @$(REBAR) clean
19 |
20 | test: eunit
21 |
22 | eunit: compile clean-test-btrees
23 | @$(REBAR) eunit skip_deps=true
24 |
25 | eunit_console:
26 | erl -pa .eunit deps/*/ebin
27 |
28 | clean-test-btrees:
29 | rm -fr .eunit/Btree_* .eunit/simple
30 |
31 | plt: compile
32 | $(DIALYZER) --build_plt --output_plt .hanoi.plt \
33 | 	  -pa deps/snappy/ebin \
35 | -pa deps/lz4/ebin \
36 | -pa deps/ebloom/ebin \
37 | -pa deps/plain_fsm/ebin \
38 | deps/plain_fsm/ebin \
39 | --apps erts kernel stdlib ebloom lz4 snappy
40 |
41 | analyze: compile
42 | $(DIALYZER) --plt .hanoi.plt \
43 | -pa deps/snappy/ebin \
44 | -pa deps/lz4/ebin \
45 | -pa deps/ebloom/ebin \
46 | -pa deps/plain_fsm/ebin \
47 | ebin
48 |
49 | analyze-nospec: compile
50 | $(DIALYZER) --plt .hanoi.plt \
51 | -pa deps/plain_fsm/ebin \
52 | --no_spec \
53 | ebin
54 |
55 | repl:
56 | erl -pz deps/*/ebin -pa ebin
57 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # HanoiDB Indexed Key/Value Storage
2 |
3 | [Build Status](https://travis-ci.org/krestenkrab/hanoidb)
4 |
5 | HanoiDB implements an indexed, key/value storage engine. The primary index is
6 | a log-structured merge tree (LSM-BTree) implemented using "doubling sizes"
7 | persistent ordered sets of key/value pairs, similar in some regards to
8 | [LevelDB](http://code.google.com/p/leveldb/). HanoiDB includes a visualizer
9 | which, when used to watch a live database, resembles the "Towers of Hanoi"
10 | puzzle game, which inspired the name of this database.
11 |
12 | ## Features
13 | - Insert, Delete and Read all have worst case *O*(log2(*N*)) latency.
14 | - Incremental space reclamation: The cost of evicting stale key/values
15 | is amortized into insertion
16 | - you don't need a separate eviction thread to keep memory use low
17 | - you don't need to schedule merges to happen at off-peak hours
18 | - Operations-friendly "append-only" storage
19 | - allows you to back up a live system
20 | - crash-recovery is very fast and the logic is straightforward
21 | - all data subject to CRC32 checksums
22 | - data can be compressed on disk to save space
23 | - Efficient range queries
24 | - Riak secondary indexing
25 | - Fast key and bucket listing
26 | - Uses bloom filters to avoid unnecessary lookups on disk
27 | - Time-based expiry of data
28 | - configure the database to expire data older than n seconds
29 | - specify a lifetime in seconds for any particular key/value pair
30 | - Efficient resource utilization
31 | - doesn't store all keys in memory
32 | - uses a modest number of file descriptors proportional to the number of levels
33 | - I/O is generally balanced between random and sequential
34 | - low CPU overhead
35 | - ~2000 lines of pure Erlang code in src/*.erl
36 |
37 | HanoiDB is developed by Trifork, a Riak expert solutions provider, and Basho
38 | Technologies, makers of Riak. HanoiDB can be used in Riak via the
39 | `riak_kv_tower_backend` repository.
40 |
41 | ### Configuration options
42 |
43 | Put these values in your `app.config` in the `hanoidb` section
44 |
45 | ```erlang
46 | {hanoidb, [
47 | {data_root, "./data/hanoidb"},
48 |
49 | %% Enable/disable on-disk compression.
50 | %%
51 | {compress, none | gzip},
52 |
53 | %% Expire (automatically delete) entries after N seconds.
54 | %% When this value is 0 (zero), entries never expire.
55 | %%
56 | {expiry_secs, 0},
57 |
58 | %% Sync strategy `none' only syncs every time the
59 | %% nursery runs full, which is currently hard coded
60 | %% to be every 256 inserts or deletes.
61 | %%
62 | %% Sync strategy `sync' will sync the nursery log
63 | %% for every insert or delete operation.
64 | %%
65 | {sync_strategy, none | sync | {seconds, N}},
66 |
67 | %% The page size is a minimum page size; when a page fills
68 | %% up beyond this size, it is written to disk.
69 | %% Compression applies to such units of page size.
70 | %%
71 | {page_size, 8192},
72 |
73 | %% Read/write buffer sizes apply to merge processes.
74 | %% A merge process has two read buffers and a write
75 | %% buffer, and there is a merge process *per level* in
76 | %% the database.
77 | %%
78 | {write_buffer_size, 524288}, % 512kB
79 | {read_buffer_size, 524288}, % 512kB
80 |
81 | %% The merge strategy is one of `fast' or `predictable'.
82 | %% Both have the same log2(N) worst case, but `fast' is
83 | %% sometimes faster, at the price of latency fluctuations.
84 | %%
85 | {merge_strategy, fast | predictable},
86 |
87 | %% "Level0" files has 2^N KVs in it, defaulting to 1024.
88 | %% If the database is to contain very small KVs, this is
89 | %% likely too small, and will result in many unnecessary
90 | %% file operations. (Subsequent levels double in size).
91 | {top_level, 10} % 1024 Key/Values
92 | ]},
93 | ```
94 |
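
### Basic usage

A minimal sketch against the public API in `src/hanoidb.erl` (the directory name is just an example):

```erlang
{ok, Db} = hanoidb:open("./data/test.hanoidb"),
ok = hanoidb:put(Db, <<"key">>, <<"value">>),
{ok, <<"value">>} = hanoidb:get(Db, <<"key">>),
ok = hanoidb:transact(Db, [{put, <<"k2">>, <<"v2">>},
                           {delete, <<"key">>}]),
not_found = hanoidb:get(Db, <<"key">>),
ok = hanoidb:close(Db).
```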
95 |
96 | ### Contributors
97 |
98 | - Kresten Krab Thorup @krestenkrab
99 | - Greg Burd @gburd
100 | - Jesper Louis Andersen @jlouis
101 | - Steve Vinoski @vinoski
102 | - Erik Søe Sørensen @eriksoe
103 | - Yamamoto Takashi @yamt
104 | - Joseph Wayne Norton @norton
105 |
--------------------------------------------------------------------------------
/TODO:
--------------------------------------------------------------------------------
1 | * Phase 1: Minimum viable product (in order of priority)
2 | * lager; check for uses of lager:error/2
3 | * configurable TOP_LEVEL size
4 | * test new snappy compression support
5 | * status and statistics
6 | * for each level {#merges, {merge-time-min, max, average}}
7 | * add @doc strings and -spec's
8 | * check to make sure every error returns with a reason {error, Reason}
9 |
10 |
11 | * Phase 2: Production Ready
12 | * dual-nursery
13 | * cache for read-path
14 | * {cache, bytes(), name} share max(bytes) cache named 'name' via ets
15 | * snapshot entire database (fresh directory w/ hard links to all files)
16 | * persist merge progress (to speed up re-opening a HanoiDB)
17 | * support for future file format changes
18 | * Define a standard struct which is the metadata added at the end of the
19 | file, e.g. [btree-nodes] [meta-data] [offset of meta-data]. This is written
20 | in hanoi_writer:flush_nodes, and read in hanoi_reader:open2.
21 |
22 | * Phase 3: Wish List
23 | * add truncate/1 - quickly truncates a database to 0 items
24 | * count/1 - return number of items currently in tree
25 | * adaptive nursery sizing
26 | * backpressure on fold operations
27 | - The "sync_fold" creates a snapshot (hard link to btree files), which
28 | provides consistent behavior but may use a lot of disk space if there is
29 | a lot of insertion going on.
30 | - The "async_fold" folds a limited number, and remembers the last key
31 | serviced, then picks up from there again. So you could see intermittent
32 | puts in a subsequent batch of results.
33 | * add block-level encryption support
34 |
35 |
36 | ## NOTES:
37 |
38 | 1: make the "first level" have more than 2^5 entries (controlled by the constant TOP_LEVEL in hanoi.hrl); this means a new set of files is opened/closed/merged for every 32 inserts/updates/deletes. Setting this higher will just make the nursery correspondingly larger, which should be absolutely fine.
39 |
40 | 2: Right now, the streaming btree writer emits a btree page based on the number of elements. This could be changed to be based on the size of the node (say, some block-size boundary) and then add padding at the end so that each node read becomes a clean block transfer. Right now, we're probably issuing way too many reads.
41 |
42 | 3: Also, there is no caching of read nodes. So every time a btree node is visited, it is read from disk and binary_to_term'ed again. We need a caching system for that to work well (https://github.com/cliffmoon/cherly is difficult to build; it needs to be rebar-ified).
43 |
44 | 4: Also, the format for btree nodes could probably be optimized. Right now it's just binary_to_term of a key/value list, as far as I remember. Perhaps we don't have to deserialize the entire thing.
45 |
46 | 5: It might also be good to employ a scheduler (github.com/esl/jobs) for issuing merges, because I think it can be a problem for the OS if there are too many merges going on at the same time.
47 |
--------------------------------------------------------------------------------
/doc/10.1.1.44.2782.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/10.1.1.44.2782.pdf
--------------------------------------------------------------------------------
/doc/compare-innodb-vs-hanoi.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/compare-innodb-vs-hanoi.png
--------------------------------------------------------------------------------
/doc/design_diagrams.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/design_diagrams.pdf
--------------------------------------------------------------------------------
/doc/sample_result_mba_20min.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/krestenkrab/hanoidb/68333fa51a6fdf27834fc84f42d4421f9627e3b7/doc/sample_result_mba_20min.png
--------------------------------------------------------------------------------
/include/hanoidb.hrl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 |
26 | %%
27 | %% When doing "async fold", it does "sync fold" in chunks
28 | %% of this many K/V entries.
29 | %%
30 | -define(BTREE_ASYNC_CHUNK_SIZE, 100).
31 |
32 | %%
33 | %% The key_range structure is a bit asymmetric; here is why:
34 | %%
35 | %% from_key=<<>> is "less than" any other key, hence we don't need to
36 | %% handle from_key=undefined to support an open-ended start of the
37 | %% interval. For to_key, we cannot (statically) construct a key
38 | %% which is > any possible key, hence we need to allow to_key=undefined
39 | %% as a token of an interval that has no upper limit.
40 | %%
41 | -record(key_range, { from_key = <<>> :: binary(),
42 | from_inclusive = true :: boolean(),
43 | to_key :: binary() | undefined,
44 | to_inclusive = false :: boolean(),
45 | limit :: pos_integer() | undefined }).
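
%% Illustrative examples (not part of the original header):
%%   #key_range{from_key = <<"a">>, to_key = <<"b">>}
%%     covers keys K with <<"a">> =< K < <<"b">>
%%     (from_inclusive defaults to true, to_inclusive to false).
%%   #key_range{}
%%     covers every key: from_key = <<>> sorts below any key, and
%%     to_key = undefined leaves the interval open-ended.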
46 |
--------------------------------------------------------------------------------
/include/plain_rpc.hrl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% plain_rpc: RPC module to accompany plain_fsm
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% This file is provided to you under the Apache License, Version 2.0 (the
9 | %% "License"); you may not use this file except in compliance with the License.
10 | %% You may obtain a copy of the License at
11 | %%
12 | %% http://www.apache.org/licenses/LICENSE-2.0
13 | %%
14 | %% Unless required by applicable law or agreed to in writing, software
15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
17 | %% License for the specific language governing permissions and limitations
18 | %% under the License.
19 | %%
20 | %% ----------------------------------------------------------------------------
21 |
22 | %%
23 | %% This module really belongs in the plain_fsm distro.
24 | %%
25 |
26 | -define(CALL(From,Msg), {'$call', From, Msg}).
27 | -define(REPLY(Ref,Msg), {'$reply', Ref, Msg}).
28 | -define(CAST(From,Msg), {'$cast', From, Msg}).
29 |
30 |
--------------------------------------------------------------------------------
/rebar.config:
--------------------------------------------------------------------------------
1 | {cover_enabled, true}.
2 |
3 | {clean_files, ["*.eunit", "ebin/*.beam"]}.
4 | {eunit_opts, [verbose, {report, {eunit_surefire, [{dir, "."}]}}]}.
5 |
6 | {erl_opts, [%{d,'DEBUG',true},
7 | {d,'USE_EBLOOM',true},
8 | {parse_transform, lager_transform},
9 | fail_on_warning,
10 | warn_unused_vars,
11 | warn_export_all,
12 | warn_shadow_vars,
13 | warn_unused_import,
14 | warn_unused_function,
15 | warn_bif_clash,
16 | warn_unused_record,
17 | warn_deprecated_function,
18 | warn_obsolete_guard,
19 | warn_export_vars,
20 | warn_exported_vars,
21 | warn_untyped_record,
22 | % warn_missing_spec,
23 | % strict_validation,
24 | {platform_define, "^R|17", pre18},
25 | debug_info]}.
26 |
27 | {xref_checks, [undefined_function_calls]}.
28 |
29 | {deps, [ {sext, ".*", {git, "git://github.com/uwiger/sext", {branch, "master"}}}
30 | , {lager, ".*", {git, "git://github.com/basho/lager", {branch, "master"}}}
31 | , {snappy, "1.*", {git, "git://github.com/fdmanana/snappy-erlang-nif.git", {branch, "master"}}}
32 | , {plain_fsm, "1.*", {git, "git://github.com/gburd/plain_fsm", {branch, "master"}}}
33 | % , {basho_bench, ".*", {git, "git://github.com/basho/basho_bench", {branch, "master"}}}
34 | , {ebloom, ".*", {git, "git://github.com/basho/ebloom", {branch, "master"}}}
35 | , {triq, ".*", {git, "git://github.com/krestenkrab/triq", {branch, "master"}}}
36 | , {lz4, ".*", {git, "git://github.com/krestenkrab/erlang-lz4.git", {branch, "master"}}}
37 | % , {edown, "0.3.*", {git, "git://github.com/uwiger/edown.git", {branch, "master"}}}
38 | % , {asciiedoc, "0.1.*", {git, "git://github.com/norton/asciiedoc.git", {branch, "master"}}}
39 | % , {triq, ".*", {git, "git://github.com/krestenkrab/triq.git", {branch, "master"}}}
40 | % , {proper, ".*", {git, "git://github.com/manopapad/proper.git", {branch, "master"}}}
41 | ]}.
42 |
--------------------------------------------------------------------------------
/src/gb_trees_ext.erl:
--------------------------------------------------------------------------------
1 |
2 | -module(gb_trees_ext).
3 | -extends(gb_trees).
4 | -export([fold/3]).
5 |
6 | % author: http://erlang.2086793.n4.nabble.com/gb-trees-fold-td2228614.html
7 |
8 | -spec fold(fun((term(), term(), term()) -> term()), term(), gb_trees:tree()) -> term().
9 | fold(F, A, {_, T})
10 | when is_function(F, 3) ->
11 | fold_1(F, A, T).
12 |
13 | fold_1(F, Acc0, {Key, Value, Small, Big}) ->
14 | Acc1 = fold_1(F, Acc0, Small),
15 | Acc = F(Key, Value, Acc1),
16 | fold_1(F, Acc, Big);
17 | fold_1(_, Acc, _) ->
18 | Acc.
19 |
--------------------------------------------------------------------------------
/src/hanoidb.app.src:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | {application, hanoidb,
26 | [
27 | {description, ""},
28 | {vsn, "1.3.0"},
29 | {registered, []},
30 | {applications, [
31 | kernel,
32 | stdlib,
33 | plain_fsm
34 | ]},
35 | {mod, {hanoidb_app, []}},
36 | {env, []}
37 | ]}.
38 |
--------------------------------------------------------------------------------
/src/hanoidb.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb).
26 | -author('Kresten Krab Thorup ').
27 |
28 |
29 | -behavior(gen_server).
30 |
31 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
32 | terminate/2, code_change/3]).
33 |
34 | -export([open/1, open/2, open/3, open_link/1, open_link/2, open_link/3,
35 | transact/2, close/1, get/2, lookup/2, delete/2, put/3, put/4,
36 | fold/3, fold_range/4, destroy/1]).
37 |
38 | -export([get_opt/2, get_opt/3]).
39 |
40 | -include("hanoidb.hrl").
41 | -include_lib("kernel/include/file.hrl").
42 | -include_lib("include/hanoidb.hrl").
43 | -include_lib("include/plain_rpc.hrl").
44 |
45 | -record(state, { top :: pid(),
46 | nursery :: #nursery{},
47 | dir :: string(),
48 | opt :: term(),
49 | max_level :: pos_integer()}).
50 |
51 | %% 0 means never expire
52 | -define(DEFAULT_EXPIRY_SECS, 0).
53 |
54 | -ifdef(DEBUG).
55 | -define(log(Fmt,Args),io:format(user,Fmt,Args)).
56 | -else.
57 | -define(log(Fmt,Args),ok).
58 | -endif.
59 |
60 |
61 | %% PUBLIC API
62 |
63 | -type hanoidb() :: pid().
64 | -type key_range() :: #key_range{}.
65 | -type config_option() :: {compress, none | gzip | snappy | lz4}
66 | | {page_size, pos_integer()}
67 | | {read_buffer_size, pos_integer()}
68 | | {write_buffer_size, pos_integer()}
69 | | {merge_strategy, fast | predictable }
70 | | {sync_strategy, none | sync | {seconds, pos_integer()}}
71 | | {expiry_secs, non_neg_integer()}
72 | | {spawn_opt, list()}
73 | | {top_level, pos_integer()}
74 | .
75 |
76 | %% @doc
77 | %% Create or open a hanoidb store. Argument `Dir' names a
78 | %% directory in which to keep the data files. By convention, we
79 | %% name hanoidb data directories with extension ".hanoidb".
80 | -spec open(Dir::string()) -> {ok, hanoidb()} | ignore | {error, term()}.
81 | open(Dir) ->
82 | open(Dir, []).
83 |
84 | %% @doc Create or open a hanoidb store.
85 | -spec open(Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}.
86 | open(Dir, Opts) ->
87 | ok = start_app(),
88 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []),
89 | gen_server:start(?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]).
90 |
91 | %% @doc Create or open a hanoidb store with a registered name.
92 | -spec open(Name::{local, Name::atom()} | {global, GlobalName::term()} | {via, ViaName::term()},
93 | Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}.
94 | open(Name, Dir, Opts) ->
95 | ok = start_app(),
96 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []),
97 | gen_server:start(Name, ?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]).
98 |
99 | %% @doc
100 | %% Create or open a hanoidb store as part of a supervision tree.
101 | %% Argument `Dir' names a directory in which to keep the data files.
102 | %% By convention, we name hanoidb data directories with extension
103 | %% ".hanoidb".
104 | -spec open_link(Dir::string()) -> {ok, hanoidb()} | ignore | {error, term()}.
105 | open_link(Dir) ->
106 | open_link(Dir, []).
107 |
108 | %% @doc Create or open a hanoidb store as part of a supervision tree.
109 | -spec open_link(Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}.
110 | open_link(Dir, Opts) ->
111 | ok = start_app(),
112 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []),
113 | gen_server:start_link(?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]).
114 |
115 | %% @doc Create or open a hanoidb store as part of a supervision tree
116 | %% with a registered name.
117 | -spec open_link(Name::{local, Name::atom()} | {global, GlobalName::term()} | {via, ViaName::term()},
118 | Dir::string(), Opts::[config_option()]) -> {ok, hanoidb()} | ignore | {error, term()}.
119 | open_link(Name, Dir, Opts) ->
120 | ok = start_app(),
121 | SpawnOpt = hanoidb:get_opt(spawn_opt, Opts, []),
122 | gen_server:start_link(Name, ?MODULE, [Dir, Opts], [{spawn_opt,SpawnOpt}]).
123 |
124 | %% @doc
125 | %% Close a Hanoi data store.
126 | -spec close(Ref::pid()) -> ok.
127 | close(Ref) ->
128 | try
129 | gen_server:call(Ref, close, infinity)
130 | catch
131 | exit:{noproc,_} -> ok;
132 | exit:noproc -> ok;
133 | %% Handle the case where the monitor triggers
134 | exit:{normal, _} -> ok
135 | end.
136 |
137 | -spec destroy(Ref::pid()) -> ok.
138 | destroy(Ref) ->
139 | try
140 | gen_server:call(Ref, destroy, infinity)
141 | catch
142 | exit:{noproc,_} -> ok;
143 | exit:noproc -> ok;
144 | %% Handle the case where the monitor triggers
145 | exit:{normal, _} -> ok
146 | end.
147 |
148 | get(Ref,Key) when is_binary(Key) ->
149 | gen_server:call(Ref, {get, Key}, infinity).
150 |
151 | %% for compatibility with original code
152 | lookup(Ref,Key) when is_binary(Key) ->
153 | gen_server:call(Ref, {get, Key}, infinity).
154 |
155 | -spec delete(hanoidb(), binary()) ->
156 | ok | {error, term()}.
157 | delete(Ref,Key) when is_binary(Key) ->
158 | gen_server:call(Ref, {delete, Key}, infinity).
159 |
160 | -spec put(hanoidb(), binary(), binary()) ->
161 | ok | {error, term()}.
162 | put(Ref,Key,Value) when is_binary(Key), is_binary(Value) ->
163 | gen_server:call(Ref, {put, Key, Value, infinity}, infinity).
164 |
165 | -spec put(hanoidb(), binary(), binary(), integer()) ->
166 | ok | {error, term()}.
167 | put(Ref,Key,Value,infinity) when is_binary(Key), is_binary(Value) ->
168 | gen_server:call(Ref, {put, Key, Value, infinity}, infinity);
169 | put(Ref,Key,Value,Expiry) when is_binary(Key), is_binary(Value) ->
170 | gen_server:call(Ref, {put, Key, Value, Expiry}, infinity).
171 |
172 | -type transact_spec() :: {put, binary(), binary()} | {delete, binary()}.
173 | -spec transact(hanoidb(), [transact_spec()]) ->
174 | ok | {error, term()}.
175 | transact(Ref, TransactionSpec) ->
176 | gen_server:call(Ref, {transact, TransactionSpec}, infinity).
177 |
178 | -type kv_fold_fun() :: fun((binary(),binary(),any())->any()).
179 |
180 | -spec fold(hanoidb(),kv_fold_fun(),any()) -> any().
181 | fold(Ref,Fun,Acc0) ->
182 | fold_range(Ref,Fun,Acc0,#key_range{from_key= <<>>, to_key=undefined}).
183 |
184 | -spec fold_range(hanoidb(),kv_fold_fun(),any(),key_range()) -> any().
185 | fold_range(Ref,Fun,Acc0,#key_range{limit=Limit}=Range) ->
186 | RangeType =
187 | if Limit < 10 -> blocking_range;
188 | true -> snapshot_range
189 | end,
190 | {ok, FoldWorkerPID} = hanoidb_fold_worker:start(self()),
191 | MRef = erlang:monitor(process, FoldWorkerPID),
192 | ?log("fold_range begin: self=~p, worker=~p monitor=~p~n", [self(), FoldWorkerPID, MRef]),
193 | ok = gen_server:call(Ref, {RangeType, FoldWorkerPID, Range}, infinity),
194 | Result = receive_fold_range(MRef, FoldWorkerPID, Fun, Acc0, Limit),
195 | ?log("fold_range done: self:~p, result=~p~n", [self(), Result]),
196 | Result.
197 |
198 | receive_fold_range(MRef,PID,_,Acc0, 0) ->
199 | erlang:exit(PID, shutdown),
200 | drain_worker(MRef,PID,Acc0);
201 |
202 | receive_fold_range(MRef,PID,Fun,Acc0, Limit) ->
203 | ?log("receive_fold_range:~p,~P~n", [PID,Acc0,10]),
204 | receive
205 |
206 | %% receive one K/V from fold_worker
207 | ?CALL(From, {fold_result, PID, K,V}) ->
208 | plain_rpc:send_reply(From, ok),
209 | case
210 | try
211 | {ok, Fun(K,V,Acc0)}
212 | catch
213 | Class:Exception ->
214 | % TODO ?log("Exception in hanoidb fold: ~p ~p", [Exception, erlang:get_stacktrace()]),
215 | {'EXIT', Class, Exception, erlang:get_stacktrace()}
216 | end
217 | of
218 | {ok, Acc1} ->
219 | receive_fold_range(MRef, PID, Fun, Acc1, decr(Limit));
220 | Exit ->
221 | %% kill the fold worker ...
222 | erlang:exit(PID, shutdown),
223 | raise(drain_worker(MRef,PID,Exit))
224 | end;
225 |
226 | ?CAST(_,{fold_limit, PID, _}) ->
227 | ?log("> fold_limit pid=~p, self=~p~n", [PID, self()]),
228 | erlang:demonitor(MRef, [flush]),
229 | Acc0;
230 | ?CAST(_,{fold_done, PID}) ->
231 | ?log("> fold_done pid=~p, self=~p~n", [PID, self()]),
232 | erlang:demonitor(MRef, [flush]),
233 | Acc0;
234 | {'DOWN', MRef, _, _PID, normal} ->
235 | ?log("> fold worker ~p ENDED~n", [_PID]),
236 | Acc0;
237 | {'DOWN', MRef, _, _PID, Reason} ->
238 | ?log("> fold worker ~p DOWN reason:~p~n", [_PID, Reason]),
239 | error({fold_worker_died, Reason})
240 | end.
241 |
242 | decr(undefined) ->
243 | undefined;
244 | decr(N) ->
245 | N-1.
246 |
247 | %%
248 | %% Just calls erlang:raise with appropriate arguments
249 | %%
250 | raise({'EXIT', Class, Exception, Trace}) ->
251 | erlang:raise(Class, Exception, Trace).
252 |
253 |
254 | drain_worker(MRef, PID, Value) ->
255 | receive
256 | ?CALL(_From,{fold_result, PID, _, _}) ->
257 | drain_worker(MRef, PID, Value);
258 | {'DOWN', MRef, _, _, _} ->
259 | Value;
260 | ?CAST(_,{fold_limit, PID, _}) ->
261 | erlang:demonitor(MRef, [flush]),
262 | Value;
263 | ?CAST(_,{fold_done, PID}) ->
264 | erlang:demonitor(MRef, [flush]),
265 | Value
266 | after 0 ->
267 | Value
268 | end.
269 |
270 |
271 | init([Dir, Opts0]) ->
272 | %% ensure expiry_secs option is set in config
273 | Opts =
274 | case get_opt(expiry_secs, Opts0) of
275 | undefined ->
276 | [{expiry_secs, ?DEFAULT_EXPIRY_SECS}|Opts0];
277 | N when is_integer(N), N >= 0 ->
278 | [{expiry_secs, N}|Opts0]
279 | end,
280 | hanoidb_util:ensure_expiry(Opts),
281 |
282 | {Top, Nur, Max} =
283 | case file:read_file_info(Dir) of
284 | {ok, #file_info{ type=directory }} ->
285 | {ok, TopLevel, MinLevel, MaxLevel} = open_levels(Dir, Opts),
286 | {ok, Nursery} = hanoidb_nursery:recover(Dir, TopLevel, MinLevel, MaxLevel, Opts),
287 | {TopLevel, Nursery, MaxLevel};
288 | {error, E} when E =:= enoent ->
289 | ok = file:make_dir(Dir),
290 | MinLevel = get_opt(top_level, Opts0, ?TOP_LEVEL),
291 | {ok, TopLevel} = hanoidb_level:open(Dir, MinLevel, undefined, Opts, self()),
292 | MaxLevel = MinLevel,
293 | {ok, Nursery} = hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Opts),
294 | {TopLevel, Nursery, MaxLevel}
295 | end,
296 | {ok, #state{ top=Top, dir=Dir, nursery=Nur, opt=Opts, max_level=Max }}.
297 |
298 |
299 | open_levels(Dir, Options) ->
300 | {ok, Files} = file:list_dir(Dir),
301 | TopLevel0 = get_opt(top_level, Options, ?TOP_LEVEL),
302 |
303 | %% parse file names and find max level
304 | {MinLevel, MaxLevel} =
305 | lists:foldl(fun(FileName, {MinLevel, MaxLevel}) ->
306 | case parse_level(FileName) of
307 | {ok, Level} ->
308 | {erlang:min(MinLevel, Level),
309 | erlang:max(MaxLevel, Level)};
310 | _ ->
311 | {MinLevel, MaxLevel}
312 | end
313 | end,
314 | {TopLevel0, TopLevel0},
315 | Files),
316 |
317 | %% remove old nursery data file
318 | NurseryFileName = filename:join(Dir, "nursery.data"),
319 | _ = file:delete(NurseryFileName),
320 |
321 | %% Do enough incremental merge to be sure we won't deadlock in insert
322 | {TopLevel, MaxMerge} =
323 | lists:foldl(fun(LevelNo, {NextLevel, MergeWork0}) ->
324 | {ok, Level} = hanoidb_level:open(Dir, LevelNo, NextLevel, Options, self()),
325 | MergeWork = MergeWork0 + hanoidb_level:unmerged_count(Level),
326 | {Level, MergeWork}
327 | end,
328 | {undefined, 0},
329 | lists:seq(MaxLevel, MinLevel, -1)),
330 | WorkPerIter = (MaxLevel - MinLevel + 1) * ?BTREE_SIZE(MinLevel),
331 | % error_logger:info_msg("do_merge ... {~p,~p,~p}~n", [TopLevel, WorkPerIter, MaxMerge]),
332 | do_merge(TopLevel, WorkPerIter, MaxMerge, MinLevel),
333 | {ok, TopLevel, MinLevel, MaxLevel}.
334 |
335 | do_merge(TopLevel, _Inc, N, _MinLevel) when N =< 0 ->
336 | ok = hanoidb_level:await_incremental_merge(TopLevel);
337 | do_merge(TopLevel, Inc, N, MinLevel) ->
338 | ok = hanoidb_level:begin_incremental_merge(TopLevel, ?BTREE_SIZE(MinLevel)),
339 | do_merge(TopLevel, Inc, N-Inc, MinLevel).
340 |
341 |
342 | parse_level(FileName) ->
343 | case re:run(FileName, "^[^\\d]+-(\\d+)\\.data$", [{capture,all_but_first,list}]) of
344 | {match,[StringVal]} ->
345 | {ok, list_to_integer(StringVal)};
346 | _ ->
347 | nomatch
348 | end.
349 |
350 |
351 | handle_info({bottom_level, N}, #state{ nursery=Nursery, top=TopLevel }=State)
352 | when N > State#state.max_level ->
353 | State2 = State#state{ max_level = N,
354 | nursery= hanoidb_nursery:set_max_level(Nursery, N) },
355 |
356 | _ = hanoidb_level:set_max_level(TopLevel, N),
357 |
358 | {noreply, State2};
359 |
360 | handle_info(Info,State) ->
361 | error_logger:error_msg("Unknown info ~p~n", [Info]),
362 | {stop,bad_msg,State}.
363 |
364 | handle_cast(Info,State) ->
365 | error_logger:error_msg("Unknown cast ~p~n", [Info]),
366 | {stop,bad_msg,State}.
367 |
368 |
369 | %% premature delete -> cleanup
370 | terminate(normal, _State) ->
371 | ok;
372 | terminate(_Reason, _State) ->
373 | error_logger:info_msg("got terminate(~p, ~p)~n", [_Reason, _State]),
374 | ok.
375 |
376 | code_change(_OldVsn, State, _Extra) ->
377 | {ok, State}.
378 |
379 |
380 | handle_call({snapshot_range, FoldWorkerPID, Range}, _From, State=#state{ top=TopLevel, nursery=Nursery }) ->
381 | hanoidb_nursery:do_level_fold(Nursery, FoldWorkerPID, Range),
382 | Result = hanoidb_level:snapshot_range(TopLevel, FoldWorkerPID, Range),
383 | {reply, Result, State};
384 |
385 | handle_call({blocking_range, FoldWorkerPID, Range}, _From, State=#state{ top=TopLevel, nursery=Nursery }) ->
386 | hanoidb_nursery:do_level_fold(Nursery, FoldWorkerPID, Range),
387 | Result = hanoidb_level:blocking_range(TopLevel, FoldWorkerPID, Range),
388 | {reply, Result, State};
389 |
390 | handle_call({put, Key, Value, Expiry}, _From, State) when is_binary(Key), is_binary(Value) ->
391 | {ok, State2} = do_put(Key, Value, Expiry, State),
392 | {reply, ok, State2};
393 |
394 | handle_call({transact, TransactionSpec}, _From, State) ->
395 | {ok, State2} = do_transact(TransactionSpec, State),
396 | {reply, ok, State2};
397 |
398 | handle_call({delete, Key}, _From, State) when is_binary(Key) ->
399 | {ok, State2} = do_put(Key, ?TOMBSTONE, infinity, State),
400 | {reply, ok, State2};
401 |
402 | handle_call({get, Key}, From, State=#state{ top=Top, nursery=Nursery } ) when is_binary(Key) ->
403 | case hanoidb_nursery:lookup(Key, Nursery) of
404 | {value, ?TOMBSTONE} ->
405 | {reply, not_found, State};
406 | {value, Value} when is_binary(Value) ->
407 | {reply, {ok, Value}, State};
408 | none ->
409 | _ = hanoidb_level:lookup(Top, Key, fun(Reply) -> gen_server:reply(From, Reply) end),
410 | {noreply, State}
411 | end;
412 |
413 | handle_call(close, _From, State=#state{ nursery=undefined }) ->
414 | {stop, normal, ok, State};
415 |
416 | handle_call(close, _From, State=#state{ nursery=Nursery, top=Top, dir=Dir, max_level=MaxLevel, opt=Config }) ->
417 | try
418 | ok = hanoidb_nursery:finish(Nursery, Top),
419 | MinLevel = hanoidb_level:level(Top),
420 | {ok, Nursery2} = hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Config),
421 | ok = hanoidb_level:close(Top),
422 | {stop, normal, ok, State#state{ nursery=Nursery2 }}
423 | catch
424 | E:R ->
425 | error_logger:info_msg("exception from close ~p:~p~n", [E,R]),
426 | {stop, normal, ok, State}
427 | end;
428 |
429 | handle_call(destroy, _From, State=#state{top=Top, nursery=Nursery }) ->
430 | TopLevelNumber = hanoidb_level:level(Top),
431 | ok = hanoidb_nursery:destroy(Nursery),
432 | ok = hanoidb_level:destroy(Top),
433 | {stop, normal, ok, State#state{ top=undefined, nursery=undefined, max_level=TopLevelNumber }}.
434 |
435 | -spec do_put(key(), value(), expiry(), #state{}) -> {ok, #state{}}.
436 | do_put(Key, Value, Expiry, State=#state{ nursery=Nursery, top=Top }) when Nursery =/= undefined ->
437 | {ok, Nursery2} = hanoidb_nursery:add(Key, Value, Expiry, Nursery, Top),
438 | {ok, State#state{nursery=Nursery2}}.
439 |
440 | do_transact([{put, Key, Value}], State) ->
441 | do_put(Key, Value, infinity, State);
442 | do_transact([{delete, Key}], State) ->
443 | do_put(Key, ?TOMBSTONE, infinity, State);
444 | do_transact([], State) ->
445 | {ok, State};
446 | do_transact(TransactionSpec, State=#state{ nursery=Nursery, top=Top }) ->
447 | {ok, Nursery2} = hanoidb_nursery:transact(TransactionSpec, Nursery, Top),
448 | {ok, State#state{ nursery=Nursery2 }}.
449 |
450 | start_app() ->
451 | ok = ensure_started(syntax_tools),
452 | ok = ensure_started(plain_fsm),
453 | ok = ensure_started(?MODULE).
454 |
455 | ensure_started(Application) ->
456 | case application:start(Application) of
457 | ok ->
458 | ok;
459 | {error, {already_started, _}} ->
460 | ok;
461 | {error, Reason} ->
462 | {error, Reason}
463 | end.
464 |
465 | get_opt(Key, Opts) ->
466 | get_opt(Key, Opts, undefined).
467 |
468 | get_opt(Key, Opts, Default) ->
469 | case proplists:get_value(Key, Opts) of
470 | undefined ->
471 | case application:get_env(?MODULE, Key) of
472 | {ok, Value} -> Value;
473 | undefined -> Default
474 | end;
475 | Value ->
476 | Value
477 | end.
478 |
--------------------------------------------------------------------------------
/src/hanoidb.hrl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 |
26 | %% smallest levels are 1024 entries
27 | -define(TOP_LEVEL, 10).
28 | -define(BTREE_SIZE(Level), (1 bsl (Level))).
29 | -define(FILE_FORMAT, <<"HAN2">>).
30 | -define(FIRST_BLOCK_POS, byte_size(?FILE_FORMAT)).
31 |
32 | -define(TOMBSTONE, 'deleted').
33 |
34 | -define(KEY_IN_FROM_RANGE(Key,Range),
35 | ((Range#key_range.from_inclusive andalso
36 | (Range#key_range.from_key =< Key))
37 | orelse
38 | (Range#key_range.from_key < Key))).
39 |
40 | -define(KEY_IN_TO_RANGE(Key,Range),
41 | ((Range#key_range.to_key == undefined)
42 | orelse
43 | ((Range#key_range.to_inclusive andalso
44 | (Key =< Range#key_range.to_key))
45 | orelse
46 | (Key < Range#key_range.to_key)))).
47 |
48 | -define(KEY_IN_RANGE(Key,Range),
49 | (?KEY_IN_FROM_RANGE(Key,Range) andalso ?KEY_IN_TO_RANGE(Key,Range))).
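
%% Illustrative (not part of the original source): with the defaults from
%% include/hanoidb.hrl (from_key = <<>>, from_inclusive = true,
%% to_key = undefined), ?KEY_IN_RANGE(Key, Range) is true for any binary Key.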
50 |
51 |
52 | -ifdef(pre18).
53 | -define(TIMESTAMP, now()).
54 | -else.
55 | -define(TIMESTAMP, erlang:timestamp()).
56 | -endif.
57 |
58 | -record(nursery, { log_file :: file:fd(),
59 | dir :: string(),
60 | cache :: gb_trees:tree(binary(), binary()),
61 | total_size=0 :: integer(),
62 | count=0 :: integer(),
63 | last_sync=?TIMESTAMP :: erlang:timestamp(),
64 | min_level :: integer(),
65 | max_level :: integer(),
66 | config=[] :: [{atom(), term()}],
67 | step=0 :: integer(),
68 | merge_done=0 :: integer()}).
69 |
70 | -type kventry() :: { key(), expvalue() } | [ kventry() ].
71 | -type key() :: binary().
72 | -type txspec() :: { delete, key() } | { put, key(), value() }.
73 | -type value() :: ?TOMBSTONE | binary().
74 | -type expiry() :: infinity | integer().
75 | -type filepos() :: { non_neg_integer(), non_neg_integer() }.
76 | -type expvalue() :: { value(), expiry() }
77 | | value()
78 | | filepos().
79 |
80 | -ifdef(USE_EBLOOM).
81 | -define(HANOI_BLOOM_TYPE, ebloom).
82 | -else.
83 | -define(HANOI_BLOOM_TYPE, sbloom).
84 | -endif.
85 |
86 | -define(BLOOM_NEW(Size), hanoidb_util:bloom_new(Size, ?HANOI_BLOOM_TYPE)).
87 | -define(BLOOM_TO_BIN(Bloom), hanoidb_util:bloom_to_bin(Bloom)).
88 | -define(BIN_TO_BLOOM(Bin, Fmt), hanoidb_util:bin_to_bloom(Bin, Fmt)).
89 | -define(BLOOM_INSERT(Bloom, Key), hanoidb_util:bloom_insert(Bloom, Key)).
90 | -define(BLOOM_CONTAINS(Bloom, Key), hanoidb_util:bloom_contains(Bloom, Key)).
91 |
92 | %% tags used in the on-disk representation
93 | -define(TAG_KV_DATA, 16#80).
94 | -define(TAG_DELETED, 16#81).
95 | -define(TAG_POSLEN32, 16#82).
96 | -define(TAG_TRANSACT, 16#83).
97 | -define(TAG_KV_DATA2, 16#84).
98 | -define(TAG_DELETED2, 16#85).
99 | -define(TAG_END, 16#FF).
100 |
101 |
102 |
--------------------------------------------------------------------------------
/src/hanoidb_app.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_app).
26 | -author('Kresten Krab Thorup ').
27 |
28 | -behaviour(application).
29 |
30 | %% Application callbacks
31 | -export([start/2, stop/1]).
32 |
33 | %% ===================================================================
34 | %% Application callbacks
35 | %% ===================================================================
36 |
37 | start(_StartType, _StartArgs) ->
38 | hanoidb_sup:start_link().
39 |
40 | stop(_State) ->
41 | ok.
42 |
--------------------------------------------------------------------------------
/src/hanoidb_bloom.erl:
--------------------------------------------------------------------------------
1 | %% The contents of this file are subject to the Erlang Public License, Version
2 | %% 1.1, (the "License"); you may not use this file except in compliance with
3 | %% the License. You should have received a copy of the Erlang Public License
4 | %% along with this software. If not, it can be retrieved via the world wide web
5 | %% at http://www.erlang.org/.
6 | %%
7 | %% Software distributed under the License is distributed on an "AS IS" basis,
8 | %% WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for
9 | %% the specific language governing rights and limitations under the License.
10 |
11 | %% Based on: Scalable Bloom Filters
12 | %% Paulo Sérgio Almeida, Carlos Baquero, Nuno Preguiça, David Hutchison
13 | %% Information Processing Letters
14 | %% Volume 101, Issue 6, 31 March 2007, Pages 255-261
15 | %%
16 | %% Provides scalable bloom filters that can grow indefinitely while ensuring a
17 | %% desired maximum false positive probability. Also provides standard
18 | %% partitioned bloom filters with a maximum capacity. Bit arrays are
19 | %% dimensioned as a power of 2 to enable reusing hash values across filters
20 | %% through bit operations. Double hashing is used (no need for enhanced double
21 | %% hashing for partitioned bloom filters).
22 |
23 | %% Modified slightly by Justin Sheehy to make it a single file (incorporated
24 | %% the array-based bitarray internally).
25 | -module(hanoidb_bloom).
26 | -author("Paulo Sergio Almeida ").
27 |
28 | -export([sbf/1, sbf/2, sbf/3, sbf/4,
29 | bloom/1, bloom/2,
30 | member/2, add/2,
31 | size/1, capacity/1,
32 | encode/1, decode/1]).
33 | -import(math, [log/1, pow/2]).
34 |
35 | -ifdef(TEST).
36 | -ifdef(EQC).
37 | -include_lib("eqc/include/eqc.hrl").
38 | -endif.
39 | -include_lib("eunit/include/eunit.hrl").
40 | -endif.
41 |
42 | -define(W, 27).
43 |
44 | -ifdef(pre18).
45 | -type bitmask() :: array() | any().
46 | -else.
47 | -type bitmask() :: array:array() | any().
48 | -endif.
49 |
50 | -record(bloom, {
51 | e :: float(), % error probability
52 | n :: non_neg_integer(), % maximum number of elements
53 | mb :: non_neg_integer(), % 2^mb = m, the size of each slice (bitvector)
54 | size :: non_neg_integer(), % number of elements
55 | a :: [bitmask()] % list of bitvectors
56 | }).
57 |
58 | -record(sbf, {
59 | e :: float(), % error probability
60 | r :: float(), % error probability ratio
61 | s :: non_neg_integer(), % log 2 of growth ratio
62 | size :: non_neg_integer(), % number of elements
63 | b :: [#bloom{}] % list of plain bloom filters
64 | }).
65 |
66 | %% Constructors for (fixed capacity) bloom filters
67 | %%
68 | %% N - capacity
69 | %% E - error probability
70 | bloom(N) -> bloom(N, 0.001).
71 | bloom(N, E) when is_number(N), N > 0,
72 | is_float(E), E > 0, E < 1,
73 | N >= 4/E -> % rule of thumb; due to double hashing
74 | bloom(size, N, E);
75 | bloom(N, E) when is_number(N), N >= 0,
76 | is_float(E), E > 0, E < 1 ->
77 | bloom(bits, 32, E).
78 |
79 | bloom(Mode, N, E) ->
80 | K = case Mode of
81 | size -> 1 + trunc(log2(1/E));
82 | bits -> 1
83 | end,
84 | P = pow(E, 1 / K),
85 |
86 | Mb =
87 | case Mode of
88 | size ->
89 | 1 + trunc(-log2(1 - pow(1 - P, 1 / N)));
90 | bits ->
91 | N
92 | end,
93 | M = 1 bsl Mb,
94 | D = trunc(log(1-P) / log(1-1/M)),
95 | #bloom{e=E, n=D, mb=Mb, size = 0,
96 | a = [bitmask_new(Mb) || _ <- lists:seq(1, K)]}.
97 |
98 | log2(X) -> log(X) / log(2).
99 |
100 | %% Constructors for scalable bloom filters
101 | %%
102 | %% N - initial capacity before expanding
103 | %% E - error probability
104 | %% S - growth ratio when full (log 2) can be 1, 2 or 3
105 | %% R - tightening ratio of error probability
106 | sbf(N) -> sbf(N, 0.001).
107 | sbf(N, E) -> sbf(N, E, 1).
108 | sbf(N, E, 1) -> sbf(N, E, 1, 0.85);
109 | sbf(N, E, 2) -> sbf(N, E, 2, 0.75);
110 | sbf(N, E, 3) -> sbf(N, E, 3, 0.65).
111 | sbf(N, E, S, R) when is_number(N), N > 0,
112 | is_float(E), E > 0, E < 1,
113 | is_integer(S), S > 0, S < 4,
114 | is_float(R), R > 0, R < 1,
115 | N >= 4/(E*(1-R)) -> % rule of thumb; due to double hashing
116 | #sbf{e=E, s=S, r=R, size=0, b=[bloom(N, E*(1-R))]}.
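%% Usage sketch (illustrative): note the N >= 4/(E*(1-R)) guard above, so with
%% the default E = 0.001 and R = 0.85 the initial capacity must be >= 26667.
%%
%%   Sbf0 = sbf(30000),
%%   Sbf1 = add(<<"key">>, Sbf0),
%%   true = member(<<"key">>, Sbf1),
%%   infinity = capacity(Sbf1).     % scalable filters grow indefinitely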
117 |
118 | %% Returns number of elements
119 | %%
120 | size(#bloom{size=Size}) -> Size;
121 | size(#sbf{size=Size}) -> Size.
122 |
123 | %% Returns capacity
124 | %%
125 | capacity(#bloom{n=N}) -> N;
126 | capacity(#sbf{}) -> infinity.
127 |
128 | %% Test for membership
129 | %%
130 | member(Elem, #bloom{mb=Mb}=B) ->
131 | Hashes = make_hashes(Mb, Elem),
132 | hash_member(Hashes, B);
133 | member(Elem, #sbf{b=[H|_]}=Sbf) ->
134 | Hashes = make_hashes(H#bloom.mb, Elem),
135 | hash_member(Hashes, Sbf).
136 |
137 | hash_member(Hashes, #bloom{mb=Mb, a=A}) ->
138 | Mask = 1 bsl Mb -1,
139 | {I1, I0} = make_indexes(Mask, Hashes),
140 | all_set(Mask, I1, I0, A);
141 | hash_member(Hashes, #sbf{b=B}) ->
142 | lists:any(fun(X) -> hash_member(Hashes, X) end, B).
143 |
144 | make_hashes(Mb, E) when Mb =< 16 ->
145 | erlang:phash2({E}, 1 bsl 32);
146 | make_hashes(Mb, E) when Mb =< 32 ->
147 | {erlang:phash2({E}, 1 bsl 32), erlang:phash2([E], 1 bsl 32)}.
148 |
149 | make_indexes(Mask, {H0, H1}) when Mask > 1 bsl 16 -> masked_pair(Mask, H0, H1);
150 | make_indexes(Mask, {H0, _}) -> make_indexes(Mask, H0);
151 | make_indexes(Mask, H0) -> masked_pair(Mask, H0 bsr 16, H0).
152 |
153 | masked_pair(Mask, X, Y) -> {X band Mask, Y band Mask}.
154 |
155 | all_set(_Mask, _I1, _I, []) -> true;
156 | all_set(Mask, I1, I, [H|T]) ->
157 | bitmask_get(I, H) andalso all_set(Mask, I1, (I+I1) band Mask, T).
158 |
159 | %% Adds element to set
160 | %%
161 | add(Elem, #bloom{mb=Mb} = B) ->
162 | Hashes = make_hashes(Mb, Elem),
163 | hash_add(Hashes, B);
164 | add(Elem, #sbf{size=Size, r=R, s=S, b=[H|T]=Bs}=Sbf) ->
165 | #bloom{mb=Mb, e=E, n=N, size=HSize} = H,
166 | Hashes = make_hashes(Mb, Elem),
167 | case hash_member(Hashes, Sbf) of
168 | true -> Sbf;
169 | false ->
170 | case HSize < N of
171 | true -> Sbf#sbf{size=Size+1, b=[hash_add(Hashes, H)|T]};
172 | false ->
173 | B = add(Elem, bloom(bits, Mb + S, E * R)),
174 | Sbf#sbf{size=Size+1, b=[B|Bs]}
175 | end
176 | end.
177 |
178 | hash_add(Hashes, #bloom{mb=Mb, a=A, size=Size} = B) ->
179 | Mask = 1 bsl Mb -1,
180 | {I1, I0} = make_indexes(Mask, Hashes),
181 | B#bloom{size=Size+1, a=set_bits(Mask, I1, I0, A, [])}.
182 |
183 | set_bits(_Mask, _I1, _I, [], Acc) -> lists:reverse(Acc);
184 | set_bits(Mask, I1, I, [H|T], Acc) ->
185 | set_bits(Mask, I1, (I+I1) band Mask, T, [bitmask_set(I, H) | Acc]).
186 |
187 |
188 | %%%========== Dispatch to appropriate representation:
189 | bitmask_new(LogN) ->
190 | if LogN >= 20 -> % Use sparse representation.
191 | hanoidb_sparse_bitmap:new(LogN);
192 | true -> % Use dense representation.
193 | hanoidb_dense_bitmap:new(1 bsl LogN)
194 | end.
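%% e.g. a slice of 2^20 bits (128 KiB) or larger is kept sparse, while smaller
%% slices use the dense ETS-backed representation until built.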
195 |
196 | bitmask_set(I, BM) ->
197 | case element(1,BM) of
198 | array -> bitarray_set(I, as_array(BM));
199 | sparse_bitmap -> hanoidb_sparse_bitmap:set(I, BM);
200 | dense_bitmap_ets -> hanoidb_dense_bitmap:set(I, BM);
201 | dense_bitmap ->
202 | %% Surprise - we need to mutate a built representation:
203 | hanoidb_dense_bitmap:set(I, hanoidb_dense_bitmap:unbuild(BM))
204 | end.
205 |
206 | %%% Convert to external form.
207 | bitmask_build(BM) ->
208 | case element(1,BM) of
209 | array -> BM;
210 | sparse_bitmap -> BM;
211 | dense_bitmap -> BM;
212 | dense_bitmap_ets -> hanoidb_dense_bitmap:build(BM)
213 | end.
214 |
215 | bitmask_get(I, BM) ->
216 | case element(1,BM) of
217 | array -> bitarray_get(I, as_array(BM));
218 | sparse_bitmap -> hanoidb_sparse_bitmap:member(I, BM);
219 | dense_bitmap_ets -> hanoidb_dense_bitmap:member(I, BM);
220 | dense_bitmap -> hanoidb_dense_bitmap:member(I, BM)
221 | end.
222 |
223 | -ifdef(pre18).
224 | -spec as_array(bitmask()) -> array().
225 | -else.
226 | -spec as_array(bitmask()) -> array:array().
227 | -endif.
228 | as_array(BM) ->
229 | case array:is_array(BM) of
230 | true -> BM
231 | end.
232 |
233 | %%%========== Bitarray representation - suitable for sparse arrays ==========
234 | bitarray_new(N) -> array:new((N-1) div ?W + 1, {default, 0}).
235 |
236 | -ifdef(pre18).
237 | -spec bitarray_set( non_neg_integer(), array() ) -> array().
238 | -else.
239 | -spec bitarray_set( non_neg_integer(), array:array() ) -> array:array().
240 | -endif.
241 |
242 | bitarray_set(I, A1) ->
243 | A = as_array(A1),
244 | AI = I div ?W,
245 | V = array:get(AI, A),
246 | V1 = V bor (1 bsl (I rem ?W)),
247 | if V =:= V1 -> A; % The bit is already set
248 | true -> array:set(AI, V1, A)
249 | end.
250 |
251 | -ifdef(pre18).
252 | -spec bitarray_get( non_neg_integer(), array() ) -> boolean().
253 | -else.
254 | -spec bitarray_get( non_neg_integer(), array:array() ) -> boolean().
255 | -endif.
256 | bitarray_get(I, A) ->
257 | AI = I div ?W,
258 | V = array:get(AI, A),
259 | (V band (1 bsl (I rem ?W))) =/= 0.
260 |
261 | %%%^^^^^^^^^^ Bitarray representation - suitable for sparse arrays ^^^^^^^^^^
262 |
263 | encode(Bloom) ->
264 | zlib:gzip(term_to_binary(bloom_build(Bloom))).
265 |
266 | decode(Bin) ->
267 | binary_to_term(zlib:gunzip(Bin)).
268 |
269 | %%% Convert to external form.
270 | bloom_build(Bloom=#bloom{a=Bitmasks}) ->
271 | Bloom#bloom{a=[bitmask_build(X) || X <- Bitmasks]};
272 | bloom_build(Sbf=#sbf{b=Blooms}) ->
273 | Sbf#sbf{b=[bloom_build(X) || X <- Blooms]}.
274 |
275 | %% UNIT TESTS
276 |
277 | -ifdef(TEST).
278 | -ifdef(EQC).
279 |
280 | prop_bloom_test_() ->
281 | {timeout, 60, fun() -> ?assert(eqc:quickcheck(prop_bloom())) end}.
282 |
283 | g_keys() ->
284 | non_empty(list(non_empty(binary()))).
285 |
286 | prop_bloom() ->
287 | ?FORALL(Keys, g_keys(),
288 | begin
289 | Bloom = lists:foldl(fun add/2, ?MODULE:bloom(length(Keys)), Keys), % insert all keys before checking membership
290 | F = fun(X) -> member(X, Bloom) end,
291 | lists:all(F, Keys)
292 | end).
293 |
294 | -endif.
295 | -endif.
296 |
--------------------------------------------------------------------------------
/src/hanoidb_dense_bitmap.erl:
--------------------------------------------------------------------------------
1 | -module(hanoidb_dense_bitmap).
2 |
3 | -export([new/1, set/2, build/1, unbuild/1, member/2]).
4 | -define(BITS_PER_CELL, 32).
5 |
6 | -define(REPR_NAME, dense_bitmap).
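%% Usage sketch (illustrative): bits are set while the bitmap is backed by a
%% private ETS table; build/1 then freezes it into an immutable tuple that
%% member/2 can query without the table.
%%
%%   BM0  = hanoidb_dense_bitmap:new(1024),
%%   BM1  = hanoidb_dense_bitmap:set(7, BM0),  % same handle; the ETS table is mutated
%%   Row  = hanoidb_dense_bitmap:build(BM1),
%%   true = hanoidb_dense_bitmap:member(7, Row).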
7 |
8 | new(N) ->
9 | Tab = ets:new(dense_bitmap, [private, set]),
10 | Width = 1 + (N-1) div ?BITS_PER_CELL,
11 | Value = erlang:make_tuple(Width+1, 0, [{1,?REPR_NAME}]),
12 | ets:insert(Tab, Value),
13 | {dense_bitmap_ets, N, Width, Tab}.
14 |
15 | %% Set a bit.
16 | set(I, {dense_bitmap_ets, _,_, Tab}=DBM) ->
17 | Cell = 2 + I div ?BITS_PER_CELL,
18 | BitInCell = I rem ?BITS_PER_CELL,
19 | Old = ets:lookup_element(Tab, ?REPR_NAME, Cell),
20 | New = Old bor (1 bsl BitInCell),
21 | if New =:= Old ->
22 | ok; % The bit is already set
23 | true ->
24 | ets:update_element(Tab, ?REPR_NAME, {Cell,New})
25 | end,
26 | DBM.
27 |
28 | build({dense_bitmap_ets, _, _, Tab}) ->
29 | [Row] = ets:lookup(Tab, ?REPR_NAME),
30 | ets:delete(Tab),
31 | Row.
32 |
33 | unbuild(Row) when element(1,Row)==?REPR_NAME ->
34 | Tab = ets:new(dense_bitmap, [private, set]),
35 | ets:insert(Tab, Row),
36 | {dense_bitmap_ets, undefined, undefined, Tab}.
37 |
38 | member(I, Row) when element(1,Row)==?REPR_NAME ->
39 | Cell = 2 + I div ?BITS_PER_CELL,
40 | BitInCell = I rem ?BITS_PER_CELL,
41 | CellValue = element(Cell, Row),
42 | CellValue band (1 bsl BitInCell) =/= 0;
43 | member(I, {dense_bitmap_ets, _,_, Tab}) ->
44 | Cell = 2 + I div ?BITS_PER_CELL,
45 | BitInCell = I rem ?BITS_PER_CELL,
46 | CellValue = ets:lookup_element(Tab, ?REPR_NAME, Cell),
47 | CellValue band (1 bsl BitInCell) =/= 0.
48 |
--------------------------------------------------------------------------------
/src/hanoidb_fold_worker.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_fold_worker).
26 | -author('Kresten Krab Thorup ').
27 |
28 | -ifdef(DEBUG).
29 | -define(log(Fmt,Args),io:format(user,Fmt,Args)).
30 | -else.
31 | -define(log(Fmt,Args),ok).
32 | -endif.
33 |
34 | %%
35 | %% This worker is used to merge fold results from individual
36 | %% levels. First, it receives a message
37 | %%
38 | %% {initialize, [LevelWorker, ...]}
39 | %%
40 | %% And then from each LevelWorker, a sequence of
41 | %%
42 | %% {level_result, LevelWorker, Key1, Value}
43 | %% {level_result, LevelWorker, Key2, Value}
44 | %% {level_result, LevelWorker, Key3, Value}
45 | %% {level_result, LevelWorker, Key4, Value}
46 | %% {level_results, LevelWorker, [{Key,Value}...]} %% alternatively
47 | %% ...
48 | %% {level_done, LevelWorker}
49 | %%
50 | %% The order of level workers in the initialize message is top-down,
51 | %% which is used to select between same-key messages from different
52 | %% levels.
53 | %%
54 | %% This fold_worker process will then send to a designated SendTo target
55 | %% a similar sequence of messages
56 | %%
57 | %% {fold_result, self(), Key1, Value}
58 | %% {fold_result, self(), Key2, Value}
59 | %% {fold_result, self(), Key3, Value}
60 | %% ...
61 | %% {fold_done, self()}.
62 | %%
63 |
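%% A minimal consumer sketch (illustrative, not part of this module; collect/1
%% is a hypothetical helper running in the SendTo process). Note that
%% fold_result is delivered as a plain_rpc call that must be acknowledged,
%% providing back-pressure, while fold_limit and fold_done arrive as casts:
%%
%%   collect(Acc) ->
%%       receive
%%           ?CALL(From, {fold_result, _Worker, Key, Value}) ->
%%               plain_rpc:send_reply(From, ok),
%%               collect([{Key, Value} | Acc]);
%%           ?CAST(_, {fold_limit, _Worker, _Key}) -> lists:reverse(Acc);
%%           ?CAST(_, {fold_done, _Worker})        -> lists:reverse(Acc)
%%       end.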
64 | -export([start/1]).
65 | -behavior(plain_fsm).
66 | -export([data_vsn/0, code_change/3]).
67 |
68 | -include("hanoidb.hrl").
69 | -include("plain_rpc.hrl").
70 |
71 | -record(state, {sendto :: pid(), sendto_ref :: reference()}).
72 |
73 | start(SendTo) ->
74 | F = fun() ->
75 | ?log("fold_worker started ~p~n", [self()]),
76 | process_flag(trap_exit, true),
77 | MRef = erlang:monitor(process, SendTo),
78 | try
79 | initialize(#state{sendto=SendTo, sendto_ref=MRef}, []),
80 | ?log("fold_worker done ~p~n", [self()])
81 | catch
82 | Class:Ex ->
83 | ?log("fold_worker exception ~p:~p ~p~n", [Class, Ex, erlang:get_stacktrace()]),
84 | error_logger:error_msg("Unexpected: ~p:~p ~p~n", [Class, Ex, erlang:get_stacktrace()]),
85 | exit({bad, Class, Ex, erlang:get_stacktrace()})
86 | end
87 | end,
88 | PID = plain_fsm:spawn(?MODULE, F),
89 | {ok, PID}.
90 |
91 | initialize(State, PrefixFolders) ->
92 | Parent = plain_fsm:info(parent),
93 | receive
94 | {prefix, [_]=Folders} ->
95 | initialize(State, Folders);
96 |
97 | {initialize, Folders} ->
98 | Queues = [ {PID,queue:new()} || PID <- (PrefixFolders ++ Folders) ],
99 | Initial = [ {PID,undefined} || PID <- (PrefixFolders ++ Folders) ],
100 | fill(State, Initial, Queues, PrefixFolders ++ Folders);
101 |
102 | %% gen_fsm handling
103 | {system, From, Req} ->
104 | plain_fsm:handle_system_msg(
105 | Req, From, State, fun(S1) -> initialize(S1, PrefixFolders) end);
106 |
107 | {'DOWN', MRef, _, _, _} when MRef =:= State#state.sendto_ref ->
108 | ok;
109 |
110 | {'EXIT', Parent, Reason} ->
111 | plain_fsm:parent_EXIT(Reason, State)
112 | end.
113 |
114 | fill(State, Values, Queues, []) ->
115 | emit_next(State, Values, Queues);
116 |
117 | fill(State, Values, Queues, [PID|Rest]=PIDs) ->
118 | % io:format(user, "v=~P, q=~P, pids=~p~n", [Values, 10, Queues, 10, PIDs]),
119 | case lists:keyfind(PID, 1, Queues) of
120 | {PID, Q} ->
121 | case queue:out(Q) of
122 | {empty, Q} ->
123 | fill_from_inbox(State, Values, Queues, [PID], PIDs);
124 |
125 | {{value, Msg}, Q2} ->
126 | Queues2 = lists:keyreplace(PID, 1, Queues, {PID, Q2}),
127 |
128 | case Msg of
129 | done ->
130 | fill(State, lists:keydelete(PID, 1, Values), Queues2, Rest);
131 | {_Key, _Value}=KV ->
132 | fill(State, lists:keyreplace(PID, 1, Values, {PID, KV}), Queues2, Rest)
133 | end
134 | end
135 | end.
136 |
137 | fill_from_inbox(State, Values, Queues, [], PIDs) ->
138 | fill(State, Values, Queues, PIDs);
139 |
140 | fill_from_inbox(State, Values, Queues, [PID|_]=PIDs, SavePIDs) ->
141 | ?log("waiting for ~p~n", [PIDs]),
142 | receive
143 | {level_done, PID} ->
144 | ?log("got {done, ~p}~n", [PID]),
145 | Queues2 = enter(PID, done, Queues),
146 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs);
147 |
148 | {level_limit, PID, Key} ->
149 | ?log("got {limit, ~p}~n", [PID]),
150 | Queues2 = enter(PID, {Key, limit}, Queues),
151 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs);
152 |
153 | {level_result, PID, Key, Value} ->
154 | ?log("got {result, ~p}~n", [PID]),
155 | Queues2 = enter(PID, {Key, Value}, Queues),
156 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs);
157 |
158 | ?CALL(From,{level_results, PID, KVs}) ->
159 | ?log("got {results, ~p}~n", [PID]),
160 | plain_rpc:send_reply(From,ok),
161 | Queues2 = enter_many(PID, KVs, Queues),
162 | fill_from_inbox(State, Values, Queues2, lists:delete(PID,PIDs), SavePIDs);
163 |
164 | %% gen_fsm handling
165 | {system, From, Req} ->
166 | plain_fsm:handle_system_msg(
167 | Req, From, State, fun(S1) -> fill_from_inbox(S1, Values, Queues, PIDs, SavePIDs) end);
168 |
169 | {'DOWN', MRef, _, _, _} when MRef =:= State#state.sendto_ref ->
170 | ok;
171 |
172 | {'EXIT', Parent, Reason}=Msg ->
173 | case plain_fsm:info(parent) == Parent of
174 | true ->
175 | plain_fsm:parent_EXIT(Reason, State);
176 | false ->
177 | error_logger:info_msg("unhandled EXIT message ~p~n", [Msg]),
178 | fill_from_inbox(State, Values, Queues, PIDs, SavePIDs)
179 | end
180 |
181 | end.
182 |
183 | enter(PID, Msg, Queues) ->
184 | {PID, Q} = lists:keyfind(PID, 1, Queues),
185 | Q2 = queue:in(Msg, Q),
186 | lists:keyreplace(PID, 1, Queues, {PID, Q2}).
187 |
188 | enter_many(PID, Msgs, Queues) ->
189 | {PID, Q} = lists:keyfind(PID, 1, Queues),
190 | Q2 = lists:foldl(fun queue:in/2, Q, Msgs),
191 | lists:keyreplace(PID, 1, Queues, {PID, Q2}).
192 |
193 | emit_next(State, [], _Queues) ->
194 | ?log( "emit_next ~p~n", [[]]),
195 | Msg = {fold_done, self()},
196 | Target = State#state.sendto,
197 | ?log( "~p ! ~p~n", [Target, Msg]),
198 | _ = plain_rpc:cast(Target, Msg),
199 | end_of_fold(State);
200 |
201 | emit_next(State, [{FirstPID,FirstKV}|Rest]=Values, Queues) ->
202 | ?log( "emit_next ~p~n", [Values]),
203 | case
204 | lists:foldl(fun({P,{K1,_}=KV}, {{K2,_},_}) when K1 < K2 ->
205 | {KV,[P]};
206 | ({P,{K,_}}, {{K,_}=KV,List}) ->
207 | {KV, [P|List]};
208 | (_, Found) ->
209 | Found
210 | end,
211 | {FirstKV,[FirstPID]},
212 | Rest)
213 | of
214 | {{_, ?TOMBSTONE}, FillFrom} ->
215 | fill(State, Values, Queues, FillFrom);
216 | {{Key, limit}, _} ->
217 | ?log( "~p ! ~p~n", [State#state.sendto, {fold_limit, self(), Key}]),
218 | _ = plain_rpc:cast(State#state.sendto, {fold_limit, self(), Key}),
219 | end_of_fold(State);
220 | {{Key, Value}, FillFrom} ->
221 | ?log( "~p ! ~p~n", [State#state.sendto, {fold_result, self(), Key, '...'}]),
222 | plain_rpc:call(State#state.sendto, {fold_result, self(), Key, Value}),
223 | fill(State, Values, Queues, FillFrom)
224 | end.
225 |
226 | end_of_fold(_State) ->
227 | ok.
228 |
229 | data_vsn() ->
230 | 5.
231 |
232 | code_change(_OldVsn, _State, _Extra) ->
233 | {ok, {#state{}, data_vsn()}}.
234 |
235 |
236 |
--------------------------------------------------------------------------------
/src/hanoidb_merger.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_merger).
26 | -author('Kresten Krab Thorup ').
27 | -author('Gregory Burd ').
28 |
29 | %% @doc Merging two Indexes
30 |
31 | -export([start/6, merge/6]).
32 |
33 | -include("hanoidb.hrl").
34 | -include("include/plain_rpc.hrl").
35 |
36 | %% A merger that has been inactive for this long will hibernate: it closes its
37 | %% open files and keeps its state (bloom filter included) in compressed form.
38 | -define(HIBERNATE_TIMEOUT, 5000).
39 |
40 | %% Most likely, there will be plenty of I/O being generated by concurrent
41 | %% merges, so we default to running the entire merge in one process.
42 | -define(LOCAL_WRITER, true).
43 |
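%% Interaction sketch (illustrative; file names are hypothetical): the owning
%% level process drives the merge incrementally with step messages, and is
%% notified via a plain_rpc cast when the merge completes:
%%
%%   PID = hanoidb_merger:start("A-10.data", "B-10.data", "X-10.data",
%%                              1 bsl 10, false, []),
%%   Ref = make_ref(),
%%   PID ! {step, {self(), Ref}, 128},
%%   receive {Ref, step_done} -> ok end.
%%   %% ...eventually: receive ?CAST(PID, {merge_done, Count, "X-10.data"}) -> ...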
44 |
45 | -spec start(string(), string(), string(), integer(), boolean(), list()) -> pid().
46 | start(A,B,X, Size, IsLastLevel, Options) ->
47 | Owner = self(),
48 | plain_fsm:spawn_link(?MODULE, fun() ->
49 | try
50 | {ok, OutCount} = hanoidb_merger:merge(A, B, X,
51 | Size,
52 | IsLastLevel,
53 | Options),
54 |
55 | Owner ! ?CAST(self(),{merge_done, OutCount, X})
56 | catch
57 | C:E ->
58 | %% this semi-bogus code makes sure we always get a stack trace if merging fails
59 | error_logger:error_msg("~p: merge failed ~p:~p ~p -> ~s~n",
60 | [self(), C,E,erlang:get_stacktrace(), X]),
61 | erlang:raise(C,E,erlang:get_stacktrace())
62 | end
63 | end).
64 |
65 | -spec merge(string(), string(), string(), integer(), boolean(), list()) -> {ok, integer()}.
66 | merge(A,B,C, Size, IsLastLevel, Options) ->
67 | {ok, IXA} = hanoidb_reader:open(A, [sequential|Options]),
68 | {ok, IXB} = hanoidb_reader:open(B, [sequential|Options]),
69 | {ok, Out} = hanoidb_writer:init([C, [{size, Size} | Options]]),
70 | AKVs =
71 | case hanoidb_reader:first_node(IXA) of
72 | {kvlist, AKV} -> AKV;
73 | none -> []
74 | end,
75 | BKVs =
76 | case hanoidb_reader:first_node(IXB) of
77 | {kvlist, BKV} -> BKV;
78 | none -> []
79 | end,
80 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {0, none}).
81 |
82 | terminate(Out) ->
83 | {ok, Count, Out1} = hanoidb_writer:handle_call(count, self(), Out),
84 | {stop, normal, ok, _Out2} = hanoidb_writer:handle_call(close, self(), Out1),
85 | {ok, Count}.
86 |
87 | step(S) ->
88 | step(S, 1).
89 |
90 | step({N, From}, Steps) ->
91 | {N-Steps, From}.
92 |
93 | hibernate_scan(Keep) ->
94 | erlang:garbage_collect(),
95 | receive
96 | {step, From, HowMany} ->
97 | {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, N} = erlang:binary_to_term(Keep),
98 | scan(hanoidb_reader:deserialize(IXA),
99 | hanoidb_reader:deserialize(IXB),
100 | hanoidb_writer:deserialize(Out),
101 | IsLastLevel, AKVs, BKVs, {N+HowMany, From});
102 |
103 | %% gen_fsm handling
104 | {system, From, Req} ->
105 | plain_fsm:handle_system_msg(
106 | Req, From, Keep, fun hibernate_scan/1);
107 |
108 | {'EXIT', Parent, Reason} ->
109 | case plain_fsm:info(parent) of
110 | Parent ->
111 | plain_fsm:parent_EXIT(Reason, Keep)
112 | end
113 |
114 | end.
115 |
116 |
117 | hibernate_scan_only(Keep) ->
118 | erlang:garbage_collect(),
119 | receive
120 | {step, From, HowMany} ->
121 | {IX, OutBin, IsLastLevel, KVs, N} = erlang:binary_to_term(Keep),
122 | scan_only(hanoidb_reader:deserialize(IX),
123 | hanoidb_writer:deserialize(OutBin),
124 | IsLastLevel, KVs, {N+HowMany, From});
125 |
126 | %% gen_fsm handling
127 | {system, From, Req} ->
128 | plain_fsm:handle_system_msg(
129 | Req, From, Keep, fun hibernate_scan_only/1);
130 |
131 | {'EXIT', Parent, Reason} ->
132 | case plain_fsm:info(parent) of
133 | Parent ->
134 | plain_fsm:parent_EXIT(Reason, Keep)
135 | end
136 | end.
137 |
138 |
139 | receive_scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}) ->
140 |
141 | receive
142 | {step, From, HowMany} ->
143 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N+HowMany, From});
144 |
145 | %% gen_fsm handling
146 | {system, From, Req} ->
147 | plain_fsm:handle_system_msg(
148 | Req, From, {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}},
149 | fun({IXA2, IXB2, Out2, IsLastLevel2, AKVs2, BKVs2, {N2, FromPID2}}) ->
150 | receive_scan(IXA2, IXB2, Out2, IsLastLevel2, AKVs2, BKVs2, {N2, FromPID2})
151 | end);
152 |
153 | {'EXIT', Parent, Reason} ->
154 | case plain_fsm:info(parent) of
155 | Parent ->
156 | plain_fsm:parent_EXIT(Reason, {IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}})
157 | end
158 |
159 | after ?HIBERNATE_TIMEOUT ->
160 | Args = {hanoidb_reader:serialize(IXA),
161 | hanoidb_reader:serialize(IXB),
162 | hanoidb_writer:serialize(Out), IsLastLevel, AKVs, BKVs, N},
163 | Keep = erlang:term_to_binary(Args, [compressed]),
164 | hibernate_scan(Keep)
165 | end.
166 |
167 |
168 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID}) when N < 1, AKVs =/= [], BKVs =/= [] ->
169 | case FromPID of
170 | none ->
171 | ok;
172 | {PID, Ref} ->
173 | PID ! {Ref, step_done}
174 | end,
175 |
176 | receive_scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, {N, FromPID});
177 |
178 | scan(IXA, IXB, Out, IsLastLevel, [], BKVs, Step) ->
179 | case hanoidb_reader:next_node(IXA) of
180 | {kvlist, AKVs} ->
181 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, Step);
182 | end_of_data ->
183 | hanoidb_reader:close(IXA),
184 | scan_only(IXB, Out, IsLastLevel, BKVs, Step)
185 | end;
186 |
187 | scan(IXA, IXB, Out, IsLastLevel, AKVs, [], Step) ->
188 | case hanoidb_reader:next_node(IXB) of
189 | {kvlist, BKVs} ->
190 | scan(IXA, IXB, Out, IsLastLevel, AKVs, BKVs, Step);
191 | end_of_data ->
192 | hanoidb_reader:close(IXB),
193 | scan_only(IXA, Out, IsLastLevel, AKVs, Step)
194 | end;
195 |
196 | scan(IXA, IXB, Out, IsLastLevel, [{Key1,Value1}|AT]=_AKVs, [{Key2,_Value2}|_IX]=BKVs, Step)
197 | when Key1 < Key2 ->
198 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key1, Value1}, Out),
199 | scan(IXA, IXB, Out3, IsLastLevel, AT, BKVs, step(Step));
200 | scan(IXA, IXB, Out, IsLastLevel, [{Key1,_Value1}|_AT]=AKVs, [{Key2,Value2}|IX]=_BKVs, Step)
201 | when Key1 > Key2 ->
202 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key2, Value2}, Out),
203 | scan(IXA, IXB, Out3, IsLastLevel, AKVs, IX, step(Step));
204 | scan(IXA, IXB, Out, IsLastLevel, [{_Key1,_Value1}|AT]=_AKVs, [{Key2,Value2}|IX]=_BKVs, Step) ->
205 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key2, Value2}, Out),
206 | scan(IXA, IXB, Out3, IsLastLevel, AT, IX, step(Step, 2)).
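%% Merge rule sketch (illustrative): B holds the newer (injected) tree, so for
%% duplicate keys B's value wins. Merging
%%   A = [{<<"a">>,<<"1">>}, {<<"b">>,<<"1">>}]
%%   B = [{<<"b">>,<<"2">>}, {<<"c">>,<<"2">>}]
%% yields [{<<"a">>,<<"1">>}, {<<"b">>,<<"2">>}, {<<"c">>,<<"2">>}].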
207 |
208 |
209 | receive_scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID}) ->
210 |
211 |
212 | receive
213 | {step, From, HowMany} ->
214 | scan_only(IX, Out, IsLastLevel, KVs, {N+HowMany, From});
215 |
216 | %% gen_fsm handling
217 | {system, From, Req} ->
218 | plain_fsm:handle_system_msg(
219 | Req, From, {IX, Out, IsLastLevel, KVs, {N, FromPID}},
220 | fun({IX2, Out2, IsLastLevel2, KVs2, {N2, FromPID2}}) ->
221 | receive_scan_only(IX2, Out2, IsLastLevel2, KVs2, {N2, FromPID2})
222 | end);
223 |
224 | {'EXIT', Parent, Reason} ->
225 | case plain_fsm:info(parent) of
226 | Parent ->
227 | plain_fsm:parent_EXIT(Reason, {IX, Out, IsLastLevel, KVs, {N, FromPID}})
228 | end
229 |
230 | after ?HIBERNATE_TIMEOUT ->
231 | Args = {hanoidb_reader:serialize(IX),
232 | hanoidb_writer:serialize(Out), IsLastLevel, KVs, N},
233 | Keep = erlang:term_to_binary(Args, [compressed]),
234 | hibernate_scan_only(Keep)
235 | end.
236 |
237 |
238 |
239 | scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID}) when N < 1, KVs =/= [] ->
240 | case FromPID of
241 | none ->
242 | ok;
243 | {PID, Ref} ->
244 | PID ! {Ref, step_done}
245 | end,
246 |
247 | receive_scan_only(IX, Out, IsLastLevel, KVs, {N, FromPID});
248 |
249 | scan_only(IX, Out, IsLastLevel, [], {_, FromPID}=Step) ->
250 | case hanoidb_reader:next_node(IX) of
251 | {kvlist, KVs} ->
252 | scan_only(IX, Out, IsLastLevel, KVs, Step);
253 | end_of_data ->
254 | case FromPID of
255 | none ->
256 | ok;
257 | {PID, Ref} ->
258 | PID ! {Ref, step_done}
259 | end,
260 | hanoidb_reader:close(IX),
261 | terminate(Out)
262 | end;
263 |
264 | scan_only(IX, Out, true, [{_,?TOMBSTONE}|Rest], Step) ->
265 | scan_only(IX, Out, true, Rest, step(Step));
266 |
267 | scan_only(IX, Out, IsLastLevel, [{Key,Value}|Rest], Step) ->
268 | {noreply, Out3} = hanoidb_writer:handle_cast({add, Key, Value}, Out),
269 | scan_only(IX, Out3, IsLastLevel, Rest, step(Step)).
270 |
--------------------------------------------------------------------------------
/src/hanoidb_nursery.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_nursery).
26 | -author('Kresten Krab Thorup ').
27 |
28 | -export([new/4, recover/5, finish/2, lookup/2, add/4, add/5]).
29 | -export([do_level_fold/3, set_max_level/2, transact/3, destroy/1]).
30 |
31 | -include("include/hanoidb.hrl").
32 | -include("hanoidb.hrl").
33 | -include_lib("kernel/include/file.hrl").
34 |
35 | -spec new(string(), integer(), integer(), [_]) -> {ok, #nursery{}} | {error, term()}.
36 |
37 | -define(LOGFILENAME(Dir), filename:join(Dir, "nursery.log")).
38 |
39 | %% perform an incremental merge step after this many inserts;
40 | %% this value *must* be less than or equal to
41 | %% 2^TOP_LEVEL == ?BTREE_SIZE(?TOP_LEVEL)
42 | -define(INC_MERGE_STEP, ?BTREE_SIZE(MinLevel) div 2).
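%% e.g. with MinLevel = 10 this is ?BTREE_SIZE(10) div 2 = 512, so an
%% incremental merge step is issued roughly every 512 nursery operations.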
43 |
44 | new(Directory, MinLevel, MaxLevel, Config) ->
45 | hanoidb_util:ensure_expiry(Config),
46 |
47 | {ok, File} = file:open(?LOGFILENAME(Directory),
48 | [raw, exclusive, write, delayed_write, append]),
49 | {ok, #nursery{ log_file=File, dir=Directory, cache= gb_trees:empty(),
50 | min_level=MinLevel, max_level=MaxLevel, config=Config }}.
51 |
52 |
53 | recover(Directory, TopLevel, MinLevel, MaxLevel, Config)
54 | when MinLevel =< MaxLevel, is_integer(MinLevel), is_integer(MaxLevel) ->
55 | hanoidb_util:ensure_expiry(Config),
56 | case file:read_file_info(?LOGFILENAME(Directory)) of
57 | {ok, _} ->
58 | ok = do_recover(Directory, TopLevel, MinLevel, MaxLevel, Config),
59 | new(Directory, MinLevel, MaxLevel, Config);
60 | {error, enoent} ->
61 | new(Directory, MinLevel, MaxLevel, Config)
62 | end.
63 |
64 | do_recover(Directory, TopLevel, MinLevel, MaxLevel, Config) ->
65 | %% repair the log file by replaying it into a fresh nursery and flushing it
66 | LogFileName = ?LOGFILENAME(Directory),
67 | {ok, Nursery} = read_nursery_from_log(Directory, MinLevel, MaxLevel, Config),
68 | ok = finish(Nursery, TopLevel),
69 | %% assert log file is gone
70 | {error, enoent} = file:read_file_info(LogFileName),
71 | ok.
72 |
73 | fill_cache({Key, Value}, Cache)
74 | when is_binary(Value); Value =:= ?TOMBSTONE ->
75 | gb_trees:enter(Key, Value, Cache);
76 | fill_cache({Key, {Value, _TStamp}=Entry}, Cache)
77 | when is_binary(Value); Value =:= ?TOMBSTONE ->
78 | gb_trees:enter(Key, Entry, Cache);
79 | fill_cache([], Cache) ->
80 | Cache;
81 | fill_cache(Transactions, Cache)
82 | when is_list(Transactions) ->
83 | lists:foldl(fun fill_cache/2, Cache, Transactions).
84 |
85 | read_nursery_from_log(Directory, MinLevel, MaxLevel, Config) ->
86 | {ok, LogBinary} = file:read_file(?LOGFILENAME(Directory)),
87 | Cache =
88 | case hanoidb_util:decode_crc_data(LogBinary, [], []) of
89 | {ok, KVs} ->
90 | fill_cache(KVs, gb_trees:empty());
91 | {partial, KVs, _ErrorData} ->
92 | error_logger:info_msg("ignoring undecipherable bytes in ~p~n", [?LOGFILENAME(Directory)]),
93 | fill_cache(KVs, gb_trees:empty())
94 | end,
95 | {ok, #nursery{ dir=Directory, cache=Cache, count=gb_trees:size(Cache), min_level=MinLevel, max_level=MaxLevel, config=Config }}.
96 |
97 | %% @doc Add a Key/Value to the nursery
98 | %% @end
99 | -spec do_add(#nursery{}, binary(), binary()|?TOMBSTONE, non_neg_integer() | infinity, pid()) -> {ok, #nursery{}} | {full, #nursery{}}.
100 | do_add(Nursery, Key, Value, infinity, Top) ->
101 | do_add(Nursery, Key, Value, 0, Top);
102 | do_add(Nursery=#nursery{log_file=File, cache=Cache, total_size=TotalSize, count=Count, config=Config}, Key, Value, KeyExpiryTime, Top) ->
103 | DatabaseExpiryTime = hanoidb:get_opt(expiry_secs, Config),
104 |
105 | {Data, Cache2} =
106 | if (KeyExpiryTime + DatabaseExpiryTime) == 0 ->
107 | %% Both the database expiry and this key's expiry are unset or set to 0
108 | %% (aka infinity) so never automatically expire the value.
109 | { hanoidb_util:crc_encapsulate_kv_entry(Key, Value),
110 | gb_trees:enter(Key, Value, Cache) };
111 | true ->
112 | Expiry =
113 | if DatabaseExpiryTime == 0 ->
114 | %% It was the database's setting that was 0 so expire this
115 | %% value after KeyExpiryTime seconds elapse.
116 | hanoidb_util:expiry_time(KeyExpiryTime);
117 | true ->
118 | if KeyExpiryTime == 0 ->
119 | hanoidb_util:expiry_time(DatabaseExpiryTime);
120 | true ->
121 | hanoidb_util:expiry_time(min(KeyExpiryTime, DatabaseExpiryTime))
122 | end
123 | end,
124 | { hanoidb_util:crc_encapsulate_kv_entry(Key, {Value, Expiry}),
125 | gb_trees:enter(Key, {Value, Expiry}, Cache) }
126 | end,
127 |
128 | ok = file:write(File, Data),
129 | Nursery1 = do_sync(File, Nursery),
130 | {ok, Nursery2} = do_inc_merge(Nursery1#nursery{ cache=Cache2,
131 | total_size=TotalSize + erlang:iolist_size(Data),
132 | count=Count + 1 }, 1, Top),
133 | case has_room(Nursery2, 1) of
134 | true ->
135 | {ok, Nursery2};
136 | false ->
137 | {full, Nursery2}
138 | end.
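%% Expiry resolution sketch (illustrative, assuming {expiry_secs, 60} in the
%% nursery Config):
%%   do_add(N, K, V, 0,  Top) -> entry expires 60s from now (database default);
%%   do_add(N, K, V, 10, Top) -> entry expires 10s from now (min of the two);
%% with {expiry_secs, 0} and key expiry 0 or infinity, the entry never expires.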
139 |
140 | do_sync(File, Nursery) ->
141 | LastSync =
142 | case application:get_env(hanoidb, sync_strategy) of
143 | {ok, sync} ->
144 | file:datasync(File),
145 | os:timestamp();
146 | {ok, {seconds, N}} ->
147 | MicrosSinceLastSync = timer:now_diff(os:timestamp(), Nursery#nursery.last_sync),
148 | if (MicrosSinceLastSync div 1000000) >= N ->
149 | file:datasync(File),
150 | os:timestamp();
151 | true ->
152 | Nursery#nursery.last_sync
153 | end;
154 | _ ->
155 | Nursery#nursery.last_sync
156 | end,
157 | Nursery#nursery{last_sync = LastSync}.
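%% e.g. with {sync_strategy, sync} every write is followed by a datasync; with
%% {sync_strategy, {seconds, 5}} the log is datasync'ed at most once per 5
%% seconds; any other setting leaves durability to the OS page cache.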
158 |
159 |
160 | lookup(Key, #nursery{cache=Cache}) ->
161 | case gb_trees:lookup(Key, Cache) of
162 | {value, {Value, TStamp}} ->
163 | case hanoidb_util:has_expired(TStamp) of
164 | true ->
165 | {value, ?TOMBSTONE};
166 | false ->
167 | {value, Value}
168 | end;
169 | Reply ->
170 | Reply
171 | end.
172 |
173 | %% @doc
174 | %% Finish this nursery (encode it to a btree, and delete the nursery file)
175 | %% @end
176 | -spec finish(Nursery::#nursery{}, TopLevel::pid()) -> ok.
177 | finish(#nursery{ dir=Dir, cache=Cache, log_file=LogFile, merge_done=DoneMerge,
178 | count=Count, config=Config, min_level=MinLevel }, TopLevel) ->
179 |
180 | hanoidb_util:ensure_expiry(Config),
181 |
182 | %% First, close the log file (if it is open)
183 | case LogFile of
184 | undefined -> ok;
185 | _ -> ok = file:close(LogFile)
186 | end,
187 |
188 | case Count of
189 | N when N > 0 ->
190 | %% next, flush cache to a new BTree
191 | BTreeFileName = filename:join(Dir, "nursery.data"),
192 | {ok, BT} = hanoidb_writer:open(BTreeFileName, [{size, ?BTREE_SIZE(MinLevel)},
193 | {compress, none} | Config]),
194 | try
195 | ok = gb_trees_ext:fold(fun(Key, Value, Acc) ->
196 | ok = hanoidb_writer:add(BT, Key, Value),
197 | Acc
198 | end, ok, Cache)
199 | after
200 | ok = hanoidb_writer:close(BT)
201 | end,
202 |
203 | %% Inject the B-Tree (blocking RPC)
204 | ok = hanoidb_level:inject(TopLevel, BTreeFileName),
205 |
206 | %% Issue some work if this is a top-level inject (blocks until previous such
207 | %% incremental merge is finished).
208 | if DoneMerge >= ?BTREE_SIZE(MinLevel) ->
209 | ok;
210 | true ->
211 | hanoidb_level:begin_incremental_merge(TopLevel, ?BTREE_SIZE(MinLevel) - DoneMerge)
212 | end;
213 | % {ok, _Nursery2} = do_inc_merge(Nursery, Count, TopLevel);
214 |
215 | _ ->
216 | ok
217 | end,
218 |
219 | %% then, delete the log file
220 | LogFileName = filename:join(Dir, "nursery.log"),
221 | file:delete(LogFileName),
222 | ok.
223 |
224 | destroy(#nursery{ dir=Dir, log_file=LogFile }) ->
225 | %% first, close the log file
226 | if LogFile /= undefined ->
227 | ok = file:close(LogFile);
228 | true ->
229 | ok
230 | end,
231 | %% then delete it
232 | LogFileName = filename:join(Dir, "nursery.log"),
233 | file:delete(LogFileName),
234 | ok.
235 |
236 | -spec add(key(), value(), #nursery{}, pid()) -> {ok, #nursery{}}.
237 | add(Key, Value, Nursery, Top) ->
238 | add(Key, Value, infinity, Nursery, Top).
239 |
240 | -spec add(key(), value(), expiry(), #nursery{}, pid()) -> {ok, #nursery{}}.
241 | add(Key, Value, Expiry, Nursery, Top) ->
242 | case do_add(Nursery, Key, Value, Expiry, Top) of
243 | {ok, Nursery0} ->
244 | {ok, Nursery0};
245 | {full, Nursery0} ->
246 | flush(Nursery0, Top)
247 | end.
248 |
249 | -spec flush(#nursery{}, pid()) -> {ok, #nursery{}}.
250 | flush(Nursery=#nursery{ dir=Dir, min_level=MinLevel, max_level=MaxLevel, config=Config }, Top) ->
251 | ok = finish(Nursery, Top),
252 | {error, enoent} = file:read_file_info(filename:join(Dir, "nursery.log")),
253 | hanoidb_nursery:new(Dir, MinLevel, MaxLevel, Config).
254 |
255 | has_room(#nursery{ count=Count, min_level=MinLevel }, N) ->
256 | (Count + N + 1) < ?BTREE_SIZE(MinLevel).
257 |
258 | ensure_space(Nursery, NeededRoom, Top) ->
259 | case has_room(Nursery, NeededRoom) of
260 | true ->
261 | Nursery;
262 | false ->
263 | {ok, Nursery1} = flush(Nursery, Top),
264 | Nursery1
265 | end.
266 |
267 | transact(Spec, Nursery, Top) ->
268 | transact1(Spec, ensure_space(Nursery, length(Spec), Top), Top).
269 |
270 | transact1(Spec, Nursery1=#nursery{ log_file=File, cache=Cache0, total_size=TotalSize, config=Config }, Top) ->
271 | Expiry =
272 | case hanoidb:get_opt(expiry_secs, Config) of
273 | 0 ->
274 | infinity;
275 | DatabaseExpiryTime ->
276 | hanoidb_util:expiry_time(DatabaseExpiryTime)
277 | end,
278 |
279 | Data = hanoidb_util:crc_encapsulate_transaction(Spec, Expiry),
280 | ok = file:write(File, Data),
281 |
282 | Nursery2 = do_sync(File, Nursery1),
283 |
284 | Cache2 = lists:foldl(fun({put, Key, Value}, Cache) ->
285 | case Expiry of
286 | infinity ->
287 | gb_trees:enter(Key, Value, Cache);
288 | _ ->
289 | gb_trees:enter(Key, {Value, Expiry}, Cache)
290 | end;
291 | ({delete, Key}, Cache) ->
292 | case Expiry of
293 | infinity ->
294 | gb_trees:enter(Key, ?TOMBSTONE, Cache);
295 | _ ->
296 | gb_trees:enter(Key, {?TOMBSTONE, Expiry}, Cache)
297 | end
298 | end,
299 | Cache0,
300 | Spec),
301 |
302 | Count = gb_trees:size(Cache2),
303 |
304 | do_inc_merge(Nursery2#nursery{ cache=Cache2, total_size=TotalSize+erlang:iolist_size(Data), count=Count }, length(Spec), Top).
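%% Usage sketch (illustrative): atomically apply a batch of puts and deletes
%% through transact/3; Top is the pid of the top-level hanoidb_level process.
%%
%%   Spec = [{put, <<"a">>, <<"1">>}, {delete, <<"b">>}],
%%   {ok, Nursery2} = hanoidb_nursery:transact(Spec, Nursery1, Top).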
305 |
306 | do_inc_merge(Nursery=#nursery{ step=Step, merge_done=Done, min_level=MinLevel }, N, TopLevel) ->
307 | if Step+N >= ?INC_MERGE_STEP ->
308 | hanoidb_level:begin_incremental_merge(TopLevel, Step + N),
309 | {ok, Nursery#nursery{ step=0, merge_done=Done + Step + N }};
310 | true ->
311 | {ok, Nursery#nursery{ step=Step + N }}
312 | end.
313 |
314 | do_level_fold(#nursery{cache=Cache}, FoldWorkerPID, KeyRange) ->
315 | Ref = erlang:make_ref(),
316 | FoldWorkerPID ! {prefix, [Ref]},
317 | case gb_trees_ext:fold(
318 | fun(_, _, {LastKey, limit}) ->
319 | {LastKey, limit};
320 | (Key, Value, {LastKey, Count}) ->
321 | case ?KEY_IN_RANGE(Key, KeyRange) andalso (not is_expired(Value)) of
322 | true ->
323 | BinOrTombstone = get_value(Value),
324 | FoldWorkerPID ! {level_result, Ref, Key, BinOrTombstone},
325 | case BinOrTombstone of
326 | ?TOMBSTONE ->
327 | {Key, Count};
328 | _ ->
329 | {Key, decrement(Count)}
330 | end;
331 | false ->
332 | {LastKey, Count}
333 | end
334 | end,
335 | {undefined, KeyRange#key_range.limit},
336 | Cache)
337 | of
338 | {LastKey, limit} when LastKey =/= undefined ->
339 | FoldWorkerPID ! {level_limit, Ref, LastKey};
340 | _ ->
341 | FoldWorkerPID ! {level_done, Ref}
342 | end,
343 | ok.
344 |
345 | set_max_level(Nursery = #nursery{}, MaxLevel) ->
346 | Nursery#nursery{ max_level = MaxLevel }.
347 |
348 | decrement(undefined) ->
349 | undefined;
350 | decrement(1) ->
351 | limit;
352 | decrement(Number) ->
353 | Number-1.
354 |
355 | %%%
356 |
357 | % TODO this is duplicate code also found in hanoidb_reader
358 | is_expired(?TOMBSTONE) ->
359 | false;
360 | is_expired({_Value, TStamp}) ->
361 | hanoidb_util:has_expired(TStamp);
362 | is_expired(Bin) when is_binary(Bin) ->
363 | false.
364 |
365 | get_value({Value, TStamp}) when is_integer(TStamp); TStamp =:= infinity ->
366 | Value;
367 | get_value(Value) when Value =:= ?TOMBSTONE; is_binary(Value) ->
368 | Value.
369 |
370 |
--------------------------------------------------------------------------------
/src/hanoidb_reader.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_reader).
26 | -author('Kresten Krab Thorup ').
27 |
28 | -include_lib("kernel/include/file.hrl").
29 | -include("include/hanoidb.hrl").
30 | -include("hanoidb.hrl").
31 | -include("include/plain_rpc.hrl").
32 |
33 | -define(ASSERT_WHEN(X), when X).
34 |
35 | -export([open/1, open/2,close/1,lookup/2,fold/3,range_fold/4, destroy/1]).
36 | -export([first_node/1,next_node/1]).
37 | -export([serialize/1, deserialize/1]).
38 |
39 | -record(node, {level :: non_neg_integer(),
40 | members=[] :: list(any()) | binary() }).
41 |
42 | -record(index, {file :: file:io_device(),
43 | root= none :: #node{} | none,
44 | bloom :: term(),
45 | name :: string(),
46 | config=[] :: term() }).
47 |
48 | -type read_file() :: #index{}.
49 | -export_type([read_file/0]).
50 |
51 | -spec open(Name::string()) -> {ok, read_file()} | {error, any()}.
52 | open(Name) ->
53 | open(Name, [random]).
54 |
55 | -type config() :: [sequential | folding | random | {atom(), term()}].
56 | -spec open(Name::string(), config()) -> {ok, read_file()} | {error, any()}.
57 | open(Name, Config) ->
58 | case proplists:get_bool(sequential, Config) of
59 | true ->
60 | ReadBufferSize = hanoidb:get_opt(read_buffer_size, Config, 512 * 1024),
61 | case file:open(Name, [raw,read,{read_ahead, ReadBufferSize},binary]) of
62 | {ok, File} ->
63 | {ok, #index{file=File, name=Name, config=Config}};
64 | {error, _}=Err ->
65 | Err
66 | end;
67 |
68 | false ->
69 | {ok, File} =
70 | case proplists:get_bool(folding, Config) of
71 | true ->
72 | ReadBufferSize = hanoidb:get_opt(read_buffer_size, Config, 512 * 1024),
73 | file:open(Name, [read, {read_ahead, ReadBufferSize}, binary]);
74 | false ->
75 | file:open(Name, [read, binary])
76 | end,
77 |
78 | {ok, FileInfo} = file:read_file_info(Name),
79 |
80 | %% read and validate magic tag
81 | {ok, ?FILE_FORMAT} = file:pread(File, 0, byte_size(?FILE_FORMAT)),
82 |
83 | %% read root position
84 | {ok, <<RootPos:64/unsigned>>} = file:pread(File, FileInfo#file_info.size - 8, 8),
85 | {ok, <<BloomSize:32/unsigned>>} = file:pread(File, FileInfo#file_info.size - 12, 4),
86 | {ok, BloomData} = file:pread(File, (FileInfo#file_info.size - 12 - BloomSize), BloomSize),
87 | {ok, Bloom} = hanoidb_util:bin_to_bloom(BloomData),
88 |
89 | %% read in the root node
90 | Root =
91 | case read_node(File, RootPos) of
92 | {ok, Node} ->
93 | Node;
94 | eof ->
95 | none
96 | end,
97 |
98 | {ok, #index{file=File, root=Root, bloom=Bloom, name=Name, config=Config}}
99 | end.
100 |
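%% Usage sketch (illustrative; "A-10.data" is a hypothetical level file):
%%
%%   {ok, IX} = hanoidb_reader:open("A-10.data", [random]),
%%   Result   = hanoidb_reader:lookup(IX, <<"some_key">>),  % {ok,Value} | not_found
%%   ok       = hanoidb_reader:close(IX).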
101 | destroy(#index{file=File, name=Name}) ->
102 | ok = file:close(File),
103 | file:delete(Name).
104 |
105 | serialize(#index{file=File, bloom=undefined }=Index) ->
106 | {ok, Position} = file:position(File, cur),
107 | ok = file:close(File),
108 | {seq_read_file, Index, Position}.
109 |
110 | deserialize({seq_read_file, Index, Position}) ->
111 | {ok, #index{file=File}=Index2} = open(Index#index.name, Index#index.config),
112 | {ok, Position} = file:position(File, {bof, Position}),
113 | Index2.
114 |
115 |
116 |
117 |
118 | fold(Fun, Acc0, #index{file=File}) ->
119 | {ok, Node} = read_node(File,?FIRST_BLOCK_POS),
120 | fold0(File,fun({K,V},Acc) -> Fun(K,V,Acc) end,Node,Acc0).
121 |
122 | fold0(File,Fun,#node{level=0, members=BinPage},Acc0) when is_binary(BinPage) ->
123 | Acc1 = vbisect:foldl(fun(K, V, Acc2) -> Fun({K, decode_binary_value(V)}, Acc2) end,Acc0,BinPage),
124 | fold1(File,Fun,Acc1);
125 | fold0(File,Fun,#node{level=0, members=List},Acc0) when is_list(List) ->
126 | Acc1 = lists:foldl(Fun,Acc0,List),
127 | fold1(File,Fun,Acc1);
128 | fold0(File,Fun,_InnerNode,Acc0) ->
129 | fold1(File,Fun,Acc0).
130 |
131 | fold1(File,Fun,Acc0) ->
132 | case next_leaf_node(File) of
133 | eof ->
134 | Acc0;
135 | {ok, Node} ->
136 | fold0(File,Fun,Node,Acc0)
137 | end.
138 |
139 | -spec range_fold(fun((binary(),binary(),any()) -> any()), any(), #index{}, #key_range{}) ->
140 | {limit, any(), binary()} | {done, any()}.
141 | range_fold(Fun, Acc0, #index{file=File,root=Root}, Range) ->
142 | case Range#key_range.from_key =< first_key(Root) of
143 | true ->
144 | {ok, _} = file:position(File, ?FIRST_BLOCK_POS),
145 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit);
146 | false ->
147 | case find_leaf_node(File,Range#key_range.from_key,Root,?FIRST_BLOCK_POS) of
148 | {ok, {Pos,_}} ->
149 | {ok, _} = file:position(File, Pos),
150 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit);
151 | {ok, Pos} ->
152 | {ok, _} = file:position(File, Pos),
153 | range_fold_from_here(Fun, Acc0, File, Range, Range#key_range.limit);
154 | none ->
155 | {done, Acc0}
156 | end
157 | end.
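%% Usage sketch (illustrative; assumes an open #index{} IX and the #key_range{}
%% record from include/hanoidb.hrl):
%%
%%   Range = #key_range{ from_key = <<>>, from_inclusive = true,
%%                       to_key = undefined, to_inclusive = false, limit = 100 },
%%   case range_fold(fun(K, V, Acc) -> [{K, V} | Acc] end, [], IX, Range) of
%%       {done, KVs}            -> lists:reverse(KVs);
%%       {limit, KVs, _NextKey} -> lists:reverse(KVs)
%%   end.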
158 |
159 | first_key(#node{members=Dict}) ->
160 | {_,FirstKey} = fold_until_stop(fun({K,_},_) -> {stop, K} end, none, Dict),
161 | FirstKey.
162 |
163 | fold_until_stop(Fun,Acc,List) when is_list(List) ->
164 | fold_until_stop2(Fun, {continue, Acc}, List);
165 | fold_until_stop(Fun,Acc0,Bin) when is_binary(Bin) ->
166 | vbisect:fold_until_stop(fun({Key,VBin},Acc1) ->
167 | % io:format("-> DOING ~p,~p~n", [Key,Acc1]),
168 | Fun({Key, decode_binary_value(VBin)}, Acc1)
169 | end,
170 | Acc0,
171 | Bin).
172 |
173 | fold_until_stop2(_Fun,{stop,Result},_) ->
174 | {stopped, Result};
175 | fold_until_stop2(_Fun,{continue, Acc},[]) ->
176 | {ok, Acc};
177 | fold_until_stop2(Fun,{continue, Acc},[H|T]) ->
178 | fold_until_stop2(Fun,Fun(H,Acc),T).
179 |
180 | % TODO this is duplicate code also found in hanoidb_nursery
181 | is_expired(?TOMBSTONE) ->
182 | false;
183 | is_expired({_Value, TStamp}) ->
184 | hanoidb_util:has_expired(TStamp);
185 | is_expired(Bin) when is_binary(Bin) ->
186 | false.
187 |
188 | get_value({Value, _TStamp}) ->
189 | Value;
190 | get_value(Value) ->
191 | Value.
192 |
193 | range_fold_from_here(Fun, Acc0, File, Range, undefined) ->
194 | % io:format("RANGE_FOLD_FROM_HERE(~p,~p)~n", [Acc0,File]),
195 | case next_leaf_node(File) of
196 | eof ->
197 | {done, Acc0};
198 |
199 | {ok, #node{members=Members}} ->
200 | case fold_until_stop(fun({Key,_}, Acc) when not ?KEY_IN_TO_RANGE(Key,Range) ->
201 | {stop, {done, Acc}};
202 | ({Key,Value}, Acc) when ?KEY_IN_FROM_RANGE(Key, Range) ->
203 | case is_expired(Value) of
204 | true ->
205 | {continue, Acc};
206 | false ->
207 | {continue, Fun(Key, get_value(Value), Acc)}
208 | end;
209 | (_Huh, Acc) ->
210 | % io:format("SKIPPING ~p~n", [_Huh]),
211 | {continue, Acc}
212 | end,
213 | Acc0,
214 | Members) of
215 | {stopped, Result} -> Result;
216 | {ok, Acc1} ->
217 | range_fold_from_here(Fun, Acc1, File, Range, undefined)
218 | end
219 | end;
220 |
221 | range_fold_from_here(Fun, Acc0, File, Range, N0) ->
222 | case next_leaf_node(File) of
223 | eof ->
224 | {done, Acc0};
225 |
226 | {ok, #node{members=Members}} ->
227 | case fold_until_stop(fun({Key,_}, {0,Acc}) ->
228 | {stop, {limit, Acc, Key}};
229 | ({Key,_}, {_,Acc}) when not ?KEY_IN_TO_RANGE(Key,Range)->
230 | {stop, {done, Acc}};
231 | ({Key,?TOMBSTONE}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) ->
232 | {continue, {N1, Fun(Key, ?TOMBSTONE, Acc)}};
233 | ({Key,{?TOMBSTONE,TStamp}}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) ->
234 | case hanoidb_util:has_expired(TStamp) of
235 | true ->
236 | {continue, {N1,Acc}};
237 | false ->
238 | {continue, {N1, Fun(Key, ?TOMBSTONE, Acc)}}
239 | end;
240 | ({Key,Value}, {N1,Acc}) when ?KEY_IN_FROM_RANGE(Key,Range) ->
241 | case is_expired(Value) of
242 | true ->
243 | {continue, {N1,Acc}};
244 | false ->
245 | {continue, {N1-1, Fun(Key, get_value(Value), Acc)}}
246 | end;
247 | (_, Acc) ->
248 | {continue, Acc}
249 | end,
250 | {N0, Acc0},
251 | Members)
252 | of
253 | {stopped, Result} ->
254 | Result;
255 | {ok, {N2, Acc1}} ->
256 | range_fold_from_here(Fun, Acc1, File, Range, N2)
257 | end
258 | end.
259 |
260 | find_leaf_node(_File,_FromKey,#node{level=0},Pos) ->
261 | {ok, Pos};
262 | find_leaf_node(File,FromKey,#node{members=Members,level=N},_) when is_list(Members) ->
263 | case find_start(FromKey, Members) of
264 | {ok, ChildPos} ->
265 | recursive_find(File, FromKey, N, ChildPos);
266 | not_found ->
267 | none
268 | end;
269 | find_leaf_node(File,FromKey,#node{members=Members,level=N},_) when is_binary(Members) ->
270 | case vbisect:find_geq(FromKey,Members) of
271 | {ok, _, <<Pos:64/unsigned, Len:32/unsigned>>} ->
272 | % io:format("** FIND_LEAF_NODE(~p,~p) -> {~p,~p}~n", [FromKey, N, Pos,Len]),
273 | recursive_find(File, FromKey, N, {Pos,Len});
274 | none ->
275 | % io:format("** FIND_LEAF_NODE(~p,~p) -> none~n", [FromKey, N]),
276 | none
277 | end;
278 | find_leaf_node(_,_,none,_) ->
279 | none.
280 |
281 | recursive_find(_File,_FromKey,1,ChildPos) ->
282 | {ok, ChildPos};
283 | recursive_find(File,FromKey,N,ChildPos) when N>1 ->
284 | case read_node(File,ChildPos) of
285 | {ok, ChildNode} ->
286 | find_leaf_node(File, FromKey,ChildNode,ChildPos);
287 | eof ->
288 | none
289 | end.
290 |
291 |
292 | %% used by the merger, needs list value
293 | first_node(#index{file=File}) ->
294 | case read_node(File, ?FIRST_BLOCK_POS) of
295 | {ok, #node{level=0, members=Members}} ->
296 | {kvlist, decode_member_list(Members)};
297 | eof->
298 | none
299 | end.
300 |
301 | %% used by the merger, needs list value
302 | next_node(#index{file=File}=_Index) ->
303 | case next_leaf_node(File) of
304 | {ok, #node{level=0, members=Members}} ->
305 | {kvlist, decode_member_list(Members)};
306 | eof ->
307 | end_of_data
308 | end.
309 |
310 | decode_member_list(List) when is_list(List) ->
311 | List;
312 | decode_member_list(BinDict) when is_binary(BinDict) ->
313 | vbisect:foldr( fun(Key,Value,Acc) ->
314 | [{Key, decode_binary_value(Value) }|Acc]
315 | end,
316 | [],
317 | BinDict).
318 |
319 | close(#index{file=undefined}) ->
320 | ok;
321 | close(#index{file=File}) ->
322 | file:close(File).
323 |
324 |
325 | lookup(#index{file=File, root=Node, bloom=Bloom}, Key) ->
326 | case ?BLOOM_CONTAINS(Bloom, Key) of
327 | true ->
328 | case lookup_in_node(File, Node, Key) of
329 | not_found ->
330 | not_found;
331 | {ok, {Value, TStamp}} ?ASSERT_WHEN(Value =:= ?TOMBSTONE; is_binary(Value)) ->
332 | case hanoidb_util:has_expired(TStamp) of
333 | true -> not_found;
334 | false -> {ok, Value}
335 | end;
336 | {ok, Value}=Reply ?ASSERT_WHEN(Value =:= ?TOMBSTONE; is_binary(Value)) ->
337 | Reply
338 | end;
339 | false ->
340 | not_found
341 | end.
342 |
343 | lookup_in_node(_File,#node{level=0,members=Members}, Key) ->
344 | find_in_leaf(Key,Members);
345 |
346 | lookup_in_node(File,#node{members=Members},Key) when is_binary(Members) ->
347 | case vbisect:find_geq(Key,Members) of
348 | {ok, _Key, <<?TAG_POSLEN32, Pos:64/unsigned, Size:32/unsigned>>} ->
349 | % io:format("FOUND ~p @ ~p~n", [_Key, {Pos,Size}]),
350 | case read_node(File,{Pos,Size}) of
351 | {ok, Node} ->
352 | lookup_in_node(File, Node, Key);
353 | eof ->
354 | not_found
355 | end;
356 | none ->
357 | not_found
358 | end;
359 |
360 | lookup_in_node(File,#node{members=Members},Key) ->
361 | case find_1(Key, Members) of
362 | {ok, {Pos,Size}} ->
363 | %% do this in separate process, to avoid having to
364 | %% garbage collect all the inner node junk
365 | PID = proc_lib:spawn_link(fun() ->
366 | receive
367 | ?CALL(From,read) ->
368 | case read_node(File, {Pos,Size}) of
369 | {ok, Node} ->
370 | Result = lookup_in_node2(File, Node, Key),
371 | plain_rpc:send_reply(From, Result);
372 | eof ->
373 | plain_rpc:send_reply(From, {error, eof})
374 | end
375 | end
376 | end),
377 | try plain_rpc:call(PID, read)
378 | catch
379 | Class:Ex ->
380 | error_logger:error_msg("crashX: ~p:~p ~p~n", [Class,Ex,erlang:get_stacktrace()]),
381 | not_found
382 | end;
383 |
384 | not_found ->
385 | not_found
386 | end.
387 |
388 |
389 | lookup_in_node2(_File,#node{level=0,members=Members},Key) ->
390 | case lists:keyfind(Key,1,Members) of
391 | false ->
392 | not_found;
393 | {_,Value} ->
394 | {ok, Value}
395 | end;
396 |
397 | lookup_in_node2(File,#node{members=Members},Key) ->
398 | case find_1(Key, Members) of
399 | {ok, {Pos,Size}} ->
400 | case read_node(File, {Pos,Size}) of
401 | {ok, Node} ->
402 | lookup_in_node2(File, Node, Key);
403 | eof ->
404 | {error, eof}
405 | end;
406 | not_found ->
407 | not_found
408 | end.
409 |
410 |
411 | find_1(K, [{K1,V},{K2,_}|_]) when K >= K1, K < K2 ->
412 | {ok, V};
413 | find_1(K, [{K1,V}]) when K >= K1 ->
414 | {ok, V};
415 | find_1(K, [_|T]) ->
416 | find_1(K,T);
417 | find_1(_, _) ->
418 | not_found.
419 |
420 |
421 | find_start(K, [{_,V},{K2,_}|_]) when K < K2 ->
422 | {ok, V};
423 | find_start(_, [{_,{_,_}=V}]) ->
424 | {ok, V};
425 | find_start(K, KVs) ->
426 | find_1(K, KVs).
427 |
428 |
429 | -spec read_node(file:io_device(), non_neg_integer() | { non_neg_integer(), non_neg_integer() }) ->
430 | {ok, #node{}} | eof.
431 |
432 | read_node(File, {Pos, Size}) ->
433 | % error_logger:info_msg("read_node ~p ~p ~p~n", [File, Pos, Size]),
434 | {ok, <<_:32/unsigned, Level:16/unsigned, Data/binary>>} = file:pread(File, Pos, Size),
435 | hanoidb_util:decode_index_node(Level, Data);
436 |
437 | read_node(File, Pos) ->
438 | % error_logger:info_msg("read_node ~p ~p~n", [File, Pos]),
439 | {ok, Pos} = file:position(File, Pos),
440 | Result = read_node(File),
441 | % error_logger:info_msg("decoded ~p ~p~n", [Pos, Result]),
442 | Result.
443 |
444 | read_node(File) ->
445 | % error_logger:info_msg("read_node ~p~n", [File]),
446 | {ok, <<Len:32/unsigned, Level:16/unsigned>>} = file:read(File, 6),
447 | % error_logger:info_msg("decoded ~p ~p~n", [Len, Level]),
448 | case Len of
449 | 0 ->
450 | eof;
451 | _ ->
452 | {ok, Data} = file:read(File, Len-2),
453 | hanoidb_util:decode_index_node(Level, Data)
454 | end.
455 |
456 |
457 | next_leaf_node(File) ->
458 | case file:read(File, 6) of
459 | eof ->
460 | %% premature end-of-file
461 | eof;
462 | {ok, <<0:32/unsigned, _:16/unsigned>>} ->
463 | eof;
464 | {ok, <<Len:32/unsigned, 0:16/unsigned>>} ->
465 | {ok, Data} = file:read(File, Len-2),
466 | hanoidb_util:decode_index_node(0, Data);
467 | {ok, <<Len:32/unsigned, _:16/unsigned>>} ->
468 | {ok, _} = file:position(File, {cur,Len-2}),
469 | next_leaf_node(File)
470 | end.
471 |
472 |
473 | find_in_leaf(Key,Bin) when is_binary(Bin) ->
474 | case vbisect:find(Key,Bin) of
475 | {ok, BinValue} ->
476 | {ok, decode_binary_value(BinValue)};
477 | error ->
478 | not_found
479 | end;
480 | find_in_leaf(Key,List) when is_list(List) ->
481 | case lists:keyfind(Key, 1, List) of
482 | {_, Value} ->
483 | {ok, Value};
484 | false ->
485 | not_found
486 | end.
487 |
488 | decode_binary_value(<<?TAG_KV_DATA, Value/binary>>) ->
489 | Value;
490 | decode_binary_value(<<?TAG_KV_DATA2, TStamp:32/unsigned, Value/binary>>) ->
491 | {Value, TStamp};
492 | decode_binary_value(<<?TAG_DELETED>>) ->
493 | ?TOMBSTONE;
494 | decode_binary_value(<<?TAG_DELETED2, TStamp:32/unsigned>>) ->
495 | {?TOMBSTONE, TStamp};
496 | decode_binary_value(<<?TAG_POSLEN32, Pos:64/unsigned, Len:32/unsigned>>) ->
497 | {Pos, Len}.
498 |
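499 | %% Note on the on-disk node framing, as reconstructed from read_node/1,
500 | %% next_leaf_node/1 and the writer (a sketch, not normative): every node
501 | %% is prefixed by <<Len:32/unsigned, Level:16/unsigned>>, where Len counts
502 | %% the level word plus payload, hence the file:read(File, Len-2) calls
503 | %% above. A zero length word marks the start of the file trailer.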
--------------------------------------------------------------------------------
/src/hanoidb_sparse_bitmap.erl:
--------------------------------------------------------------------------------
1 | -module(hanoidb_sparse_bitmap).
2 | -export([new/1, set/2, member/2]).
3 |
4 | -define(REPR_NAME, sparse_bitmap).
5 |
6 | new(Bits) when is_integer(Bits), Bits>0 ->
7 | {?REPR_NAME, Bits, []}.
8 |
9 | set(N, {?REPR_NAME, Bits, Tree}) ->
10 | {?REPR_NAME, Bits, set_to_tree(N, 1 bsl (Bits-1), Tree)}.
11 |
12 | set_to_tree(N, HighestBit, Mask) when HighestBit<32 ->
13 | Nbit = 1 bsl N,
14 | case Mask of
15 | []-> Nbit;
16 | _ -> Nbit bor Mask
17 | end;
18 | set_to_tree(N, _HighestBit, []) -> N;
19 | set_to_tree(N, HighestBit, [TLo|THi]) ->
20 | pushdown(N, HighestBit, TLo, THi);
21 | set_to_tree(N, _HighestBit, N) -> N;
22 | set_to_tree(N, HighestBit, M) when is_integer(M) ->
23 | set_to_tree(N, HighestBit, pushdown(M, HighestBit, [], [])).
24 |
25 | pushdown(N, HighestBit, TLo, THi) ->
26 | NHigh = N band HighestBit,
27 | if NHigh =:= 0 -> [set_to_tree(N, HighestBit bsr 1, TLo) | THi];
28 | true -> [TLo | set_to_tree(N bxor NHigh, HighestBit bsr 1, THi)]
29 | end.
30 |
31 | member(N, {?REPR_NAME, Bits, Tree}) ->
32 | member_in_tree(N, 1 bsl (Bits-1), Tree).
33 |
34 | member_in_tree(_N, _HighestBit, []) -> false;
35 | member_in_tree(N, HighestBit, Mask) when HighestBit<32 ->
36 | Nbit = 1 bsl N,
37 | Nbit band Mask > 0;
38 | member_in_tree(N, _HighestBit, M) when is_integer(M) -> N =:= M;
39 | member_in_tree(N, HighestBit, [TLo|THi]) ->
40 | NHigh = N band HighestBit,
41 | if NHigh =:= 0 -> member_in_tree(N, HighestBit bsr 1, TLo);
42 | true -> member_in_tree(N bxor NHigh, HighestBit bsr 1, THi)
43 | end.
44 |
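45 | %% Note (inferred from the code above; a sketch, not normative): the
46 | %% bitmap is a binary trie keyed on the high bits of N. Subtrees spanning
47 | %% fewer than 32 bits collapse into a plain integer bitmask, and a
48 | %% subtree holding a single value is stored as that bare integer. E.g.:
49 | %%   T0 = hanoidb_sparse_bitmap:new(16),
50 | %%   T1 = hanoidb_sparse_bitmap:set(1000, T0),
51 | %%   true  = hanoidb_sparse_bitmap:member(1000, T1),
52 | %%   false = hanoidb_sparse_bitmap:member(1001, T1).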
--------------------------------------------------------------------------------
/src/hanoidb_sup.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_sup).
26 | -author('Kresten Krab Thorup <krab@trifork.com>').
27 |
28 | -behaviour(supervisor).
29 |
30 | %% API
31 | -export([start_link/0]).
32 |
33 | %% Supervisor callbacks
34 | -export([init/1]).
35 |
36 | %% Helper macro for declaring children of supervisor
37 | -define(CHILD(I, Type), {I, {I, start_link, []}, permanent, 5000, Type, [I]}).
38 |
39 | %% ===================================================================
40 | %% API functions
41 | %% ===================================================================
42 |
43 | start_link() ->
44 | supervisor:start_link({local, ?MODULE}, ?MODULE, []).
45 |
46 | %% ===================================================================
47 | %% Supervisor callbacks
48 | %% ===================================================================
49 |
50 | init([]) ->
51 | {ok, { {one_for_one, 5, 10}, []} }.
52 |
53 |
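54 | %% Note: the child spec list is intentionally empty; hanoidb processes
55 | %% appear to be spawned and linked directly by hanoidb:open/1 rather than
56 | %% supervised here, which is why the ?CHILD/2 macro is currently unused.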
--------------------------------------------------------------------------------
/src/hanoidb_util.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_util).
26 | -author('Kresten Krab Thorup <krab@trifork.com>').
27 |
28 | -export([ compress/2
29 | , uncompress/1
30 | , index_file_name/1
31 | , estimate_node_size_increment/3
32 | , encode_index_node/2
33 | , decode_index_node/2
34 | , crc_encapsulate_kv_entry/2
35 | , decode_crc_data/3
36 | , file_exists/1
37 | , crc_encapsulate_transaction/2
38 | , tstamp/0
39 | , expiry_time/1
40 | , has_expired/1
41 | , ensure_expiry/1
42 |
43 | , bloom_type/1
44 | , bloom_new/2
45 | , bloom_to_bin/1
46 | , bin_to_bloom/1
47 | , bin_to_bloom/2
48 | , bloom_insert/2
49 | , bloom_contains/2
50 | ]).
51 |
52 | -include("src/hanoidb.hrl").
53 |
54 | -define(ERLANG_ENCODED, 131).
55 | -define(CRC_ENCODED, 127).
56 | -define(BISECT_ENCODED, 126).
57 |
58 |
59 | -define(FILE_ENCODING, bisect).
60 |
61 | -compile({inline, [crc_encapsulate/1, crc_encapsulate_kv_entry/2 ]}).
62 |
63 |
64 | -spec index_file_name(string()) -> string().
65 | index_file_name(Name) ->
66 | Name.
67 |
68 | -spec file_exists(string()) -> boolean().
69 | file_exists(FileName) ->
70 | case file:read_file_info(FileName) of
71 | {ok, _} ->
72 | true;
73 | {error, enoent} ->
74 | false
75 | end.
76 |
77 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp})
78 | when is_integer(Value) -> byte_size(Key) + 5 + 4;
79 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp})
80 | when is_binary(Value) -> byte_size(Key) + 5 + 4 + byte_size(Value);
81 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp})
82 | when is_atom(Value) -> byte_size(Key) + 8 + 4;
83 | estimate_node_size_increment(_KVList, Key, {Value, _TStamp})
84 | when is_tuple(Value) -> byte_size(Key) + 13 + 4;
85 | estimate_node_size_increment(_KVList, Key, Value)
86 | when is_integer(Value) -> byte_size(Key) + 5 + 4;
87 | estimate_node_size_increment(_KVList, Key, Value)
88 | when is_binary(Value) -> byte_size(Key) + 5 + 4 + byte_size(Value);
89 | estimate_node_size_increment(_KVList, Key, Value)
90 | when is_atom(Value) -> byte_size(Key) + 8 + 4;
91 | estimate_node_size_increment(_KVList, Key, Value)
92 | when is_tuple(Value) -> byte_size(Key) + 13 + 4.
93 |
94 | -define(NO_COMPRESSION, 0).
95 | -define(SNAPPY_COMPRESSION, 1).
96 | -define(GZIP_COMPRESSION, 2).
97 | -define(LZ4_COMPRESSION, 3).
98 |
99 | use_compressed(UncompressedSize, CompressedSize) when CompressedSize < UncompressedSize ->
100 | true;
101 | use_compressed(_UncompressedSize, _CompressedSize) ->
102 | false.
103 |
104 | compress(snappy, Bin) ->
105 | {ok, CompressedBin} = snappy:compress(Bin),
106 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of
107 | true ->
108 | {?SNAPPY_COMPRESSION, CompressedBin};
109 | false ->
110 | {?NO_COMPRESSION, Bin}
111 | end;
112 | compress(lz4, Bin) ->
113 | {ok, CompressedBin} = lz4:compress(erlang:iolist_to_binary(Bin)),
114 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of
115 | true ->
116 | {?LZ4_COMPRESSION, CompressedBin};
117 | false ->
118 | {?NO_COMPRESSION, Bin}
119 | end;
120 | compress(gzip, Bin) ->
121 | CompressedBin = zlib:gzip(Bin),
122 | case use_compressed(erlang:iolist_size(Bin), erlang:iolist_size(CompressedBin)) of
123 | true ->
124 | {?GZIP_COMPRESSION, CompressedBin};
125 | false ->
126 | {?NO_COMPRESSION, Bin}
127 | end;
128 | compress(none, Bin) ->
129 | {?NO_COMPRESSION, Bin}.
130 |
131 | uncompress(<<?NO_COMPRESSION, Data/binary>>) ->
132 | Data;
133 | uncompress(<<?SNAPPY_COMPRESSION, Data/binary>>) ->
134 | {ok, UncompressedData} = snappy:decompress(Data),
135 | UncompressedData;
136 | uncompress(<<?LZ4_COMPRESSION, Data/binary>>) ->
137 | lz4:uncompress(Data);
138 | uncompress(<<?GZIP_COMPRESSION, Data/binary>>) ->
139 | zlib:gunzip(Data).
140 |
141 | encode_index_node(KVList, Method) ->
142 | TermData =
143 | case ?FILE_ENCODING of
144 | bisect ->
145 | Binary = vbisect:from_orddict(lists:map(fun binary_encode_kv/1, KVList)),
146 | CRC = erlang:crc32(Binary),
147 | [?BISECT_ENCODED, <<CRC:32/unsigned>>, Binary];
148 | hanoi2 ->
149 | [ ?TAG_END |
150 | lists:map(fun ({Key,Value}) ->
151 | crc_encapsulate_kv_entry(Key, Value)
152 | end,
153 | KVList) ]
154 | end,
155 | {MethodName, OutData} = compress(Method, TermData),
156 | {ok, [MethodName | OutData]}.
157 |
158 | decode_index_node(Level, Data) ->
159 | TermData = uncompress(Data),
160 | case decode_kv_list(TermData) of
161 | {ok, KVList} ->
162 | {ok, {node, Level, KVList}};
163 | {bisect, Binary} ->
164 | % io:format("[page level=~p~n", [Level]),
165 | % vbisect:foldl(fun(K,V,_) -> io:format(" ~p -> ~p,~n", [K,V]) end, 0, Binary),
166 | % io:format("]~n",[]),
167 | {ok, {node, Level, Binary}}
168 | end.
169 |
170 |
171 | binary_encode_kv({Key, {Value,infinity}}) ->
172 | binary_encode_kv({Key,Value});
173 | binary_encode_kv({Key, {?TOMBSTONE, TStamp}}) ->
174 | {Key, <<?TAG_DELETED2, TStamp:32/unsigned>>};
175 | binary_encode_kv({Key, ?TOMBSTONE}) ->
176 | {Key, <<?TAG_DELETED>>};
177 | binary_encode_kv({Key, {Value, TStamp}}) when is_binary(Value) ->
178 | {Key, <<?TAG_KV_DATA2, TStamp:32/unsigned, Value/binary>>};
179 | binary_encode_kv({Key, Value}) when is_binary(Value)->
180 | {Key, <<?TAG_KV_DATA, Value/binary>>};
181 | binary_encode_kv({Key, {Pos, Len}}) when Len < 16#ffffffff ->
182 | {Key, <<?TAG_POSLEN32, Pos:64/unsigned, Len:32/unsigned>>}.
183 |
184 |
185 | -spec crc_encapsulate_kv_entry(binary(), expvalue()) -> iolist().
186 | crc_encapsulate_kv_entry(Key, {Value, infinity}) ->
187 | crc_encapsulate_kv_entry(Key, Value);
188 | crc_encapsulate_kv_entry(Key, {?TOMBSTONE, TStamp}) ->
189 | crc_encapsulate( [?TAG_DELETED2, <<TStamp:32/unsigned>> | Key] );
190 | crc_encapsulate_kv_entry(Key, ?TOMBSTONE) ->
191 | crc_encapsulate( [?TAG_DELETED | Key] );
192 | crc_encapsulate_kv_entry(Key, {Value, TStamp}) when is_binary(Value) ->
193 | crc_encapsulate( [?TAG_KV_DATA2, <<TStamp:32/unsigned, (byte_size(Key)):32/unsigned>>, Key, Value] );
194 | crc_encapsulate_kv_entry(Key, Value) when is_binary(Value) ->
195 | crc_encapsulate( [?TAG_KV_DATA, <<(byte_size(Key)):32/unsigned>>, Key, Value] );
196 | crc_encapsulate_kv_entry(Key, {Pos,Len}) when Len < 16#ffffffff ->
197 | crc_encapsulate( [?TAG_POSLEN32, <<Pos:64/unsigned, Len:32/unsigned>>, Key] ).
198 |
199 | -spec crc_encapsulate_transaction( [ txspec() ], expiry() ) -> iolist().
200 | crc_encapsulate_transaction(TransactionSpec, Expiry) ->
201 | crc_encapsulate([?TAG_TRANSACT |
202 | lists:map(fun({delete, Key}) ->
203 | crc_encapsulate_kv_entry(Key, {?TOMBSTONE, Expiry});
204 | ({put, Key, Value}) ->
205 | crc_encapsulate_kv_entry(Key, {Value, Expiry})
206 | end,
207 | TransactionSpec)]).
208 |
209 | -spec crc_encapsulate( iolist() ) -> iolist().
210 | crc_encapsulate(Blob) ->
211 | CRC = erlang:crc32(Blob),
212 | Size = erlang:iolist_size(Blob),
213 | [<< (Size):32/unsigned, CRC:32/unsigned >>, Blob, ?TAG_END].
214 |
215 | -spec decode_kv_list( binary() ) -> {ok, [ kventry() ]} | {partial, [kventry()], iolist()}.
216 | decode_kv_list(<<?TAG_END, Custom/binary>>) ->
217 | decode_crc_data(Custom, [], []);
218 | decode_kv_list(<<?ERLANG_ENCODED, _/binary>>=TermData) ->
219 | {ok, erlang:binary_to_term(TermData)};
220 | decode_kv_list(<<?CRC_ENCODED, Custom/binary>>) ->
221 | decode_crc_data(Custom, [], []);
222 | decode_kv_list(<<?BISECT_ENCODED, CRC:32/unsigned, Binary/binary>>) ->
223 | CRCTest = erlang:crc32( Binary ),
224 | if CRC == CRCTest ->
225 | {bisect, Binary};
226 | true ->
227 | {bisect, vbisect:from_orddict([])}
228 | end.
229 |
230 | -spec decode_crc_data(binary(), list(), list()) -> {ok, [kventry()]} | {partial, [kventry()], iolist()}.
231 | decode_crc_data(<<>>, [], Acc) ->
232 | {ok, lists:reverse(Acc)};
233 | decode_crc_data(<<>>, BrokenData, Acc) ->
234 | {partial, lists:reverse(Acc), BrokenData};
235 | % TODO: we *could* simply return the good parts of the data...
236 | % would that be so wrong?
237 | decode_crc_data(<< BinSize:32/unsigned, CRC:32/unsigned, Bin:BinSize/binary, ?TAG_END, Rest/binary >>, Broken, Acc) ->
238 | CRCTest = erlang:crc32( Bin ),
239 | if CRC == CRCTest ->
240 | decode_crc_data(Rest, Broken, [decode_kv_data(Bin) | Acc]);
241 | true ->
242 | % TODO: chunk is broken, ignore it. Maybe we should tell someone?
243 | decode_crc_data(Rest, [Bin|Broken], Acc)
244 | end;
245 | decode_crc_data(Bad, Broken, Acc) ->
246 | %% If a chunk is broken, try to find the next ?TAG_END and
247 | %% start decoding from there.
248 | {Skipped, MaybeGood} = find_next_value(Bad),
249 | decode_crc_data(MaybeGood, [Skipped|Broken], Acc).
250 |
251 | -spec find_next_value(binary()) -> { binary(), binary() }.
252 | find_next_value(<<>>) ->
253 | {<<>>, <<>>};
254 | find_next_value(Bin) ->
255 | case binary:match(Bin, <<?TAG_END>>) of
256 | {Pos, _Len} ->
257 | <> = Bin,
258 | {SkipBin, MaybeGood};
259 | nomatch ->
260 | {Bin, <<>>}
261 | end.
262 |
263 | -spec decode_kv_data( binary() ) -> kventry().
264 | decode_kv_data(<<?TAG_KV_DATA, KLen:32/unsigned, Key:KLen/binary, Value/binary>>) ->
265 | {Key, Value};
266 | decode_kv_data(<<?TAG_DELETED, Key/binary>>) ->
267 | {Key, ?TOMBSTONE};
268 | decode_kv_data(<<?TAG_KV_DATA2, TStamp:32/unsigned, KLen:32/unsigned, Key:KLen/binary, Value/binary>>) ->
269 | {Key, {Value, TStamp}};
270 | decode_kv_data(<<?TAG_DELETED2, TStamp:32/unsigned, Key/binary>>) ->
271 | {Key, {?TOMBSTONE, TStamp}};
272 | decode_kv_data(<<?TAG_POSLEN32, Pos:64/unsigned, Len:32/unsigned, Key/binary>>) ->
273 | {Key, {Pos,Len}};
274 | decode_kv_data(<<?TAG_TRANSACT, Rest/binary>>) ->
275 | {ok, TX} = decode_crc_data(Rest, [], []),
276 | TX.
277 |
278 | %% @doc Return number of seconds since 1970
279 | -spec tstamp() -> pos_integer().
280 | tstamp() ->
281 | {Mega, Sec, _Micro} = os:timestamp(),
282 | (Mega * 1000000) + Sec.
283 |
284 | %% @doc Return time when values expire (i.e. Now + ExpirySecs), or 0.
285 | -spec expiry_time(pos_integer()) -> pos_integer().
286 | expiry_time(ExpirySecs) when ExpirySecs > 0 ->
287 | tstamp() + ExpirySecs.
288 |
289 | -spec has_expired(pos_integer() | infinity) -> boolean().
290 | has_expired(Expiration) when Expiration > 0 ->
291 | Expiration < tstamp();
292 | has_expired(infinity) ->
293 | false.
294 |
295 |
296 | ensure_expiry(Opts) ->
297 | case hanoidb:get_opt(expiry_secs, Opts) of
298 | undefined ->
299 | try exit(err)
300 | catch
301 | exit:err ->
302 | io:format(user, "~p~n", [erlang:get_stacktrace()])
303 | end,
304 | exit(expiry_secs_not_set);
305 | N when N >= 0 ->
306 | ok
307 | end.
308 |
309 | bloom_type({ebloom, _}) ->
310 | ebloom;
311 | bloom_type({sbloom, _}) ->
312 | sbloom.
313 |
314 | bloom_new(Size, sbloom) ->
315 | {ok, {sbloom, hanoidb_bloom:bloom(Size, 0.01)}};
316 | bloom_new(Size, ebloom) ->
317 | {ok, Bloom} = ebloom:new(Size, 0.01, Size),
318 | {ok, {ebloom, Bloom}}.
319 |
320 | bloom_to_bin({sbloom, Bloom}) ->
321 | hanoidb_bloom:encode(Bloom);
322 | bloom_to_bin({ebloom, Bloom}) ->
323 | ebloom:serialize(Bloom).
324 |
325 | bin_to_bloom(GZiped = <<16#1F, 16#8B, _/binary>>) ->
326 | bin_to_bloom(GZiped, sbloom);
327 | bin_to_bloom(TermBin = <<131, _/binary>>) ->
328 | {ok, {sbloom, erlang:binary_to_term(TermBin)}};
329 | bin_to_bloom(Blob) ->
330 | bin_to_bloom(Blob, ebloom).
331 |
332 | bin_to_bloom(Binary, sbloom) ->
333 | {ok, {sbloom, hanoidb_bloom:decode(Binary)}};
334 | bin_to_bloom(Binary, ebloom) ->
335 | {ok, Bloom} = ebloom:deserialize(Binary),
336 | {ok, {ebloom, Bloom}}.
337 |
338 | bloom_insert({sbloom, Bloom}, Key) ->
339 | {ok, {sbloom, hanoidb_bloom:add(Key, Bloom)}};
340 | bloom_insert({ebloom, Bloom}, Key) ->
341 | ok = ebloom:insert(Bloom, Key),
342 | {ok, {ebloom, Bloom}}.
343 |
344 | bloom_contains({sbloom, Bloom}, Key) ->
345 | hanoidb_bloom:member(Key, Bloom);
346 | bloom_contains({ebloom, Bloom}, Key) ->
347 | ebloom:contains(Bloom, Key).
348 |
349 |
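350 | %% Note on the CRC framing implemented by crc_encapsulate/1 above: each
351 | %% entry is laid out as <<Size:32/unsigned, CRC:32/unsigned>>, then the
352 | %% blob, then ?TAG_END, with the CRC covering the blob only. On a
353 | %% mismatch, decode_crc_data/3 scans forward to the next ?TAG_END, so a
354 | %% corrupt chunk costs its own entries but not the rest of the node.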
--------------------------------------------------------------------------------
/src/hanoidb_writer.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_writer).
26 | -author('Kresten Krab Thorup <krab@trifork.com>').
27 |
28 | -include("hanoidb.hrl").
29 |
30 | %%
31 | %% Streaming btree writer. Accepts only monotonically increasing keys for put.
32 | %%
33 |
34 | -define(NODE_SIZE, 8*1024).
35 |
36 | -behavior(gen_server).
37 |
38 | %% gen_server callbacks
39 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
40 | terminate/2, code_change/3, serialize/1, deserialize/1]).
41 |
42 | -export([open/1, open/2, add/3, count/1, close/1]).
43 |
44 | -record(node, {level :: integer(),
45 | members=[] :: [ {key(), expvalue()} ],
46 | size=0 :: integer()}).
47 |
48 | -record(state, {index_file :: file:io_device() | undefined,
49 | index_file_pos :: integer(),
50 |
51 | last_node_pos :: pos_integer(),
52 | last_node_size :: pos_integer(),
53 |
54 | nodes = [] :: list(#node{}),
55 |
56 | name :: string(),
57 |
58 | bloom :: {ebloom, term()} | {sbloom, term()},
59 | block_size = ?NODE_SIZE :: integer(),
60 | compress = none :: none | snappy | gzip | lz4,
61 | opts = [] :: list(any()),
62 |
63 | value_count = 0 :: integer(),
64 | tombstone_count = 0 :: integer()
65 | }).
66 |
67 |
68 | %%% PUBLIC API
69 |
70 | open(Name,Options) ->
71 | hanoidb_util:ensure_expiry(Options),
72 | gen_server:start_link(?MODULE, [Name, Options], []).
73 |
74 | open(Name) ->
75 | gen_server:start_link(?MODULE, [Name,[{expiry_secs,0}]], []).
76 |
77 | add(Ref, Key, Value) ->
78 | gen_server:cast(Ref, {add, Key, Value}).
79 |
80 | %% @doc Return number of KVs added to this writer so far
81 | count(Ref) ->
82 | gen_server:call(Ref, count, infinity).
83 |
84 | %% @doc Close the btree index file
85 | close(Ref) ->
86 | gen_server:call(Ref, close, infinity).
87 |
88 | %%%
89 |
90 | init([Name, Options]) ->
91 | hanoidb_util:ensure_expiry(Options),
92 | Size = proplists:get_value(size, Options, 2048),
93 |
94 | case do_open(Name, Options, [exclusive]) of
95 | {ok, IdxFile} ->
96 | ok = file:write(IdxFile, ?FILE_FORMAT),
97 | {ok, Bloom} = ?BLOOM_NEW(Size),
98 | BlockSize = hanoidb:get_opt(block_size, Options, ?NODE_SIZE),
99 | {ok, #state{ name=Name,
100 | index_file_pos=?FIRST_BLOCK_POS, index_file=IdxFile,
101 | bloom = Bloom,
102 | block_size = BlockSize,
103 | compress = hanoidb:get_opt(compress, Options, none),
104 | opts = Options
105 | }};
106 | {error, _}=Error ->
107 | error_logger:error_msg("hanoidb_writer cannot open ~p: ~p~n", [Name, Error]),
108 | {stop, Error}
109 | end.
110 |
111 |
112 | handle_cast({add, Key, {?TOMBSTONE, TStamp}}, State)
113 | when is_binary(Key) ->
114 | NewState =
115 | case hanoidb_util:has_expired(TStamp) of
116 | true ->
117 | State;
118 | false ->
119 | {ok, State2} = append_node(0, Key, {?TOMBSTONE, TStamp}, State),
120 | State2
121 | end,
122 | {noreply, NewState};
123 | handle_cast({add, Key, ?TOMBSTONE}, State)
124 | when is_binary(Key) ->
125 | {ok, NewState} = append_node(0, Key, ?TOMBSTONE, State),
126 | {noreply, NewState};
127 | handle_cast({add, Key, {Value, TStamp}}, State)
128 | when is_binary(Key), is_binary(Value) ->
129 | NewState =
130 | case hanoidb_util:has_expired(TStamp) of
131 | true ->
132 | State;
133 | false ->
134 | {ok, State2} = append_node(0, Key, {Value, TStamp}, State),
135 | State2
136 | end,
137 | {noreply, NewState};
138 | handle_cast({add, Key, Value}, State)
139 | when is_binary(Key), is_binary(Value) ->
140 | {ok, State2} = append_node(0, Key, Value, State),
141 | {noreply, State2}.
142 |
143 | handle_call(count, _From, State = #state{ value_count=VC, tombstone_count=TC }) ->
144 | {reply, VC+TC, State};
145 | handle_call(close, _From, State) ->
146 | {ok, State2} = archive_nodes(State),
147 | {stop, normal, ok, State2}.
148 |
149 | handle_info(Info, State) ->
150 | error_logger:error_msg("Unknown info ~p~n", [Info]),
151 | {stop, bad_msg, State}.
152 |
153 | terminate(normal,_State) ->
154 | ok;
155 | terminate(_Reason, State) ->
156 | %% premature delete -> cleanup
157 | _ignore = file:close(State#state.index_file),
158 | file:delete(hanoidb_util:index_file_name(State#state.name)).
159 |
160 | code_change(_OldVsn, State, _Extra) ->
161 | {ok, State}.
162 |
163 |
164 | %% INTERNAL FUNCTIONS
165 | serialize(#state{ bloom=Bloom, index_file=File, index_file_pos=Position }=State) ->
166 | case file:position(File, {eof, 0}) of
167 | {ok, Position} ->
168 | ok;
169 | {ok, WrongPosition} ->
170 | exit({bad_position, Position, WrongPosition})
171 | end,
172 | ok = file:close(File),
173 | erlang:term_to_binary( { State#state{ index_file=undefined, bloom=undefined }, ?BLOOM_TO_BIN(Bloom), hanoidb_util:bloom_type(Bloom) } ).
174 |
175 | deserialize(Binary) ->
176 | {State, Bin, Type} = erlang:binary_to_term(Binary),
177 | {ok, Bloom} = ?BIN_TO_BLOOM(Bin, Type),
178 | {ok, IdxFile} = do_open(State#state.name, State#state.opts, []),
179 | State#state{ bloom=Bloom, index_file=IdxFile }.
180 |
181 |
182 | do_open(Name, Options, OpenOpts) ->
183 | WriteBufferSize = hanoidb:get_opt(write_buffer_size, Options, 512 * 1024),
184 | file:open(hanoidb_util:index_file_name(Name),
185 | [raw, append, {delayed_write, WriteBufferSize, 2000} | OpenOpts]).
186 |
187 |
188 | %% @doc flush pending nodes and write trailer
189 | archive_nodes(#state{ nodes=[], last_node_pos=LastNodePos, last_node_size=_LastNodeSize, bloom=Bloom, index_file=IdxFile }=State) ->
190 |
191 | BloomBin = ?BLOOM_TO_BIN(Bloom),
192 | true = is_binary(BloomBin),
193 | BloomSize = byte_size(BloomBin),
194 | RootPos =
195 | case LastNodePos of
196 | undefined ->
197 | %% store contains no entries
198 | ok = file:write(IdxFile, <<0:32/unsigned, 0:16/unsigned>>),
199 | ?FIRST_BLOCK_POS;
200 | _ ->
201 | LastNodePos
202 | end,
203 | Trailer = [ << 0:32/unsigned>> , BloomBin, << BloomSize:32/unsigned, RootPos:64/unsigned >> ],
204 |
205 | ok = file:write(IdxFile, Trailer),
206 | ok = file:datasync(IdxFile),
207 | ok = file:close(IdxFile),
208 | {ok, State#state{ index_file=undefined, index_file_pos=undefined, bloom=undefined }};
209 |
210 | archive_nodes(State=#state{ nodes=[#node{level=N, members=[{_,{Pos,_Len}}]}], last_node_pos=Pos })
211 | when N > 0 ->
212 | %% Ignore this node, its stack consists of one node with one {pos,len} member
213 | archive_nodes(State#state{ nodes=[] });
214 |
215 | archive_nodes(State) ->
216 | {ok, State2} = flush_node_buffer(State),
217 | archive_nodes(State2).
218 |
219 |
220 | append_node(Level, Key, Value, State=#state{ nodes=[] }) ->
221 | append_node(Level, Key, Value, State#state{ nodes=[ #node{ level=Level } ] });
222 | append_node(Level, Key, Value, State=#state{ nodes=[ #node{level=Level2 } |_]=Stack })
223 | when Level < Level2 ->
224 | append_node(Level, Key, Value, State#state{ nodes=[ #node{ level=(Level2 - 1) } | Stack] });
225 | append_node(Level, Key, Value, #state{ nodes=[ #node{level=Level, members=List, size=NodeSize}=CurrNode | RestNodes ], value_count=VC, tombstone_count=TC, bloom=Bloom }=State)
226 | when Bloom /= undefined ->
227 | %% The top-of-stack node is at the level we wish to insert at.
228 |
229 | %% Assert that keys are increasing:
230 | case List of
231 | [] ->
232 | ok;
233 | [{PrevKey,_}|_] ->
234 | if
235 | (Key >= PrevKey) -> ok;
236 | true ->
237 | error_logger:error_msg("keys not ascending ~p < ~p~n", [PrevKey, Key]),
238 | exit({badarg, Key})
239 | end
240 | end,
241 | NewSize = NodeSize + hanoidb_util:estimate_node_size_increment(List, Key, Value),
242 |
243 | {ok,Bloom2} = case Level of
244 | 0 ->
245 | ?BLOOM_INSERT(Bloom, Key);
246 | _ ->
247 | {ok,Bloom}
248 | end,
249 |
250 | {TC1, VC1} =
251 | case Level of
252 | 0 ->
253 | case Value of
254 | ?TOMBSTONE ->
255 | {TC+1, VC};
256 | {?TOMBSTONE, _} -> %% Matched when this Value can expire
257 | {TC+1, VC};
258 | _ ->
259 | {TC, VC+1}
260 | end;
261 | _ ->
262 | {TC, VC}
263 | end,
264 |
265 | NodeMembers = [{Key, Value} | List],
266 | State2 = State#state{ nodes=[CurrNode#node{members=NodeMembers, size=NewSize} | RestNodes],
267 | value_count=VC1, tombstone_count=TC1, bloom=Bloom2 },
268 |
269 | case NewSize >= State#state.block_size of
270 | true ->
271 | flush_node_buffer(State2);
272 | false ->
273 | {ok, State2}
274 | end.
275 |
276 | flush_node_buffer(#state{nodes=[#node{ level=Level, members=NodeMembers }|RestNodes], compress=Compress, index_file_pos=NodePos } = State) ->
277 |
278 | OrderedMembers = lists:reverse(NodeMembers),
279 | {ok, BlockData} = hanoidb_util:encode_index_node(OrderedMembers, Compress),
280 |
281 | BlockSize = erlang:iolist_size(BlockData),
282 | Data = [ <<(BlockSize+2):32/unsigned, Level:16/unsigned>> | BlockData ],
283 | DataSize = BlockSize + 6,
284 |
285 | ok = file:write(State#state.index_file, Data),
286 |
287 | {FirstKey, _} = hd(OrderedMembers),
288 | append_node(Level + 1, FirstKey, {NodePos, DataSize},
289 | State#state{ nodes = RestNodes,
290 | index_file_pos = NodePos + DataSize,
291 | last_node_pos = NodePos,
292 | last_node_size = DataSize }).
293 |
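294 | %% Note on the resulting file layout, as implied by init/1 and
295 | %% archive_nodes/1 (a sketch): the ?FILE_FORMAT header comes first, then
296 | %% a sequence of <<Len:32/unsigned, Level:16/unsigned>>-prefixed blocks
297 | %% written bottom-up, then a trailer of <<0:32/unsigned>>, the serialized
298 | %% bloom filter, and finally <<BloomSize:32/unsigned, RootPos:64/unsigned>>,
299 | %% so a reader can locate both bloom and root node from the file end.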
--------------------------------------------------------------------------------
/src/plain_rpc.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% plain_rpc: RPC module to accompany plain_fsm
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% This file is provided to you under the Apache License, Version 2.0 (the
9 | %% "License"); you may not use this file except in compliance with the License.
10 | %% You may obtain a copy of the License at
11 | %%
12 | %% http://www.apache.org/licenses/LICENSE-2.0
13 | %%
14 | %% Unless required by applicable law or agreed to in writing, software
15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
17 | %% License for the specific language governing permissions and limitations
18 | %% under the License.
19 | %%
20 | %% ----------------------------------------------------------------------------
21 |
22 | -module(plain_rpc).
23 | -author('Kresten Krab Thorup <krab@trifork.com>').
24 |
25 | -export([send_call/2, receive_reply/1, send_reply/2, call/2, call/3, cast/2]).
26 |
27 | -include("include/plain_rpc.hrl").
28 |
29 |
30 | send_call(PID, Request) ->
31 | Ref = erlang:monitor(process, PID),
32 | PID ! ?CALL({self(), Ref}, Request),
33 | Ref.
34 |
35 | cast(PID, Msg) ->
36 | PID ! ?CAST(self(), Msg).
37 |
38 | receive_reply(MRef) ->
39 | receive
40 | ?REPLY(MRef, Reply) ->
41 | erlang:demonitor(MRef, [flush]),
42 | Reply;
43 | {'DOWN', MRef, _, _, Reason} ->
44 | exit(Reason)
45 | end.
46 |
47 | send_reply({PID,Ref}, Reply) ->
48 | _ = erlang:send(PID, ?REPLY(Ref, Reply)),
49 | ok.
50 |
51 | call(PID,Request) ->
52 | call(PID, Request, infinity).
53 |
54 | call(PID,Request,Timeout) ->
55 | MRef = erlang:monitor(process, PID),
56 | PID ! ?CALL({self(), MRef}, Request),
57 | receive
58 | ?REPLY(MRef, Reply) ->
59 | erlang:demonitor(MRef, [flush]),
60 | Reply;
61 | {'DOWN', MRef, _, _, Reason} ->
62 | exit(Reason)
63 | after Timeout ->
64 | erlang:demonitor(MRef, [flush]),
65 | exit({rpc_timeout, Request})
66 | end.
67 |
68 |
69 |
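70 | %% Note -- typical usage, sketched from the API above: the callee
71 | %% receives ?CALL(From, Request) and answers with send_reply(From, Reply);
72 | %% a caller either blocks via call/2,3 or pipelines with
73 | %%   MRef = plain_rpc:send_call(Pid, Req), ... plain_rpc:receive_reply(MRef).
74 | %% The monitor turns a dead callee into an exit of the caller.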
--------------------------------------------------------------------------------
/src/vbisect.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2014 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% This file is provided to you under the Apache License, Version 2.0 (the
9 | %% "License"); you may not use this file except in compliance with the License.
10 | %% You may obtain a copy of the License at
11 | %%
12 | %% http://www.apache.org/licenses/LICENSE-2.0
13 | %%
14 | %% Unless required by applicable law or agreed to in writing, software
15 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
16 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
17 | %% License for the specific language governing permissions and limitations
18 | %% under the License.
19 | %%
20 | %% ----------------------------------------------------------------------------
21 |
22 |
23 | -module(vbisect).
24 |
25 | -export([from_orddict/1,
26 | from_gb_tree/1,
27 | to_gb_tree/1,
28 | first_key/1,
29 | find/2, find_geq/2,
30 | foldl/3, foldr/3, fold_until_stop/3,
31 | to_orddict/1,
32 | merge/3]).
33 |
34 | -define(MAGIC, "vbis").
35 | -type key() :: binary().
36 | -type value() :: binary().
37 | -type bindict() :: binary().
38 |
39 | -ifdef(TEST).
40 | -include_lib("eunit/include/eunit.hrl").
41 | -endif.
42 |
43 | -spec from_gb_tree(gb_trees:tree()) -> bindict().
44 | from_gb_tree({Count,Node}) when Count =< 16#ffffffff ->
45 | {_BinSize,IOList} = encode_gb_node(Node),
46 | erlang:iolist_to_binary([ <<?MAGIC, Count:32/unsigned>> | IOList ]).
47 |
48 | encode_gb_node({Key, Value, Smaller, Bigger}) when is_binary(Key), is_binary(Value) ->
49 | {BinSizeSmaller, IOSmaller} = encode_gb_node(Smaller),
50 | {BinSizeBigger, IOBigger} = encode_gb_node(Bigger),
51 |
52 | KeySize = byte_size(Key),
53 | ValueSize = byte_size(Value),
54 | { 2 + KeySize
55 | + 4 + ValueSize
56 | + 4 + BinSizeSmaller
57 | + BinSizeBigger,
58 |
59 | [ << KeySize:16, Key/binary,
60 | BinSizeSmaller:32 >>, IOSmaller,
61 | << ValueSize:32, Value/binary >> | IOBigger ] };
62 |
63 | encode_gb_node(nil) ->
64 | { 0, [] }.
65 |
66 | to_gb_tree(<<?MAGIC, Count:32/unsigned, Nodes/binary>>) ->
67 | { Count, to_gb_node(Nodes) }.
68 |
69 | to_gb_node( <<>> ) ->
70 | nil;
71 |
72 | to_gb_node( << KeySize:16, Key:KeySize/binary,
73 | BinSizeSmaller:32, Smaller:BinSizeSmaller/binary,
74 | ValueSize:32, Value:ValueSize/binary,
75 | Bigger/binary >> ) ->
76 | {Key, Value,
77 | to_gb_node(Smaller),
78 | to_gb_node(Bigger)}.
79 |
80 | -spec find(Key::key(), Dict::bindict()) ->
81 | { ok, value() } | error.
82 | find(Key, <<?MAGIC, _:32/unsigned, Binary/binary>>) ->
83 | find_node(byte_size(Key), Key, Binary).
84 |
85 | find_node(KeySize, Key, <<HereKeySize:16, HereKey:HereKeySize/binary,
86 | BinSizeSmaller:32, _:BinSizeSmaller/binary,
87 | ValueSize:32, Value:ValueSize/binary,
88 | _/binary>> = Bin) ->
89 | if
90 | Key < HereKey ->
91 | Skip = 6 + HereKeySize,
92 | << _:Skip/binary, Smaller:BinSizeSmaller/binary, _/binary>> = Bin,
93 | find_node(KeySize, Key, Smaller);
94 | HereKey < Key ->
95 | Skip = 10 + HereKeySize + BinSizeSmaller + ValueSize,
96 | << _:Skip/binary, Bigger/binary>> = Bin,
97 | find_node(KeySize, Key, Bigger);
98 | true ->
99 | {ok, Value}
100 | end;
101 |
102 | find_node(_, _, <<>>) ->
103 | error.
104 |
105 | to_orddict(BinDict) ->
106 | foldr(fun(Key,Value,Acc) ->
107 | [{Key,Value}|Acc]
108 | end,
109 | [],
110 | BinDict).
111 |
112 | merge(Fun, BinDict1, BinDict2) ->
113 | OD1 = to_orddict(BinDict1),
114 | OD2 = to_orddict(BinDict2),
115 | OD3 = orddict:merge(Fun, OD1, OD2),
116 | from_orddict(OD3).
117 |
118 | -spec first_key( bindict() ) -> binary() | none.
119 | first_key(BinDict) ->
120 | {_, Key} = fold_until_stop(fun({K,_},_) -> {stop, K} end, none, BinDict),
121 | Key.
122 |
123 | %% @doc Find largest {K,V} where K is smaller than or equal to key.
124 | %% This is good for an inner node where key is the smallest key
125 | %% in the child node.
126 |
127 | -spec find_geq(Key::binary(), Binary::binary()) ->
128 | none | {ok, Key::key(), Value::value()}.
129 |
130 | find_geq(Key, <<?MAGIC, _:32/unsigned, Binary/binary>>) ->
131 | find_geq_node(byte_size(Key), Key, Binary, none).
132 |
133 | find_geq_node(_, _, <<>>, Else) ->
134 | Else;
135 |
136 | find_geq_node(KeySize, Key, <<HereKeySize:16, HereKey:HereKeySize/binary,
137 | BinSizeSmaller:32, _:BinSizeSmaller/binary,
138 | ValueSize:32, Value:ValueSize/binary,
139 | _/binary>> = Bin, Else) ->
140 | if
141 | Key < HereKey ->
142 | Skip = 6 + HereKeySize,
143 | << _:Skip/binary, Smaller:BinSizeSmaller/binary, _/binary>> = Bin,
144 | find_geq_node(KeySize, Key, Smaller, Else);
145 | HereKey < Key ->
146 | Skip = 10 + HereKeySize + BinSizeSmaller + ValueSize,
147 | << _:Skip/binary, Bigger/binary>> = Bin,
148 | find_geq_node(KeySize, Key, Bigger, {ok, HereKey, Value});
149 | true ->
150 | {ok, HereKey, Value}
151 | end.
152 |
153 | -spec foldl(fun((Key::key(), Value::value(), Acc::term()) -> term()), term(), bindict()) ->
154 | term().
155 | foldl(Fun, Acc, <<?MAGIC, _:32/unsigned, Binary/binary>>) ->
156 | foldl_node(Fun, Acc, Binary).
157 |
158 | foldl_node(_Fun, Acc, <<>>) ->
159 | Acc;
160 |
161 | foldl_node(Fun, Acc, <<KeySize:16, Key:KeySize/binary,
162 | BinSizeSmaller:32, Smaller:BinSizeSmaller/binary,
163 | ValueSize:32, Value:ValueSize/binary,
164 | Bigger/binary>>) ->
165 | Acc1 = foldl_node(Fun, Acc, Smaller),
166 | Acc2 = Fun(Key, Value, Acc1),
167 | foldl_node(Fun, Acc2, Bigger).
168 |
169 |
170 | -spec fold_until_stop(function(), term(), bindict()) -> {stopped, term()} | {ok, term()}.
171 |
172 | fold_until_stop(Fun, Acc, <<?MAGIC, _:32/unsigned, Bin/binary>>) ->
173 | fold_until_stop2(Fun, {continue, Acc}, Bin).
174 |
175 | fold_until_stop2(_Fun,{stop,Result},_) ->
176 | {stopped, Result};
177 | fold_until_stop2(_Fun,{continue, Acc},<<>>) ->
178 | {ok, Acc};
179 | fold_until_stop2(Fun,{continue, Acc}, <<KeySize:16, Key:KeySize/binary,
180 | BinSizeSmaller:32, Smaller:BinSizeSmaller/binary,
181 | ValueSize:32, Value:ValueSize/binary,
182 | Bigger/binary>>) ->
183 |
184 | case fold_until_stop2(Fun, {continue, Acc}, Smaller) of
185 | {stopped, Result} ->
186 | {stopped, Result};
187 | {ok, Acc1} ->
188 | ContinueOrStopAcc = Fun({Key,Value}, Acc1),
189 | fold_until_stop2(Fun, ContinueOrStopAcc, Bigger)
190 | end.
191 |
192 |
193 | -spec foldr(fun((Key::key(), Value::value(), Acc::term()) -> term()), term(), bindict()) ->
194 | term().
195 | foldr(Fun, Acc, <<?MAGIC, _:32/unsigned, Binary/binary>>) ->
196 | foldr_node(Fun, Acc, Binary).
197 |
198 | foldr_node(_Fun, Acc, <<>>) ->
199 | Acc;
200 |
201 | foldr_node(Fun, Acc, <<KeySize:16, Key:KeySize/binary,
202 | BinSizeSmaller:32, Smaller:BinSizeSmaller/binary,
203 | ValueSize:32, Value:ValueSize/binary,
204 | Bigger/binary>>) ->
205 | Acc1 = foldr_node(Fun, Acc, Bigger),
206 | Acc2 = Fun(Key, Value, Acc1),
207 | foldr_node(Fun, Acc2, Smaller).
208 |
209 |
210 | from_orddict(OrdDict) ->
211 | from_gb_tree(gb_trees:from_orddict(OrdDict)).
212 |
213 | -ifdef(TEST).
214 |
215 | speed_test_() ->
216 | {timeout, 600,
217 | fun() ->
218 | Start = 100000000000000,
219 | N = 100000,
220 | Keys = lists:seq(Start, Start+N),
221 | KeyValuePairs = lists:map(fun (I) -> {<<I:64/integer>>, <<255:8/integer>>} end,
222 | Keys),
223 |
224 | %% Will mostly be unique, if N is bigger than 10000
225 | ReadKeys = [<<(lists:nth(random:uniform(N), Keys)):64/integer>> || _ <- lists:seq(1, 1000)],
226 | B = from_orddict(KeyValuePairs),
227 | time_reads(B, N, ReadKeys)
228 | end}.
229 |
230 |
231 | geq_test() ->
232 | B = from_orddict([{<<2>>,<<2>>},{<<4>>,<<4>>},{<<6>>,<<6>>},{<<122>>,<<122>>}]),
233 | none = find_geq(<<1>>, B),
234 | {ok, <<2>>, <<2>>} = find_geq(<<2>>, B),
235 | {ok, <<2>>, <<2>>} = find_geq(<<3>>, B),
236 | {ok, <<4>>, <<4>>} = find_geq(<<5>>, B),
237 | {ok, <<6>>, <<6>>} = find_geq(<<100>>, B),
238 | {ok, <<122>>, <<122>>} = find_geq(<<150>>, B),
239 | true.
240 |
241 |
242 | time_reads(B, Size, ReadKeys) ->
243 | Parent = self(),
244 | spawn(
245 | fun() ->
246 | Runs = 20,
247 | Timings =
248 | lists:map(
249 | fun (_) ->
250 | StartTime = now(),
251 | find_many(B, ReadKeys),
252 | timer:now_diff(now(), StartTime)
253 | end, lists:seq(1, Runs)),
254 |
255 | Rps = 1000000 / ((lists:sum(Timings) / length(Timings)) / 1000),
256 | error_logger:info_msg("Average over ~p runs, ~p keys in dict~n"
257 | "Average fetch ~p keys: ~p us, max: ~p us~n"
258 | "Average fetch 1 key: ~p us~n"
259 | "Theoretical sequential RPS: ~w~n",
260 | [Runs, Size, length(ReadKeys),
261 | lists:sum(Timings) / length(Timings),
262 | lists:max(Timings),
263 | (lists:sum(Timings) / length(Timings)) / length(ReadKeys),
264 | trunc(Rps)]),
265 |
266 | Parent ! done
267 | end),
268 | receive done -> ok after 1000 -> ok end.
269 |
270 | -spec find_many(bindict(), [key()]) -> non_neg_integer().
271 | find_many(B, Keys) ->
272 | lists:foldl(fun (K, N) ->
273 | case find(K, B) of
274 | {ok, _} -> N+1;
275 | error -> N
276 | end
277 | end,
278 | 0, Keys).
279 |
280 | -endif.
281 |
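282 | %% Note -- a small usage sketch mirroring geq_test/0 above (the keys and
283 | %% values here are made up for illustration):
284 | %%   B = vbisect:from_orddict([{<<"a">>,<<"1">>}, {<<"b">>,<<"2">>}]),
285 | %%   {ok, <<"1">>} = vbisect:find(<<"a">>, B),
286 | %%   {ok, <<"a">>, <<"1">>} = vbisect:find_geq(<<"aa">>, B).
287 | %% The whole dictionary is one immutable binary, so lookups only create
288 | %% sub-binary references and never copy the tree.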
--------------------------------------------------------------------------------
/test/hanoidb_drv.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | %% @doc Drive a set of LSM BTrees
26 | -module(hanoidb_drv).
27 |
28 | -behaviour(gen_server).
29 |
30 | %% API
31 | -export([start_link/0]).
32 |
33 | -export([
34 | delete_exist/2,
35 | get_exist/2,
36 | get_fail/2,
37 | open/1, close/1,
38 | put/3,
39 | fold_range/4,
40 | stop/0]).
41 |
42 | %% gen_server callbacks
43 | -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
44 | terminate/2, code_change/3]).
45 |
46 | -define(SERVER, ?MODULE).
47 |
48 | -record(state, { btrees = dict:new() % Map from a name to its tree
49 | }).
50 |
51 | %%%===================================================================
52 |
53 | start_link() ->
54 | gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).
55 |
56 | call(X) ->
57 | gen_server:call(?SERVER, X, infinity).
58 |
59 | get_exist(N, K) ->
60 | call({get, N, K}).
61 |
62 | get_fail(N, K) ->
63 | call({get, N, K}).
64 |
65 | delete_exist(N, K) ->
66 | call({delete_exist, N, K}).
67 |
68 | open(N) ->
69 | call({open, N}).
70 |
71 | close(N) ->
72 | call({close, N}).
73 |
74 | put(N, K, V) ->
75 | call({put, N, K, V}).
76 |
77 | fold_range(T, Fun, Acc0, Range) ->
78 | call({fold_range, T, Fun, Acc0, Range}).
79 |
80 | stop() ->
81 | call(stop).
82 |
83 | %%%===================================================================
84 |
85 | init([]) ->
86 | {ok, #state{}}.
87 |
88 | handle_call({open, N}, _, #state { btrees = D} = State) ->
89 | case hanoidb:open(N) of
90 | {ok, Tree} ->
91 | {reply, ok, State#state { btrees = dict:store(N, Tree, D)}};
92 | Otherwise ->
93 | {reply, {error, Otherwise}, State}
94 | end;
95 | handle_call({close, N}, _, #state { btrees = D} = State) ->
96 | Tree = dict:fetch(N, D),
97 | case hanoidb:close(Tree) of
98 | ok ->
99 | {reply, ok, State#state { btrees = dict:erase(N, D)}};
100 | Otherwise ->
101 | {reply, {error, Otherwise}, State}
102 | end;
103 | handle_call({fold_range, Name, Fun, Acc0, Range},
104 | _From,
105 | #state { btrees = D } = State) ->
106 | Tree = dict:fetch(Name, D),
107 | Result = hanoidb:fold_range(Tree, Fun, Acc0, Range),
108 | {reply, Result, State};
109 | handle_call({put, N, K, V}, _, #state { btrees = D} = State) ->
110 | Tree = dict:fetch(N, D),
111 | case hanoidb:put(Tree, K, V) of
112 | ok ->
113 | {reply, ok, State};
114 | Other ->
115 | {reply, {error, Other}, State}
116 | end;
117 | handle_call({delete_exist, N, K}, _, #state { btrees = D} = State) ->
118 | Tree = dict:fetch(N, D),
119 | Reply = hanoidb:delete(Tree, K),
120 | {reply, Reply, State};
121 | handle_call({get, N, K}, _, #state { btrees = D} = State) ->
122 | Tree = dict:fetch(N, D),
123 | Reply = hanoidb:get(Tree, K),
124 | {reply, Reply, State};
125 | handle_call(stop, _, #state{ btrees = D } = State ) ->
126 | [ hanoidb:close(Tree) || {_,Tree} <- dict:to_list(D) ],
127 | {stop, normal, ok, State};
128 | handle_call(_Request, _From, State) ->
129 | Reply = ok,
130 | {reply, Reply, State}.
131 |
132 | handle_cast(_Msg, State) ->
133 | {noreply, State}.
134 |
135 | handle_info(_Info, State) ->
136 | {noreply, State}.
137 |
138 | terminate(_Reason, _State) ->
139 | ok.
140 |
141 | code_change(_OldVsn, State, _Extra) ->
142 | {ok, State}.
143 |
144 |
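145 | %% Note: get_exist/2 and get_fail/2 deliberately issue the same
146 | %% {get, N, K} call; they exist as separate commands only so the statem
147 | %% model in hanoidb_tests can attach different postconditions to hits
148 | %% and misses.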
--------------------------------------------------------------------------------
/test/hanoidb_merger_tests.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_merger_tests).
26 |
27 | -ifdef(TEST).
28 | -include_lib("eunit/include/eunit.hrl").
29 | -endif.
30 |
31 | -compile(export_all).
32 |
33 | merge_test() ->
34 |
35 | file:delete("test1"),
36 | file:delete("test2"),
37 | file:delete("test3"),
38 |
39 | {ok, BT1} = hanoidb_writer:open("test1", [{expiry_secs, 0}]),
40 | lists:foldl(fun(N,_) ->
41 | ok = hanoidb_writer:add(BT1, <>, <<"data",N:128>>)
42 | end,
43 | ok,
44 | lists:seq(1,10000,2)),
45 | ok = hanoidb_writer:close(BT1),
46 |
47 |
48 | {ok, BT2} = hanoidb_writer:open("test2", [{expiry_secs, 0}]),
49 | lists:foldl(fun(N,_) ->
50 | ok = hanoidb_writer:add(BT2, <>, <<"data",N:128>>)
51 | end,
52 | ok,
53 | lists:seq(2,5001,1)),
54 | ok = hanoidb_writer:close(BT2),
55 |
56 |
57 | self() ! {step, {self(), none}, 2000000000},
58 | {Time,{ok,Count}} = timer:tc(hanoidb_merger, merge, ["test1", "test2", "test3", 10000, true, [{expiry_secs, 0}]]),
59 |
60 | % error_logger:info_msg("time to merge: ~p/sec (time=~p, count=~p)~n", [1000000/(Time/Count), Time/1000000, Count]),
61 |
62 | ok.
63 |
64 |
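65 | %% Note: the {step, ...} message sent to self() above appears to prime
66 | %% the merger's work-quota protocol, letting the synchronous merge run to
67 | %% completion without a supervising level process driving it.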
--------------------------------------------------------------------------------
/test/hanoidb_tests.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_tests).
26 |
27 | -include("include/hanoidb.hrl").
28 | -include("src/hanoidb.hrl").
29 |
30 | -ifdef(TEST).
31 | -ifdef(TRIQ).
32 | -include_lib("triq/include/triq.hrl").
33 | -include_lib("triq/include/triq_statem.hrl").
34 | -else.
35 | -include_lib("proper/include/proper.hrl").
36 | -endif.
37 | -include_lib("eunit/include/eunit.hrl").
38 | -endif.
39 |
40 | -ifdef(PROPER).
41 | -behaviour(proper_statem).
42 | -endif.
43 |
44 | -compile(export_all).
45 |
46 | -export([command/1, initial_state/0,
47 | next_state/3, postcondition/3,
48 | precondition/2]).
49 |
50 | -ifdef(pre18).
51 | -define(OTP_DICT, dict()).
52 | -else.
53 | -define(OTP_DICT, dict:dict()).
54 | -endif.
55 |
56 | -record(tree, { elements = dict:new() :: ?OTP_DICT }).
57 | -record(state, { open = dict:new() :: ?OTP_DICT,
58 | closed = dict:new() :: ?OTP_DICT}).
59 | -define(SERVER, hanoidb_drv).
60 |
61 | full_test_() ->
62 | {setup, spawn, fun () -> ok end, fun (_) -> ok end,
63 | [
64 | ?_test(test_tree_simple_1()),
65 | ?_test(test_tree_simple_2()),
66 | ?_test(test_tree_simple_4()),
67 | ?_test(test_tree_simple_5())
68 | ]}.
69 |
70 | longer_tree_test_() ->
71 | {setup,
72 | spawn,
73 | fun () -> ok end,
74 | fun (_) -> ok end,
75 | [
76 | {timeout, 300, ?_test(test_tree())}
77 | ]}.
78 |
79 | longer_qc_test_() ->
80 | {setup,
81 | spawn,
82 | fun () -> ok end,
83 | fun (_) -> ok end,
84 | [
85 | {timeout, 120, ?_test(test_qc())}
86 | ]}.
87 |
88 | -ifdef(TRIQ).
89 | test_qc() ->
90 | [?assertEqual(true, triq:module(?MODULE))].
91 | -else.
92 | qc_opts() -> [{numtests, 800}].
93 | test_qc() ->
94 | [?assertEqual([], proper:module(?MODULE, qc_opts()))].
95 | -endif.
96 |
97 | %% Generators
98 | %% ----------------------------------------------------------------------
99 |
100 | -define(NUM_TREES, 10).
101 |
102 | %% Generate a name for a btree
103 | g_btree_name() ->
104 | ?LET(I, choose(1,?NUM_TREES),
105 | btree_name(I)).
106 |
107 | %% Generate a key for the Tree
108 | g_key() ->
109 | binary().
110 |
111 | %% Generate a value for the Tree
112 | g_value() ->
113 | binary().
114 |
115 | g_fail_key() ->
116 | ?LET(T, choose(1,999999999999),
117 | term_to_binary(T)).
118 |
119 | g_open_tree(Open) ->
120 | oneof(dict:fetch_keys(Open)).
121 |
122 | %% Pick a name of a non-empty Btree
123 | g_non_empty_btree(Open) ->
124 | ?LET(TreesWithKeys, dict:filter(fun(_K, #tree { elements = D}) ->
125 | dict:size(D) > 0
126 | end,
127 | Open),
128 | oneof(dict:fetch_keys(TreesWithKeys))).
129 |
130 | g_existing_key(Name, Open) ->
131 | #tree { elements = Elems } = dict:fetch(Name, Open),
132 | oneof(dict:fetch_keys(Elems)).
133 |
134 | g_non_existing_key(Name, Open) ->
135 | ?SUCHTHAT(Key, g_fail_key(),
136 | begin
137 | #tree { elements = D } = dict:fetch(Name, Open),
138 | not dict:is_key(Key, D)
139 | end).
140 |
141 | g_fold_operation() ->
142 | oneof([{fun (K, V, Acc) -> [{K, V} | Acc] end, []}]).
143 |
144 | btree_name(I) ->
145 | "Btree_" ++ integer_to_list(I).
146 |
147 | %% Statem test
148 | %% ----------------------------------------------------------------------
149 | initial_state() ->
150 | ClosedBTrees = lists:foldl(fun(N, Closed) ->
151 | dict:store(btree_name(N),
152 | #tree { },
153 | Closed)
154 | end,
155 | dict:new(),
156 | lists:seq(1,?NUM_TREES)),
157 | #state { closed=ClosedBTrees }.
158 |
159 |
160 | command(#state { open = Open, closed = Closed } = S) ->
161 | frequency(
162 | [ {20, {call, ?SERVER, open, [oneof(dict:fetch_keys(Closed))]}}
163 | || closed_dicts(S)]
164 | ++ [ {20, {call, ?SERVER, close, [oneof(dict:fetch_keys(Open))]}}
165 | || open_dicts(S)]
166 | ++ [ {2000, {call, ?SERVER, put, cmd_put_args(S)}}
167 | || open_dicts(S)]
168 | ++ [ {1500, {call, ?SERVER, get_fail, cmd_get_fail_args(S)}}
169 | || open_dicts(S)]
170 | ++ [ {1500, {call, ?SERVER, get_exist, cmd_get_args(S)}}
171 | || open_dicts(S), open_dicts_with_keys(S)]
172 | ++ [ {500, {call, ?SERVER, delete_exist, cmd_delete_args(S)}}
173 | || open_dicts(S), open_dicts_with_keys(S)]
174 | ++ [ {125, {call, ?SERVER, fold_range, cmd_sync_fold_range_args(S)}}
175 | || open_dicts(S), open_dicts_with_keys(S)]
176 | ).
177 |
178 | %% Precondition (abstract)
179 | precondition(S, {call, ?SERVER, fold_range, [_Tree, _F, _A0, Range]}) ->
180 | is_valid_range(Range) andalso open_dicts(S) andalso open_dicts_with_keys(S);
181 | precondition(S, {call, ?SERVER, delete_exist, [_Name, _K]}) ->
182 | open_dicts(S) andalso open_dicts_with_keys(S);
183 | precondition(S, {call, ?SERVER, get_fail, [_Name, _K]}) ->
184 | open_dicts(S);
185 | precondition(S, {call, ?SERVER, get_exist, [_Name, _K]}) ->
186 | open_dicts(S) andalso open_dicts_with_keys(S);
187 | precondition(#state { open = Open }, {call, ?SERVER, put, [Name, _K, _V]}) ->
188 | dict:is_key(Name, Open);
189 | precondition(#state { open = Open, closed = Closed },
190 | {call, ?SERVER, open, [Name]}) ->
191 | (not (dict:is_key(Name, Open))) and (dict:is_key(Name, Closed));
192 | precondition(#state { open = Open, closed = Closed },
193 | {call, ?SERVER, close, [Name]}) ->
194 | (dict:is_key(Name, Open)) and (not dict:is_key(Name, Closed)).
195 |
196 | is_valid_range(#key_range{ from_key=FromKey, from_inclusive=FromIncl,
197 | to_key=ToKey, to_inclusive=ToIncl,
198 | limit=Limit })
199 | when
200 | (Limit == undefined) orelse (Limit > 0),
201 | is_binary(FromKey),
202 | (ToKey == undefined) orelse is_binary(ToKey),
203 | FromKey =< ToKey,
204 | is_boolean(FromIncl),
205 | is_boolean(ToIncl)
206 | ->
207 | if (FromKey == ToKey) ->
208 | (FromIncl == true) and (ToIncl == true);
209 | true ->
210 | true
211 | end;
212 | is_valid_range(_) ->
213 | false.
214 |
215 |
216 | %% Next state manipulation (abstract / concrete)
217 | next_state(S, _Res, {call, ?SERVER, fold_range, [_Tree, _F, _A0, _Range]}) ->
218 | S;
219 | next_state(S, _Res, {call, ?SERVER, get_fail, [_Name, _Key]}) ->
220 | S;
221 | next_state(S, _Res, {call, ?SERVER, get_exist, [_Name, _Key]}) ->
222 | S;
223 | next_state(#state { open = Open} = S, _Res,
224 | {call, ?SERVER, delete_exist, [Name, Key]}) ->
225 | S#state { open = dict:update(Name,
226 | fun(#tree { elements = Dict}) ->
227 | #tree { elements =
228 | dict:erase(Key, Dict)}
229 | end,
230 | Open)};
231 | next_state(#state { open = Open} = S, _Res,
232 | {call, ?SERVER, put, [Name, Key, Value]}) ->
233 | S#state { open = dict:update(
234 | Name,
235 | fun(#tree { elements = Dict}) ->
236 | #tree { elements =
237 | dict:store(Key, Value, Dict) }
238 | end,
239 | Open)};
240 | next_state(#state { open = Open, closed=Closed} = S,
241 | _Res, {call, ?SERVER, open, [Name]}) ->
242 | S#state { open = dict:store(Name, dict:fetch(Name, Closed) , Open),
243 | closed = dict:erase(Name, Closed) };
244 | next_state(#state { open = Open, closed=Closed} = S, _Res,
245 | {call, ?SERVER, close, [Name]}) ->
246 | S#state { closed = dict:store(Name, dict:fetch(Name, Open) , Closed),
247 | open = dict:erase(Name, Open) }.
248 |
249 | %% Postcondition check (concrete)
250 | postcondition(#state { open = Open},
251 | {call, ?SERVER, fold_range, [Tree, F, A0, Range]}, Result) ->
252 | #tree { elements = TDict } = dict:fetch(Tree, Open),
253 | DictResult = lists:sort(dict_range_query(TDict, F, A0, Range)),
254 | CallResult = lists:sort(Result),
255 | DictResult == CallResult;
256 | postcondition(_S,
257 | {call, ?SERVER, get_fail, [_Name, _Key]}, not_found) ->
258 | true;
259 | postcondition(#state { open = Open },
260 | {call, ?SERVER, get_exist, [Name, Key]}, {ok, Value}) ->
261 | #tree { elements = Elems } = dict:fetch(Name, Open),
262 | dict:fetch(Key, Elems) == Value;
263 | postcondition(_S, {call, ?SERVER, delete_exist, [_Name, _Key]}, ok) ->
264 | true;
265 | postcondition(_S, {call, ?SERVER, put, [_Name, _Key, _Value]}, ok) ->
266 | true;
267 | postcondition(_S, {call, ?SERVER, open, [_Name]}, ok) ->
268 | true;
269 | postcondition(_S, {call, ?SERVER, close, [_Name]}, ok) ->
270 | true;
271 | postcondition(_State, _Call, _Result) ->
272 | % error_logger:error_report([{not_matching_any_postcondition, _State, _Call, _Result}]),
273 | false.
274 |
275 |
276 | %% Main property. Running a random set of commands is in agreement
277 | %% with a dict.
278 | prop_dict_agree() ->
279 | ?FORALL(Cmds, commands(?MODULE),
280 | ?TRAPEXIT(
281 | begin
282 | hanoidb_drv:start_link(),
283 | {History,State,Result} = run_commands(?MODULE, Cmds),
284 | hanoidb_drv:stop(),
285 | cleanup_test_trees(State),
286 | ?WHENFAIL(io:format("History: ~w\nState: ~w\nResult: ~w\n",
287 | [History,State,Result]),
288 | Result =:= ok)
289 | end)).
290 |
291 | %% UNIT TESTS
292 | %% ----------------------------------------------------------------------
293 | test_tree_simple_1() ->
294 | {ok, Tree} = hanoidb:open("simple"),
295 | ok = hanoidb:put(Tree, <<>>, <<"data", 77:128>>),
296 | {ok, <<"data", 77:128>>} = hanoidb:get(Tree, <<>>),
297 | ok = hanoidb:close(Tree).
298 |
299 | test_tree_simple_2() ->
300 | {ok, Tree} = hanoidb:open("simple"),
301 | ok = hanoidb:put(Tree, <<"ã">>, <<"µ">>),
302 | {ok, <<"µ">>} = hanoidb:get(Tree, <<"ã">>),
303 | ok = hanoidb:delete(Tree, <<"ã">>),
304 | not_found = hanoidb:get(Tree, <<"ã">>),
305 | ok = hanoidb:close(Tree).
306 |
307 | test_tree_simple_4() ->
308 | Key = <<56,11,62,42,35,163,16,100,9,224,8,228,130,94,198,2,126,117,243,
309 | 1,122,175,79,159,212,177,30,153,71,91,85,233,41,199,190,58,3,
310 | 173,220,9>>,
311 | Value = <<212,167,12,6,105,152,17,80,243>>,
312 | {ok, Tree} = hanoidb:open("simple"),
313 | ok = hanoidb:put(Tree, Key, Value),
314 | ?assertEqual({ok, Value}, hanoidb:get(Tree, Key)),
315 | ok = hanoidb:close(Tree).
316 |
317 | test_tree_simple_5() ->
318 | {ok, Tree} = hanoidb:open("simple"),
319 | ok = hanoidb:put(Tree, <<"foo">>, <<"bar">>, 2),
320 | {ok, <<"bar">>} = hanoidb:get(Tree, <<"foo">>),
321 | ok = timer:sleep(3000),
322 | not_found = hanoidb:get(Tree, <<"foo">>),
323 | ok = hanoidb:close(Tree).
324 |
325 | test_tree() ->
326 | {ok, Tree} = hanoidb:open("simple2"),
327 | lists:foldl(fun(N,_) ->
328 |                         ok = hanoidb:put(Tree, <<N:128>>, <<"data",N:128>>)
329 | end,
330 | ok,
331 | lists:seq(2,10000,1)),
332 | % io:format(user, "INSERT DONE 1~n", []),
333 |
334 | lists:foldl(fun(N,_) ->
335 |                         ok = hanoidb:put(Tree, <<N:128>>, <<"data",N:128>>)
336 | end,
337 | ok,
338 | lists:seq(4000,6000,1)),
339 | % io:format(user, "INSERT DONE 2~n", []),
340 |
341 | hanoidb:delete(Tree, <<1500:128>>),
342 | % io:format(user, "DELETE DONE 3~n", []),
343 |
344 | {Time1,{ok,Count1}} = timer:tc(?MODULE, run_fold, [Tree,1000,2000,9]),
345 | % error_logger:info_msg("time to fold: ~p/sec (time=~p, count=~p)~n", [1000000/(Time1/Count1), Time1/1000000, Count1]),
346 |
347 | {Time2,{ok,Count2}} = timer:tc(?MODULE, run_fold, [Tree,1000,2000,1000]),
348 | % error_logger:info_msg("time to fold: ~p/sec (time=~p, count=~p)~n", [1000000/(Time2/Count2), Time2/1000000, Count2]),
349 | ok = hanoidb:close(Tree).
350 |
351 | run_fold(Tree,From,To,Limit) ->
352 |     F = fun(<<N:128>>, _Value, {N, C}) ->
353 | {N + 1, C + 1};
354 |            (<<1501:128>>, _Value, {1500, C}) -> % key <<1500:128>> was deleted above, so 1501 arrives while N is still 1500
355 | {1502, C + 1}
356 | end,
357 | {_, Count} = hanoidb:fold_range(Tree, F,
358 | {From, 0},
359 |                                     #key_range{from_key= <<From:128>>, to_key= <<(To+1):128>>, limit=Limit}),
360 | {ok, Count}.
361 |
362 |
363 | %% Command processing
364 | %% ----------------------------------------------------------------------
365 | cmd_close_args(#state { open = Open }) ->
366 | oneof(dict:fetch_keys(Open)).
367 |
368 | cmd_put_args(#state { open = Open }) ->
369 | ?LET({Name, Key, Value},
370 | {oneof(dict:fetch_keys(Open)), g_key(), g_value()},
371 | [Name, Key, Value]).
372 |
373 |
374 | cmd_get_fail_args(#state { open = Open}) ->
375 | ?LET(Name, g_open_tree(Open),
376 | ?LET(Key, g_non_existing_key(Name, Open),
377 | [Name, Key])).
378 |
379 | cmd_get_args(#state { open = Open}) ->
380 | ?LET(Name, g_non_empty_btree(Open),
381 | ?LET(Key, g_existing_key(Name, Open),
382 | [Name, Key])).
383 |
384 | cmd_delete_args(#state { open = Open}) ->
385 | ?LET(Name, g_non_empty_btree(Open),
386 | ?LET(Key, g_existing_key(Name, Open),
387 | [Name, Key])).
388 |
389 | cmd_sync_range_args(#state { open = Open }) ->
390 | ?LET(Tree, g_non_empty_btree(Open),
391 | ?LET({K1, K2}, {g_existing_key(Tree, Open),
392 | g_existing_key(Tree, Open)},
393 | [Tree, #key_range{from_key=K1, to_key=K2}])).
394 |
395 | cmd_sync_fold_range_args(State) ->
396 | ?LET([Tree, Range], cmd_sync_range_args(State),
397 | ?LET({F, Acc0}, g_fold_operation(),
398 | [Tree, F, Acc0, Range])).
399 |
400 | %% Context management
401 | %% ----------------------------------------------------------------------
402 | cleanup_test_trees(#state { open = Open, closed = Closed }) ->
403 | [cleanup_tree(N) || N <- dict:fetch_keys(Open)],
404 | [cleanup_tree(N) || N <- dict:fetch_keys(Closed)].
405 |
406 | cleanup_tree(Tree) ->
407 | case file:list_dir(Tree) of
408 | {error, enoent} ->
409 | ok;
410 | {ok, FileNames} ->
411 | [ok = file:delete(filename:join([Tree, Fname]))
412 | || Fname <- FileNames],
413 | file:del_dir(Tree)
414 | end.
415 |
416 | %% Various Helper routines
417 | %% ----------------------------------------------------------------------
418 |
419 | open_dicts_with_keys(#state { open = Open}) ->
420 | lists:any(fun({_, #tree { elements = D}}) ->
421 | dict:size(D) > 0
422 | end,
423 | dict:to_list(Open)).
424 |
425 | open_dicts(#state { open = Open}) ->
426 | dict:size(Open) > 0.
427 |
428 | closed_dicts(#state { closed = Closed}) ->
429 | dict:size(Closed) > 0.
430 |
431 | dict_range_query(Dict, Fun, Acc0, Range) ->
432 | KVs = dict_range_query(Dict, Range),
433 | lists:foldl(fun({K, V}, Acc) ->
434 | Fun(K, V, Acc)
435 | end,
436 | Acc0,
437 | KVs).
438 |
439 | dict_range_query(Dict, Range) ->
440 | [{K, V} || {K, V} <- dict:to_list(Dict),
441 | ?KEY_IN_RANGE(K, Range)].
442 |
443 |
--------------------------------------------------------------------------------
/test/hanoidb_writer_tests.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(hanoidb_writer_tests).
26 |
27 | -ifdef(TEST).
28 |
29 | -ifdef(TRIQ).
30 | -include_lib("triq/include/triq.hrl").
31 | -include_lib("triq/include/triq_statem.hrl").
32 | -else.
33 | -include_lib("proper/include/proper.hrl").
34 | -endif.
35 | -include_lib("eunit/include/eunit.hrl").
36 | -endif.
37 |
38 | -ifdef(PROPER).
39 | -behaviour(proper_statem).
40 | -endif.
41 |
42 |
43 | -include("include/hanoidb.hrl").
44 |
45 | -compile(export_all).
46 |
47 | simple_test() ->
48 |
49 | file:delete("testdata"),
50 | {ok, BT} = hanoidb_writer:open("testdata"),
51 | ok = hanoidb_writer:add(BT, <<"A">>, <<"Avalue">>),
52 | ok = hanoidb_writer:add(BT, <<"B">>, <<"Bvalue">>),
53 | ok = hanoidb_writer:close(BT),
54 |
55 | {ok, IN} = hanoidb_reader:open("testdata"),
56 | {ok, <<"Avalue">>} = hanoidb_reader:lookup(IN, <<"A">>),
57 | ok = hanoidb_reader:close(IN),
58 |
59 | ok = file:delete("testdata").
60 |
61 |
62 | simple1_test() ->
63 |
64 | file:delete("testdata"),
65 | {ok, BT} = hanoidb_writer:open("testdata", [{block_size, 102},{expiry_secs, 0}]),
66 |
67 | Max = 102,
68 | Seq = lists:seq(0, Max),
69 |
70 | {Time1,_} = timer:tc(
71 | fun() ->
72 | lists:foreach(
73 | fun(Int) ->
74 |                         ok = hanoidb_writer:add(BT, <<Int:128>>, <<"valuevalue/", Int:128>>)
75 | end,
76 | Seq),
77 | ok = hanoidb_writer:close(BT)
78 | end,
79 | []),
80 |
81 | error_logger:info_msg("time to insert: ~p/sec~n", [1000000/(Time1/Max)]),
82 |
83 | {ok, IN} = hanoidb_reader:open("testdata", [{expiry_secs,0}]),
84 | Middle = Max div 2,
85 |     io:format("LOOKING UP ~p~n", [<<Middle:128>>]),
86 |     {ok, <<"valuevalue/", Middle:128>>} = hanoidb_reader:lookup(IN, <<Middle:128>>),
87 |
88 |
89 | {Time2,Count} = timer:tc(
90 | fun() -> hanoidb_reader:fold(fun(_Key, <<"valuevalue/", N:128>>, N) ->
91 | N+1
92 | end,
93 | 0,
94 | IN)
95 | end,
96 | []),
97 |
98 | io:format("time to scan: ~p/sec~n", [1000000/(Time2 div Max)]),
99 |
100 | Max = Count-1,
101 |
102 | {Time3,{done,Count2}} = timer:tc(
103 | fun() -> hanoidb_reader:range_fold(fun(_Key, <<"valuevalue/", N:128>>, N) ->
104 | % io:format("[~p]~n", N),
105 | N+1
106 | end,
107 | 0,
108 | IN,
109 | #key_range{ from_key= <<>>, to_key=undefined })
110 | end,
111 | []),
112 |
113 |
114 |
115 | %error_logger:info_msg("time to range_fold: ~p/sec~n", [1000000/(Time3 div Max)]),
116 |
117 | io:format("count2=~p~n", [Count2]),
118 |
119 | Max = Count2-1,
120 |
121 | ok = hanoidb_reader:close(IN).
122 |
123 |
--------------------------------------------------------------------------------
/tools/basho_bench_driver_hanoidb.erl:
--------------------------------------------------------------------------------
1 | %% ----------------------------------------------------------------------------
2 | %%
3 | %% hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
4 | %%
5 | %% Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
6 | %% http://trifork.com/ info@trifork.com
7 | %%
8 | %% Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
9 | %% http://basho.com/ info@basho.com
10 | %%
11 | %% This file is provided to you under the Apache License, Version 2.0 (the
12 | %% "License"); you may not use this file except in compliance with the License.
13 | %% You may obtain a copy of the License at
14 | %%
15 | %% http://www.apache.org/licenses/LICENSE-2.0
16 | %%
17 | %% Unless required by applicable law or agreed to in writing, software
18 | %% distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
19 | %% WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
20 | %% License for the specific language governing permissions and limitations
21 | %% under the License.
22 | %%
23 | %% ----------------------------------------------------------------------------
24 |
25 | -module(basho_bench_driver_hanoidb).
26 |
27 | -record(state, { tree,
28 | filename,
29 | flags,
30 | sync_interval,
31 | last_sync }).
32 |
33 | -export([new/1,
34 | run/4]).
35 |
36 | -include("hanoidb.hrl").
37 | -include_lib("basho_bench/include/basho_bench.hrl").
38 |
39 | -record(key_range, { from_key = <<>> :: binary(),
40 | from_inclusive = true :: boolean(),
41 | to_key :: binary() | undefined,
42 | to_inclusive = false :: boolean(),
43 | limit :: pos_integer() | undefined }).
44 |
45 | %% ====================================================================
46 | %% API
47 | %% ====================================================================
48 |
49 | new(_Id) ->
50 |     %% Make sure hanoidb is available
51 | case code:which(hanoidb) of
52 | non_existing ->
53 | ?FAIL_MSG("~s requires hanoidb to be available on code path.\n",
54 | [?MODULE]);
55 | _ ->
56 | ok
57 | end,
58 |
59 | %% Get the target directory
60 | Dir = basho_bench_config:get(hanoidb_dir, "."),
61 | Filename = filename:join(Dir, "test.hanoidb"),
62 | Config = basho_bench_config:get(hanoidb_flags, []),
63 |
64 | %% Look for sync interval config
65 | case basho_bench_config:get(hanoidb_sync_interval, infinity) of
66 | Value when is_integer(Value) ->
67 | SyncInterval = Value;
68 | infinity ->
69 | SyncInterval = infinity
70 | end,
71 |
72 |     %% Open the store with the configured flags
73 | case hanoidb:open(Filename, Config) of
74 | {error, Reason} ->
75 | ?FAIL_MSG("Failed to open hanoidb in ~s: ~p\n", [Filename, Reason]);
76 | {ok, FBTree} ->
77 | {ok, #state { tree = FBTree,
78 | filename = Filename,
79 | sync_interval = SyncInterval,
80 | last_sync = os:timestamp() }}
81 | end.
82 |
83 | run(get, KeyGen, _ValueGen, State) ->
84 | case hanoidb:lookup(State#state.tree, KeyGen()) of
85 | {ok, _Value} ->
86 | {ok, State};
87 | not_found ->
88 | {ok, State};
89 | {error, Reason} ->
90 | {error, Reason}
91 | end;
92 | run(put, KeyGen, ValueGen, State) ->
93 | case hanoidb:put(State#state.tree, KeyGen(), ValueGen()) of
94 | ok ->
95 | {ok, State};
96 | {error, Reason} ->
97 | {error, Reason}
98 | end;
99 | run(delete, KeyGen, _ValueGen, State) ->
100 | case hanoidb:delete(State#state.tree, KeyGen()) of
101 | ok ->
102 | {ok, State};
103 | {error, Reason} ->
104 | {error, Reason}
105 | end;
106 |
107 | run(fold_100, KeyGen, _ValueGen, State) ->
108 |     [From | _] = Keys = lists:usort([KeyGen(), KeyGen()]), To = lists:last(Keys), % usort collapses equal keys, so a 2-element match could fail
109 | case hanoidb:sync_fold_range(State#state.tree,
110 | fun(_Key,_Value,Count) ->
111 | Count+1
112 | end,
113 | 0,
114 | #key_range{ from_key=From,
115 | to_key=To,
116 | limit=100 }) of
117 |         Count when Count >= 0, Count =< 100 -> % ',' means "and"; the original ';' ("or") made this guard always true
118 | {ok,State};
119 | Count ->
120 | {error, {bad_fold_count, Count}}
121 | end.
122 |
--------------------------------------------------------------------------------
/tools/visualize-hanoi.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | ## ----------------------------------------------------------------------------
4 | ##
5 | ## hanoidb: LSM-trees (Log-Structured Merge Trees) Indexed Storage
6 | ##
7 | ## Copyright 2011-2012 (c) Trifork A/S. All Rights Reserved.
8 | ## http://trifork.com/ info@trifork.com
9 | ##
10 | ## Copyright 2012 (c) Basho Technologies, Inc. All Rights Reserved.
11 | ## http://basho.com/ info@basho.com
12 | ##
13 | ## This file is provided to you under the Apache License, Version 2.0 (the
14 | ## "License"); you may not use this file except in compliance with the License.
15 | ## You may obtain a copy of the License at
16 | ##
17 | ## http://www.apache.org/licenses/LICENSE-2.0
18 | ##
19 | ## Unless required by applicable law or agreed to in writing, software
20 | ## distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
21 | ## WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
22 | ## License for the specific language governing permissions and limitations
23 | ## under the License.
24 | ##
25 | ## ----------------------------------------------------------------------------
26 |
27 | function periodic() {
28 | t=0
29 | while sleep 1 ; do
30 | let "t=t+1"
31 | printf "%5d [" "$t"
32 |
33 | for ((i=0; i<35; i++)) ; do
34 | if ! [ -f "A-$i.data" ] ; then
35 | echo -n " "
36 | elif ! [ -f "B-$i.data" ] ; then
37 | echo -n "-"
38 | elif ! [ -f "C-$i.data" ] ; then
39 | echo -n "#"
40 | elif ! [ -f "X-$i.data" ] ; then
41 | echo -n "="
42 | else
43 | echo -n "*"
44 | fi
45 | done
46 | echo
47 | done
48 | }
49 |
50 | merge_diff() {   # approximate progress of the merge at level $ID
51 |     SA=`ls -l A-${ID}.data 2> /dev/null | awk '{print $5}'`   # size of the A input
52 |     SB=`ls -l B-${ID}.data 2> /dev/null | awk '{print $5}'`   # size of the B input
53 |     SX=`ls -l X-${ID}.data 2> /dev/null | awk '{print $5}'`   # size of the X output so far
54 |     if [ \( -n "$SA" \) -a \( -n "$SB" \) -a \( -n "$SX" \) ]; then
55 |         export RES=`expr ${SX}0 / \( $SA + $SB \)`   # ${SX}0 appends a zero (x10), so RES is progress in tenths
56 | else
57 | export RES="?"
58 | fi
59 | }
60 |
61 | function dynamic() {
62 | local old s t start now
63 | t=0
64 | start=`date +%s`
65 | while true ; do
66 | s=""
67 | for ((i=8; i<22; i++)) ; do
68 | if [ -f "C-$i.data" ] ; then
69 | s="${s}C"
70 | else
71 | s="$s "
72 | fi
73 | if [ -f "B-$i.data" ] ; then
74 | s="${s}B"
75 | else
76 | s="$s "
77 | fi
78 | if [ -f "A-$i.data" ] ; then
79 | s="${s}A"
80 | else
81 | s="$s "
82 | fi
83 | if [ -f "X-$i.data" ] ; then
84 | export ID="$i"
85 | merge_diff
86 | s="${s}$RES"
87 | elif [ -f "M-$i.data" ] ; then
88 | s="${s}M"
89 | else
90 | s="$s "
91 | fi
92 | s="$s|"
93 | done
94 |
95 | if [[ "$s" != "$old" ]] ; then
96 | let "t=t+1"
97 | now=`date +%s`
98 | let "now=now-start"
99 | free=`df -m . 2> /dev/null | tail -1 | awk '{print $4}'`
100 | used=`du -m 2> /dev/null | awk '{print $1}' `
101 | printf "%5d %6d [%s\n" "$t" "$now" "$s ${used}MB (${free}MB free)"
102 | old="$s"
103 | else
104 | # Sleep a little bit:
105 | sleep 1
106 | fi
107 | done
108 | }
109 |
110 | dynamic
111 |
--------------------------------------------------------------------------------