├── .gitignore ├── AUTHORS ├── BUILD.bazel ├── CMakeLists.txt ├── LICENSE ├── README.md ├── WORKSPACE ├── bitmap_benchmark_test.cc ├── build_benchmark.cc ├── cmake ├── README.md ├── absl.cmake ├── benchmark.cmake ├── benchmarks.cmake ├── boost.cmake ├── common.cmake ├── croaring.cmake ├── csvparser.cmake ├── cuckooindex.cmake ├── googletest.cmake ├── leveldb.cmake ├── protobuf.cmake ├── tests.cmake └── xor_singleheader.cmake ├── common ├── BUILD.bazel ├── bit_packing.h ├── bit_packing_benchmark.cc ├── bit_packing_test.cc ├── bitmap.h ├── byte_coding.h ├── byte_coding_test.cc ├── profiling.cc ├── profiling.h ├── rle_bitmap.cc ├── rle_bitmap.h └── rle_bitmap_test.cc ├── croaring.BUILD ├── csv-parser.BUILD ├── cuckoo_index.cc ├── cuckoo_index.h ├── cuckoo_index_test.cc ├── cuckoo_kicker.cc ├── cuckoo_kicker.h ├── cuckoo_kicker_test.cc ├── cuckoo_utils.cc ├── cuckoo_utils.h ├── cuckoo_utils_test.cc ├── data.cc ├── data.h ├── data_test.cc ├── docs └── contributing.md ├── evaluate.cc ├── evaluation.proto ├── evaluation_utils.cc ├── evaluation_utils.h ├── evaluation_utils_test.cc ├── evaluator.cc ├── evaluator.h ├── fingerprint_store.cc ├── fingerprint_store.h ├── fingerprint_store_test.cc ├── index_structure.h ├── leveldb.BUILD ├── lookup_benchmark.cc ├── per_stripe_bloom.h ├── per_stripe_bloom_test.cc ├── per_stripe_xor.h ├── per_stripe_xor_test.cc ├── xor_filter.h ├── xor_filter_test.cc ├── xor_singleheader.BUILD ├── zone_map.h └── zone_map_test.cc /.gitignore: -------------------------------------------------------------------------------- 1 | # Bazel build output 2 | bazel-* 3 | 4 | # CMake build output 5 | build/ 6 | 7 | # IDE files 8 | .idea/ 9 | .vscode/ 10 | 11 | # Generated protocol buffer sources 12 | *.pb.cc 13 | *.pb.h 14 | 15 | # Data 16 | Vehicle__Snowmobile__and_Boat_Registrations.csv 17 | -------------------------------------------------------------------------------- /AUTHORS: 
-------------------------------------------------------------------------------- 1 | # This is the list of Cuckoo Index's significant contributors. 2 | # 3 | # This does not necessarily list everyone who has contributed code, 4 | # especially since many employees of one corporation may be contributing. 5 | # To see the full list of contributors, see the revision history in 6 | # source control. 7 | Google LLC 8 | Damian Chromejko 9 | Alexander Hall 10 | Andreas Kipf 11 | -------------------------------------------------------------------------------- /CMakeLists.txt: -------------------------------------------------------------------------------- 1 | # Note: CMake support is community-based. The maintainers do not use CMake internally. 2 | 3 | cmake_minimum_required(VERSION 3.18) # For SOURCE_SUBDIR in FetchContent_Declare 4 | 5 | project(cuckooindex CXX) 6 | 7 | list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/cmake") 8 | set(CMAKE_CXX_STANDARD 17) 9 | set(CMAKE_CXX_STANDARD_REQUIRED ON) 10 | set(CMAKE_EXPORT_COMPILE_COMMANDS ON) 11 | add_compile_options(-Wall -Wextra) 12 | 13 | enable_testing() 14 | 15 | option(CUCKOOINDEX_BUILD_TESTS "Builds the cuckoo index tests." ON) 16 | option(CUCKOOINDEX_BUILD_BENCHMARKS "Builds the cuckoo index benchmarks if the tests are built as well." ON) 17 | 18 | if(CUCKOOINDEX_BUILD_TESTS) 19 | include(tests) 20 | 21 | if(CUCKOOINDEX_BUILD_BENCHMARKS) 22 | include(benchmarks) 23 | endif() 24 | endif() 25 | 26 | include(cuckooindex) 27 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **NOTE** This is not an officially supported Google product. 
2 | 3 | # Cuckoo Index 4 | 5 | ## Overview 6 | 7 | [Cuckoo Index](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) (CI) is a lightweight secondary index structure that represents the many-to-many relationship between keys and partitions of columns in a highly space-efficient way. At its core, CI associates variable-sized fingerprints in a [Cuckoo filter](https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) with compressed bitmaps indicating qualifying partitions. 8 | 9 | ## What Problem Does It Solve? 10 | 11 | The problem of finding all partitions that possibly contain a given lookup key is traditionally solved by maintaining one filter (e.g., a Bloom filter) per partition that indexes all unique key values contained in this partition: 12 | 13 | ``` 14 | Partition 0: 15 | A, B => Bloom filter 0 16 | 17 | Partition 1: 18 | B, C => Bloom filter 1 19 | ... 20 | ``` 21 | 22 | To identify all partitions containing a key, we need to probe all per-partition filters (which could be many). Since a Bloom filter may return false positives, there is a chance (of e.g. 1%) that we accidentally identify a negative partition as positive. In the above example, a lookup for key A may return Partition 0 (true positive) and 1 (false positive). Depending on the storage medium, a false positive partition can be very expensive (e.g., many milliseconds on disk). 23 | 24 | Furthermore, secondary columns typically contain many duplicates (also across partitions). With the per-partition filter design, these duplicates may be indexed in multiple filters (in the worst case, in all filters). In the above example, the key B is redundantly indexed in Bloom filter 0 and 1. 25 | 26 | Cuckoo Index addresses both of these drawbacks of per-partition filters. 27 | 28 | ## Features 29 | 30 | * 100% correct results for lookups with occurring keys (as opposed to per-partition filters). 31 | * Configurable scan rate (ratio of false positive partitions) for lookups with non-occurring keys. 
32 | * Much smaller footprint size than full-fledged indexes that store full-sized keys. 33 | * Smaller footprint size than per-partition filters for low-to-medium cardinality columns. 34 | 35 | ## Limitations 36 | 37 | * Requires access to all keys at build time. 38 | * Relatively high build time (in O(n) but with a high constant factor) compared to e.g. per-partition Bloom filters. 39 | * Once built, CI is immutable but fast to query (it uses a [rank support structure](https://www.cs.cmu.edu/~dga/papers/zhou-sea2013.pdf) for efficient rank calls). 40 | 41 | ## Running Experiments 42 | 43 | Prepare a dataset in a CSV format that you are going to use. One of the datasets we used was DMV [Vehicle, Snowmobile, and Boat Registrations](https://catalog.data.gov/dataset/vehicle-snowmobile-and-boat-registrations). 44 | 45 | ``` 46 | wget -c https://data.ny.gov/api/views/w4pv-hbkt/rows.csv -O Vehicle__Snowmobile__and_Boat_Registrations.csv 47 | ``` 48 | 49 | Add the file to the `data` dependencies in the `BUILD.bazel` file. 50 | 51 | ``` 52 | data = [ 53 | # Put your csv files here 54 | "Vehicle__Snowmobile__and_Boat_Registrations.csv" 55 | ], 56 | ``` 57 | 58 | For footprint experiments, run the following command, specifying the path to the data file, columns to test, and the tests to run. 59 | 60 | ``` 61 | bazel run -c opt --cxxopt="-std=c++17" :evaluate -- \ 62 | --input_csv_path="Vehicle__Snowmobile__and_Boat_Registrations.csv" \ 63 | --columns_to_test="City,Zip,Color" \ 64 | --test_cases="positive_uniform,positive_distinct,positive_zipf,negative,mixed" \ 65 | --output_csv_path="results.csv" 66 | ``` 67 | 68 | For lookup performance experiments, run the following command, specifying the path to the data file, and columns to test. 69 | 70 | **NOTE** You might want to use fewer rows for lookup experiments as the benchmarks are quite time-consuming. 
71 | 72 | ``` 73 | bazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off :lookup_benchmark -- \ 74 | --input_csv_path="Vehicle__Snowmobile__and_Boat_Registrations.csv" \ 75 | --columns_to_test="City,Zip,Color" 76 | ``` 77 | 78 | ## CMake support 79 | 80 | **NOTE** CMake support is community-based. The maintainers do not use CMake internally. 81 | 82 | For further information, have a look at the [cmake README](cmake/README.md). 83 | 84 | ## Code Organization 85 | 86 | #### Evaluation Framework 87 | 88 | * Evaluate (evaluate.cc): *Entry point (binary) into our evaluation framework with instantiations of all indexes.* 89 | * Evaluator (evaluator.h): *Evaluation framework.* 90 | * Table/Column (data.h): *Integer columns that we run the benchmarks on (string columns are dict-encoded).* 91 | * IndexStructure (index_structure.h): *Interface shared among all indexes.* 92 | 93 | #### Cuckoo Index 94 | 95 | * CuckooIndex (cuckoo_index.h): *Main class of Cuckoo Index.* 96 | * CuckooKicker (cuckoo_kicker.h): *A heuristic that finds a close-to-optimal assignment of keys to buckets (in terms of the ratio of items residing in primary buckets).* 97 | * FingerprintStore (fingerprint_store.h): *Stores variable-sized fingerprints in bitpacked format.* 98 | * RleBitmap (rle_bitmap.h): *An RLE-based (bitwise, unaligned) bitmap representation (for sparse bitmaps we use position lists).* 99 | * BitPackedReader (bit_packing.h): *A helper class for storing & retrieving bitpacked data.* 100 | 101 | ## Cite 102 | 103 | Please cite our [VLDB 2020 paper](https://www.vldb.org/pvldb/vol13/p3559-kipf.pdf) if you use this code in your own work: 104 | 105 | ``` 106 | @article{cuckoo-index, 107 | author = {Kipf, Andreas and Chromejko, Damian and Hall, Alexander and Boncz, Peter and Andersen, David}, 108 | title = {Cuckoo Index: A Lightweight Secondary Index Structure}, 109 | year = {2020}, 110 | issue_date = {September 2020}, 111 | publisher = {VLDB Endowment}, 112 | volume = {13}, 113 | number = 
{13}, 114 | issn = {2150-8097}, 115 | url = {https://doi.org/10.14778/3424573.3424577}, 116 | doi = {10.14778/3424573.3424577}, 117 | journal = {Proc. VLDB Endow.}, 118 | month = sep, 119 | pages = {3559-3572}, 120 | numpages = {14} 121 | } 122 | ``` 123 | -------------------------------------------------------------------------------- /WORKSPACE: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | 15 | workspace(name = "com_github_google_cuckoo_index") 16 | 17 | load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive") 18 | load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository") 19 | 20 | # abseil-cpp. 21 | http_archive( 22 | name = "com_google_absl", 23 | sha256 = "d3311ead20ffce78c7fde96df803b73d0de8d992d46bdf36753954bd2d459f31", 24 | strip_prefix = "abseil-cpp-df3ea785d8c30a9503321a3d35ee7d35808f190d", 25 | urls = ["https://github.com/abseil/abseil-cpp/archive/df3ea785d8c30a9503321a3d35ee7d35808f190d.zip"], 26 | ) 27 | 28 | # Google Test. 29 | http_archive( 30 | name = "com_google_googletest", 31 | sha256 = "7c7709af5d0c3c2514674261f9fc321b3f1099a2c57f13d0e56187d193c07e81", 32 | strip_prefix = "googletest-10b1902d893ea8cc43c69541d70868f91af3646b", 33 | urls = ["https://github.com/google/googletest/archive/10b1902d893ea8cc43c69541d70868f91af3646b.zip"], 34 | ) 35 | 36 | # Google Benchmark. 
37 | http_archive( 38 | name = "com_google_benchmark", 39 | sha256 = "e777f978593ea6db38356ce09ec3902e839b3037a9a19ff543e6f901e50cc773", 40 | strip_prefix = "benchmark-090faecb454fbd6e6e17a75ef8146acb037118d4", 41 | urls = ["https://github.com/google/benchmark/archive/090faecb454fbd6e6e17a75ef8146acb037118d4.zip"], 42 | ) 43 | 44 | # C++ rules for Bazel. 45 | http_archive( 46 | name = "rules_cc", 47 | sha256 = "954b7a3efc8752da957ae193a13b9133da227bdacf5ceb111f2e11264f7e8c95", 48 | strip_prefix = "rules_cc-9e10b8a6db775b1ecd358d8ddd3dab379a2c29a5", 49 | urls = ["https://github.com/bazelbuild/rules_cc/archive/9e10b8a6db775b1ecd358d8ddd3dab379a2c29a5.zip"], 50 | ) 51 | 52 | # Build rules for Boost. 53 | # Apache License 2.0 for the rules. 54 | # Boost Software License for boost (similar to MIT or BSD). 55 | git_repository( 56 | name = "com_github_nelhage_rules_boost", 57 | commit = "353a58c5d231293795e7b63c2c21467922153add", 58 | remote = "https://github.com/nelhage/rules_boost", 59 | shallow_since = "1580416893 -0800", 60 | ) 61 | 62 | load("@com_github_nelhage_rules_boost//:boost/boost.bzl", "boost_deps") 63 | 64 | boost_deps() 65 | 66 | # Protocol buffers. 
67 | http_archive( 68 | name = "com_google_protobuf", 69 | sha256 = "65e020a42bdab44a66664d34421995829e9e79c60e5adaa08282fd14ca552f57", 70 | strip_prefix = "protobuf-3.15.6", 71 | urls = ["https://github.com/protocolbuffers/protobuf/archive/v3.15.6.tar.gz"], 72 | ) 73 | 74 | load("@com_google_protobuf//:protobuf_deps.bzl", "protobuf_deps") 75 | 76 | protobuf_deps() 77 | 78 | http_archive( 79 | name = "rules_proto", 80 | sha256 = "602e7161d9195e50246177e7c55b2f39950a9cf7366f74ed5f22fd45750cd208", 81 | strip_prefix = "rules_proto-97d8af4dc474595af3900dd85cb3a29ad28cc313", 82 | urls = [ 83 | "https://mirror.bazel.build/github.com/bazelbuild/rules_proto/archive/97d8af4dc474595af3900dd85cb3a29ad28cc313.tar.gz", 84 | "https://github.com/bazelbuild/rules_proto/archive/97d8af4dc474595af3900dd85cb3a29ad28cc313.tar.gz", 85 | ], 86 | ) 87 | 88 | load("@rules_proto//proto:repositories.bzl", "rules_proto_dependencies", "rules_proto_toolchains") 89 | 90 | rules_proto_dependencies() 91 | 92 | rules_proto_toolchains() 93 | 94 | http_archive( 95 | name = "rules_cc", 96 | sha256 = "35f2fb4ea0b3e61ad64a369de284e4fbbdcdba71836a5555abb5e194cf119509", 97 | strip_prefix = "rules_cc-624b5d59dfb45672d4239422fa1e3de1822ee110", 98 | urls = [ 99 | "https://mirror.bazel.build/github.com/bazelbuild/rules_cc/archive/624b5d59dfb45672d4239422fa1e3de1822ee110.tar.gz", 100 | "https://github.com/bazelbuild/rules_cc/archive/624b5d59dfb45672d4239422fa1e3de1822ee110.tar.gz", 101 | ], 102 | ) 103 | 104 | load("@rules_cc//cc:repositories.bzl", "rules_cc_dependencies") 105 | 106 | rules_cc_dependencies() 107 | 108 | # Roaring. 
109 | # Apache License 2.0 110 | http_archive( 111 | name = "CRoaring", 112 | build_file = "@//:croaring.BUILD", 113 | sha256 = "b26a1878c1016495c758e98b1ec62ed36bb401afd0d0f5f84f37615a724d2b1d", 114 | strip_prefix = "CRoaringUnityBuild-c1d1a754faa6451436efaffa3fe449edc7710b65", 115 | urls = ["https://github.com/lemire/CRoaringUnityBuild/archive/c1d1a754faa6451436efaffa3fe449edc7710b65.zip"], 116 | ) 117 | 118 | # CSV parser. 119 | # MIT License 120 | http_archive( 121 | name = "csv-parser", 122 | build_file = "@//:csv-parser.BUILD", 123 | sha256 = "550681980b7012dd9ef64dc46ff24044444c4f219b34b96f15fdc7bbe3f1fdc6", 124 | strip_prefix = "csv-parser-6fb1f43ad43fc7962baa3b0fe524b282a56ae4b0", 125 | urls = ["https://github.com/vincentlaucsb/csv-parser/archive/6fb1f43ad43fc7962baa3b0fe524b282a56ae4b0.zip"], 126 | ) 127 | 128 | # XOR filter 129 | # Apache License 2.0 130 | http_archive( 131 | name = "xor_singleheader", 132 | build_file = "@//:xor_singleheader.BUILD", 133 | sha256 = "c58d0d21404c11ccf509e9435693102ca5806ea75321d39afb894314a882f3a6", 134 | strip_prefix = "xor_singleheader-6cea6a4dcf2f18a0e3b9b9e0b94d6012b804ffa1", 135 | urls = ["https://github.com/FastFilter/xor_singleheader/archive/6cea6a4dcf2f18a0e3b9b9e0b94d6012b804ffa1.zip"], 136 | ) 137 | 138 | # LevelDB (just the Bloom filter) 139 | # BSD 3-Clause "New" or "Revised" License 140 | http_archive( 141 | name = "leveldb", 142 | build_file = "@//:leveldb.BUILD", 143 | sha256 = "2d9cc0a0c4bd1a98d6110f1abeb518086d9448ce74a0ee9deb197df4facefb04", 144 | strip_prefix = "leveldb-78b39d68c15ba020c0d60a3906fb66dbf1697595", 145 | urls = ["https://github.com/google/leveldb/archive/78b39d68c15ba020c0d60a3906fb66dbf1697595.zip"], 146 | ) 147 | -------------------------------------------------------------------------------- /bitmap_benchmark_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 
2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: bitmap_benchmark_test.cc 17 | // ----------------------------------------------------------------------------- 18 | // 19 | // Benchmarks for Roaring and ZSTD-compressed bitmaps. 20 | // 21 | // bazel run -c opt --cxxopt='-std=c++17' --dynamic_mode=off 22 | // :bitmap_benchmark_test 23 | // 24 | // Example run with bitmap index on DMV `city` column: 25 | // - 2^13 rows per stripe (1,447 stripes) 26 | // - 47,085,380 bits of which 2,021,715 are set (density of 4.3%) 27 | // 28 | // Example run with bitmap index on DMV `city` column: 29 | // - 2^13 rows per stripe (1,447 stripes) 30 | // - 47,085,380 bits of which 2,021,715 are set (density of 4.3%) 31 | // 32 | // Run on (8 X 2300 MHz CPUs); 2019-10-25T05:57:56 33 | // CPU: Intel Haswell with HyperThreading (4 cores) dL1:32KB dL2:256KB dL3:45MB 34 | // Benchmark Time(ns) CPU(ns) Iterations 35 | // ---------------------------------------------------------------------- 36 | // BM_RLECompress 123830177 123814940 5 37 | // BM_RLEDecompress 162191030 162167372 4 38 | // BM_RLEDecompressPartial 246357 246357 3100 39 | // BM_RoaringCompressFromIndexes 20305821 20308686 34 40 | // BM_RoaringCompressFromBitmap 41166051 41159732 17 41 | // BM_RoaringDecompressToIndexes 5048275 5049205 100 42 | // BM_RoaringDecompressToBitmap 17180132 17180604 40 43 | // BM_ZstdCompressBitmapBytes 
11173145 11174765 63 44 | // BM_ZstdDecompressBitmapBytes 6565571 6566545 100 45 | 46 | #include "common/bitmap.h" 47 | #include "common/rle_bitmap.h" 48 | #include "evaluation_utils.h" 49 | #include "benchmark/benchmark.h" 50 | #include "absl/flags/flag.h" 51 | #include "absl/memory/memory.h" 52 | #include "absl/strings/string_view.h" 53 | 54 | ABSL_FLAG(std::string, path, "", "Path to bitmap file."); 55 | 56 | namespace ci { 57 | namespace { 58 | 59 | // **** Helper methods **** 60 | 61 | std::vector<uint32_t> IndexesFromBitmap(const Bitmap64& bitmap) { 62 | std::vector<uint32_t> indexes; 63 | for (const uint32_t index : bitmap.TrueBitIndices()) { 64 | indexes.push_back(index); 65 | } 66 | return indexes; 67 | } 68 | 69 | std::vector<uint32_t> IndexesFromRoaring(const Roaring& roaring) { 70 | std::vector<uint32_t> indexes; 71 | indexes.resize(roaring.cardinality()); 72 | roaring.toUint32Array(indexes.data()); 73 | return indexes; 74 | } 75 | 76 | Roaring RoaringFromIndexes(const std::vector<uint32_t>& indexes) { 77 | return Roaring(indexes.size(), indexes.data()); 78 | } 79 | 80 | std::string RoaringToBytes(const Roaring& roaring) { 81 | std::string result; 82 | result.resize(roaring.getSizeInBytes(/*portable=*/false)); 83 | roaring.write(result.data()); 84 | return result; 85 | } 86 | 87 | Roaring RoaringFromBytes(const std::string& bytes) { 88 | return Roaring::readSafe(bytes.data(), bytes.size()); 89 | } 90 | 91 | // **** RLE benchmarks **** 92 | 93 | void BM_RLECompress(benchmark::State& state) { 94 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 95 | 96 | while (state.KeepRunning()) { 97 | const RleBitmap rle_bitmap(bitmap); 98 | } 99 | } 100 | BENCHMARK(BM_RLECompress); 101 | 102 | void BM_RLEDecompress(benchmark::State& state) { 103 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 104 | const RleBitmap rle_bitmap(bitmap); 105 | 106 | while (state.KeepRunning()) { 107 | const Bitmap64 extracted = 108 | rle_bitmap.Extract(/*offset=*/0, 
/*size=*/bitmap.bits()); 109 | } 110 | } 111 | BENCHMARK(BM_RLEDecompress); 112 | 113 | // Extract a bitmap that corresponds to a single unique value from a 114 | // back-to-back encoded bitmap (global bitmap). Each bit in this extracted 115 | // bitmap would correspond to a stripe. 116 | void BM_RLEDecompressPartial(benchmark::State& state) { 117 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 118 | const RleBitmap rle_bitmap(bitmap); 119 | 120 | while (state.KeepRunning()) { 121 | const Bitmap64 extracted = rle_bitmap.Extract(/*offset=*/bitmap.bits() / 2, 122 | /*size=*/128); 123 | } 124 | } 125 | BENCHMARK(BM_RLEDecompressPartial); 126 | 127 | // **** Roaring benchmarks **** 128 | 129 | void BM_RoaringCompressFromIndexes(benchmark::State& state) { 130 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 131 | const std::vector<uint32_t> indexes = IndexesFromBitmap(bitmap); 132 | 133 | while (state.KeepRunning()) { 134 | const Roaring roaring = RoaringFromIndexes(indexes); 135 | benchmark::DoNotOptimize(RoaringToBytes(roaring)); 136 | } 137 | } 138 | BENCHMARK(BM_RoaringCompressFromIndexes); 139 | 140 | void BM_RoaringCompressFromBitmap(benchmark::State& state) { 141 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 142 | 143 | while (state.KeepRunning()) { 144 | Roaring roaring; 145 | for (const uint32_t index : bitmap.TrueBitIndices()) { 146 | roaring.add(index); 147 | } 148 | benchmark::DoNotOptimize(RoaringToBytes(roaring)); 149 | } 150 | } 151 | BENCHMARK(BM_RoaringCompressFromBitmap); 152 | 153 | void BM_RoaringDecompressToIndexes(benchmark::State& state) { 154 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 155 | const std::vector<uint32_t> indexes = IndexesFromBitmap(bitmap); 156 | Roaring roaring = RoaringFromIndexes(indexes); 157 | const std::string bytes = RoaringToBytes(roaring); 158 | 159 | while (state.KeepRunning()) { 160 | const Roaring roaring = RoaringFromBytes(bytes); 161 
| benchmark::DoNotOptimize(IndexesFromRoaring(roaring)); 162 | } 163 | } 164 | BENCHMARK(BM_RoaringDecompressToIndexes); 165 | 166 | void BM_RoaringDecompressToBitmap(benchmark::State& state) { 167 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 168 | const std::vector<uint32_t> indexes = IndexesFromBitmap(bitmap); 169 | Roaring roaring = RoaringFromIndexes(indexes); 170 | const std::string bytes = RoaringToBytes(roaring); 171 | 172 | Bitmap64 decompressed(/*size=*/bitmap.bits()); 173 | while (state.KeepRunning()) { 174 | const Roaring roaring = RoaringFromBytes(bytes); 175 | for (RoaringSetBitForwardIterator it = roaring.begin(); it != roaring.end(); 176 | ++it) { 177 | decompressed.Set(*it, true); 178 | } 179 | } 180 | } 181 | BENCHMARK(BM_RoaringDecompressToBitmap); 182 | 183 | // **** ZSTD benchmarks **** 184 | 185 | void BM_ZstdCompressBitmapBytes(benchmark::State& state) { 186 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 187 | const std::string bitmap_bytes = SerializeBitmap(bitmap); 188 | 189 | while (state.KeepRunning()) { 190 | benchmark::DoNotOptimize(Compress(bitmap_bytes)); 191 | } 192 | } 193 | BENCHMARK(BM_ZstdCompressBitmapBytes); 194 | 195 | void BM_ZstdDecompressBitmapBytes(benchmark::State& state) { 196 | const Bitmap64 bitmap = ReadBitmapFromFile(absl::GetFlag(FLAGS_path)); 197 | const std::string bitmap_bytes = SerializeBitmap(bitmap); 198 | const std::string zstd_bytes = Compress(bitmap_bytes); 199 | 200 | while (state.KeepRunning()) { 201 | benchmark::DoNotOptimize(Compress(zstd_bytes)); // FIXME: this re-compresses the already compressed bytes; a decompression benchmark should call the zstd decompression counterpart of Compress() from evaluation_utils.h. 202 | } 203 | } 204 | BENCHMARK(BM_ZstdDecompressBitmapBytes); 205 | 206 | } // namespace 207 | } // namespace ci 208 | -------------------------------------------------------------------------------- /cmake/README.md: -------------------------------------------------------------------------------- 1 | # CMake support 2 | 3 | **NOTE** CMake support is community-based. The maintainers do not use CMake internally. 
4 | 5 | ## Including Cuckoo Index in your CMake based project 6 | 7 | You can include Cuckoo Index in your own CMake based project like this: 8 | ``` cmake 9 | include(FetchContent) 10 | set(CUCKOOINDEX_BUILD_TESTS OFF) 11 | set(CUCKOOINDEX_BUILD_BENCHMARKS OFF) 12 | FetchContent_Declare( 13 | cuckooindex 14 | GIT_REPOSITORY "https://github.com/google/cuckoo-index.git" 15 | ) 16 | FetchContent_MakeAvailable(cuckooindex) 17 | FetchContent_GetProperties(cuckooindex SOURCE_DIR CUCKOOINDEX_INCLUDE_DIR) 18 | include_directories(${CUCKOOINDEX_INCLUDE_DIR}) 19 | 20 | target_link_libraries(your_target cuckoo_index) 21 | ``` -------------------------------------------------------------------------------- /cmake/absl.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | set(BUILD_TESTING OFF) 9 | set(ABSL_ENABLE_INSTALL ON) 10 | set(ABSL_USE_EXTERNAL_GOOGLETEST ON) 11 | FetchContent_Declare( 12 | absl 13 | GIT_REPOSITORY "https://github.com/abseil/abseil-cpp.git" 14 | GIT_TAG df3ea785d8c30a9503321a3d35ee7d35808f190d 15 | PATCH_COMMAND "" 16 | ) 17 | FetchContent_MakeAvailable(absl) 18 | FetchContent_GetProperties(absl SOURCE_DIR ABSL_INCLUDE_DIR) 19 | include_directories(${ABSL_INCLUDE_DIR}) 20 | -------------------------------------------------------------------------------- /cmake/benchmark.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | set(BENCHMARK_ENABLE_GTEST_TESTS OFF CACHE BOOL "" FORCE) 9 | set(BENCHMARK_ENABLE_TESTING OFF CACHE BOOL "" FORCE) 10 | 
FetchContent_Declare( 11 | benchmark 12 | GIT_REPOSITORY "https://github.com/google/benchmark.git" 13 | GIT_TAG 090faecb454fbd6e6e17a75ef8146acb037118d4 14 | PATCH_COMMAND "" 15 | ) 16 | FetchContent_MakeAvailable(benchmark) 17 | FetchContent_GetProperties(benchmark SOURCE_DIR BENCHMARK_INCLUDE_DIR) 18 | include_directories(${BENCHMARK_INCLUDE_DIR}) -------------------------------------------------------------------------------- /cmake/benchmarks.cmake: -------------------------------------------------------------------------------- 1 | include(benchmark) 2 | 3 | add_executable(build_benchmark "${PROJECT_SOURCE_DIR}/build_benchmark.cc") 4 | target_link_libraries(build_benchmark 5 | cuckoo_index 6 | cuckoo_utils 7 | index_structure 8 | per_stripe_bloom 9 | per_stripe_xor 10 | common_profiling 11 | absl::flags 12 | absl::flags_parse 13 | benchmark 14 | gtest 15 | ) 16 | 17 | add_executable(lookup_benchmark "${PROJECT_SOURCE_DIR}/lookup_benchmark.cc") 18 | target_link_libraries(lookup_benchmark 19 | cuckoo_index 20 | cuckoo_utils 21 | index_structure 22 | per_stripe_bloom 23 | per_stripe_xor 24 | absl::flags 25 | absl::flags_parse 26 | benchmark 27 | gtest 28 | ) -------------------------------------------------------------------------------- /cmake/boost.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | # This is a different version than the one used in the bazel BUILD as it is the first one to support FetchContent_Declare 9 | # The version in the bazel BUILD can't be updated to this version, though, as boost math is currently in a broken state there 10 | FetchContent_Declare( 11 | boost 12 | GIT_REPOSITORY "https://github.com/boostorg/boost.git" 13 | GIT_TAG boost-1.77.0 14 | PATCH_COMMAND cd <SOURCE_DIR>/libs/math && git 
checkout v1.77-standalone # Boost math is broken in 1.77.0, apply fixed patch 15 | ) 16 | FetchContent_MakeAvailable(boost) 17 | FetchContent_GetProperties(boost SOURCE_DIR BOOST_INCLUDE_DIR) 18 | include_directories(${BOOST_INCLUDE_DIR}) 19 | -------------------------------------------------------------------------------- /cmake/common.cmake: -------------------------------------------------------------------------------- 1 | add_library(common_byte_coding "${PROJECT_SOURCE_DIR}/common/byte_coding.h") 2 | target_link_libraries(common_byte_coding 3 | absl::strings 4 | absl::span 5 | libprotobuf-lite 6 | ) 7 | 8 | add_library(common_bit_packing "${PROJECT_SOURCE_DIR}/common/bit_packing.h") 9 | target_link_libraries(common_bit_packing 10 | common_byte_coding 11 | absl::core_headers 12 | absl::endian 13 | absl::span 14 | ) 15 | 16 | add_library(common_bitmap "${PROJECT_SOURCE_DIR}/common/bitmap.h") 17 | target_link_libraries(common_bitmap 18 | absl::strings 19 | Boost::dynamic_bitset 20 | ) 21 | 22 | add_library(common_profiling "${PROJECT_SOURCE_DIR}/common/profiling.cc" "${PROJECT_SOURCE_DIR}/common/profiling.h") 23 | target_link_libraries(common_profiling 24 | absl::flat_hash_map 25 | absl::time 26 | ) 27 | 28 | add_library(common_rle_bitmap "${PROJECT_SOURCE_DIR}/common/rle_bitmap.cc" "${PROJECT_SOURCE_DIR}/common/rle_bitmap.h") 29 | target_link_libraries(common_rle_bitmap 30 | common_bit_packing 31 | common_bitmap 32 | absl::strings 33 | ) 34 | -------------------------------------------------------------------------------- /cmake/croaring.cmake: -------------------------------------------------------------------------------- 1 | include(ExternalProject) 2 | find_package(Git REQUIRED) 3 | 4 | # Need to use ExternalProject_Add to use custom BUILD_COMMAND 5 | ExternalProject_Add( 6 | croaring_src 7 | PREFIX "_deps/croaring" 8 | GIT_REPOSITORY "https://github.com/lemire/CroaringUnityBuild.git" 9 | GIT_TAG c1d1a754faa6451436efaffa3fe449edc7710b65 10 | TIMEOUT 10 11 | 
CONFIGURE_COMMAND "" 12 | UPDATE_COMMAND "" 13 | INSTALL_COMMAND "" 14 | BUILD_ALWAYS OFF 15 | BUILD_COMMAND ${CMAKE_CXX_COMPILER} -c <SOURCE_DIR>/roaring.c -o libcroaring.o 16 | COMMAND ${CMAKE_COMMAND} -E copy <BINARY_DIR>/libcroaring.o <INSTALL_DIR>/lib/libcroaring.o 17 | ) 18 | 19 | ExternalProject_Get_Property(croaring_src install_dir) 20 | set(CROARING_INCLUDE_DIR ${install_dir}/src/croaring_src) 21 | set(CROARING_LIBRARY_PATH ${install_dir}/lib/libcroaring.o) 22 | file(MAKE_DIRECTORY ${CROARING_INCLUDE_DIR}) 23 | add_library(croaring STATIC IMPORTED) 24 | set_property(TARGET croaring PROPERTY IMPORTED_LOCATION ${CROARING_LIBRARY_PATH}) 25 | set_property(TARGET croaring APPEND PROPERTY INTERFACE_INCLUDE_DIRECTORIES ${CROARING_INCLUDE_DIR}) 26 | 27 | add_dependencies(croaring croaring_src) -------------------------------------------------------------------------------- /cmake/csvparser.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | FetchContent_Declare( 9 | csv-parser 10 | GIT_REPOSITORY "https://github.com/vincentlaucsb/csv-parser.git" 11 | GIT_TAG 6fb1f43ad43fc7962baa3b0fe524b282a56ae4b0 12 | PATCH_COMMAND "" 13 | ) 14 | FetchContent_MakeAvailable(csv-parser) 15 | FetchContent_GetProperties(csv-parser SOURCE_DIR CSV_PARSER_INCLUDE_DIR) 16 | add_library(csv-parser INTERFACE) 17 | target_include_directories(csv-parser INTERFACE ${CSV_PARSER_INCLUDE_DIR}) 18 | -------------------------------------------------------------------------------- /cmake/cuckooindex.cmake: -------------------------------------------------------------------------------- 1 | # Dependencies 2 | include(absl) 3 | include(boost) 4 | include(croaring) 5 | include(csvparser) 6 | include(leveldb) 7 | include(protobuf) 8 | include(xor_singleheader) 9 | 10 | 
include_directories(${PROJECT_SOURCE_DIR}) 11 | include(common) 12 | 13 | # Get Protobuf include dirs 14 | get_target_property(protobuf_dirs libprotobuf INTERFACE_INCLUDE_DIRECTORIES) 15 | foreach(dir IN LISTS protobuf_dirs) 16 | if ("${dir}" MATCHES "BUILD_INTERFACE") 17 | list(APPEND PROTO_DIRS "--proto_path=${dir}") 18 | endif() 19 | endforeach() 20 | 21 | # Generate Protobuf cpp sources 22 | set(PROTO_HDRS) 23 | set(PROTO_SRCS) 24 | file(GLOB PROTO_FILES "${PROJECT_SOURCE_DIR}/*.proto") 25 | 26 | foreach(PROTO_FILE IN LISTS PROTO_FILES) 27 | get_filename_component(PROTO_NAME ${PROTO_FILE} NAME_WE) 28 | set(PROTO_HDR ${PROJECT_SOURCE_DIR}/${PROTO_NAME}.pb.h) 29 | set(PROTO_SRC ${PROJECT_SOURCE_DIR}/${PROTO_NAME}.pb.cc) 30 | 31 | add_custom_command( 32 | OUTPUT ${PROTO_SRC} ${PROTO_HDR} 33 | COMMAND protoc 34 | "--proto_path=${PROJECT_SOURCE_DIR}" 35 | ${PROTO_DIRS} 36 | "--cpp_out=${PROJECT_SOURCE_DIR}" 37 | ${PROTO_FILE} 38 | DEPENDS ${PROTO_FILE} protoc 39 | COMMENT "Generate C++ protocol buffer for ${PROTO_FILE}" 40 | VERBATIM) 41 | list(APPEND PROTO_HDRS ${PROTO_HDR}) 42 | list(APPEND PROTO_SRCS ${PROTO_SRC}) 43 | endforeach() 44 | 45 | add_library(evaluation_cc_proto ${PROTO_SRCS} ${PROTO_HDRS}) 46 | target_link_libraries(evaluation_cc_proto libprotobuf) 47 | 48 | add_library(data "${PROJECT_SOURCE_DIR}/data.cc" "${PROJECT_SOURCE_DIR}/data.h") 49 | target_link_libraries(data 50 | evaluation_utils 51 | common_byte_coding 52 | Boost::math 53 | Boost::multiprecision 54 | absl::flat_hash_set 55 | absl::memory 56 | absl::random_random 57 | absl::strings 58 | csv-parser 59 | ) 60 | 61 | add_library(per_stripe_bloom "${PROJECT_SOURCE_DIR}/per_stripe_bloom.h") 62 | target_link_libraries(per_stripe_bloom 63 | data 64 | evaluation_utils 65 | index_structure 66 | absl::strings 67 | leveldb 68 | ) 69 | 70 | add_library(xor_filter "${PROJECT_SOURCE_DIR}/xor_filter.h") 71 | target_link_libraries(xor_filter 72 | absl::strings 73 | xor_singleheader 74 | ) 75 | 76 | 
add_library(per_stripe_xor "${PROJECT_SOURCE_DIR}/per_stripe_xor.h") 77 | target_link_libraries(per_stripe_xor 78 | data 79 | evaluation_utils 80 | index_structure 81 | absl::strings 82 | xor_singleheader 83 | ) 84 | 85 | add_library(cuckoo_index "${PROJECT_SOURCE_DIR}/cuckoo_index.cc" "${PROJECT_SOURCE_DIR}/cuckoo_index.h") 86 | target_link_libraries(cuckoo_index 87 | cuckoo_kicker 88 | cuckoo_utils 89 | evaluation_utils 90 | fingerprint_store 91 | index_structure 92 | common_byte_coding 93 | common_profiling 94 | common_rle_bitmap 95 | absl::flat_hash_map 96 | absl::memory 97 | absl::strings 98 | ) 99 | 100 | add_library(cuckoo_kicker "${PROJECT_SOURCE_DIR}/cuckoo_kicker.cc" "${PROJECT_SOURCE_DIR}/cuckoo_kicker.h") 101 | target_link_libraries(cuckoo_kicker 102 | cuckoo_utils 103 | absl::flat_hash_map 104 | absl::random_random 105 | ) 106 | 107 | add_library(evaluation_utils "${PROJECT_SOURCE_DIR}/evaluation_utils.cc" "${PROJECT_SOURCE_DIR}/evaluation_utils.h") 108 | target_link_libraries(evaluation_utils 109 | evaluation_cc_proto 110 | common_bitmap 111 | common_rle_bitmap 112 | croaring 113 | absl::memory 114 | absl::strings 115 | Boost::iostreams 116 | ) 117 | 118 | add_library(cuckoo_utils "${PROJECT_SOURCE_DIR}/cuckoo_utils.cc" "${PROJECT_SOURCE_DIR}/cuckoo_utils.h") 119 | target_link_libraries(cuckoo_utils 120 | common_bit_packing 121 | common_bitmap 122 | common_byte_coding 123 | croaring 124 | absl::flat_hash_set 125 | absl::city 126 | absl::memory 127 | absl::strings 128 | absl::str_format 129 | ) 130 | 131 | add_library(fingerprint_store "${PROJECT_SOURCE_DIR}/fingerprint_store.cc" "${PROJECT_SOURCE_DIR}/fingerprint_store.h") 132 | target_link_libraries(fingerprint_store 133 | cuckoo_utils 134 | evaluation_utils 135 | common_bitmap 136 | common_rle_bitmap 137 | absl::flat_hash_map 138 | absl::strings 139 | ) 140 | 141 | add_library(index_structure "${PROJECT_SOURCE_DIR}/index_structure.h") 142 | target_link_libraries(index_structure 143 | data 144 | 
evaluation_cc_proto 145 | ) 146 | 147 | add_library(zone_map "${PROJECT_SOURCE_DIR}/zone_map.h") 148 | target_link_libraries(zone_map 149 | data 150 | evaluation_utils 151 | index_structure 152 | absl::memory 153 | absl::strings 154 | ) 155 | 156 | add_library(evaluator "${PROJECT_SOURCE_DIR}/evaluator.cc" "${PROJECT_SOURCE_DIR}/evaluator.h") 157 | target_link_libraries(evaluator 158 | data 159 | evaluation_cc_proto 160 | index_structure 161 | absl::random_random 162 | absl::str_format 163 | ) 164 | 165 | add_executable(evaluate "${PROJECT_SOURCE_DIR}/evaluate.cc") 166 | target_link_libraries(evaluate 167 | cuckoo_index 168 | cuckoo_utils 169 | data 170 | evaluation_cc_proto 171 | evaluation_utils 172 | evaluator 173 | index_structure 174 | per_stripe_bloom 175 | per_stripe_xor 176 | zone_map 177 | absl::flags 178 | absl::flags_parse 179 | absl::memory 180 | absl::strings 181 | absl::str_format 182 | ) 183 | -------------------------------------------------------------------------------- /cmake/googletest.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | set(BUILD_GMOCK ON CACHE BOOL "" FORCE) 9 | set(BUILD_GTEST OFF CACHE BOOL "" FORCE) 10 | FetchContent_Declare( 11 | gtest 12 | GIT_REPOSITORY "https://github.com/google/googletest.git" 13 | GIT_TAG 10b1902d893ea8cc43c69541d70868f91af3646b 14 | PATCH_COMMAND "" 15 | ) 16 | FetchContent_MakeAvailable(gtest) 17 | FetchContent_GetProperties(gtest SOURCE_DIR GTEST_INCLUDE_DIR) 18 | include_directories(${GTEST_INCLUDE_DIR}/googlemock/include) 19 | include_directories(${GTEST_INCLUDE_DIR}/googletest/include) -------------------------------------------------------------------------------- /cmake/leveldb.cmake: 
-------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | SET(LEVELDB_BUILD_TESTS OFF CACHE BOOL "" FORCE) 9 | SET(LEVELDB_BUILD_BENCHMARKS OFF CACHE BOOL "" FORCE) 10 | SET(LEVELDB_INSTALL ON CACHE BOOL "" FORCE) 11 | FetchContent_Declare( 12 | leveldb 13 | GIT_REPOSITORY "https://github.com/google/leveldb.git" 14 | GIT_TAG 78b39d68c15ba020c0d60a3906fb66dbf1697595 15 | PATCH_COMMAND "" 16 | ) 17 | FetchContent_MakeAvailable(leveldb) 18 | FetchContent_GetProperties(leveldb SOURCE_DIR LEVELDB_INCLUDE_DIR) 19 | include_directories(${LEVELDB_INCLUDE_DIR}) -------------------------------------------------------------------------------- /cmake/protobuf.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | SET(protobuf_BUILD_TESTS OFF CACHE BOOL "" FORCE) 9 | SET(protobuf_BUILD_CONFORMANCE OFF CACHE BOOL "" FORCE) 10 | SET(protobuf_BUILD_EXAMPLES OFF CACHE BOOL "" FORCE) 11 | set(protobuf_BUILD_EXPORT OFF CACHE BOOL "" FORCE) 12 | SET(protobuf_WITH_ZLIB OFF CACHE BOOL "" FORCE) 13 | FetchContent_Declare( 14 | protobuf 15 | GIT_REPOSITORY "https://github.com/protocolbuffers/protobuf.git" 16 | GIT_TAG v3.15.6 17 | PATCH_COMMAND "" 18 | SOURCE_SUBDIR cmake 19 | ) 20 | FetchContent_MakeAvailable(protobuf) 21 | FetchContent_GetProperties(protobuf SOURCE_DIR PROTOBUF_INCLUDE_DIR) 22 | SET(PROTOBUF_INCLUDE_DIR "${PROTOBUF_INCLUDE_DIR}/src") 23 | include_directories(${PROTOBUF_INCLUDE_DIR}) -------------------------------------------------------------------------------- /cmake/tests.cmake: 
-------------------------------------------------------------------------------- 1 | include(googletest) 2 | 3 | add_executable(data_test "${PROJECT_SOURCE_DIR}/data_test.cc") 4 | target_link_libraries(data_test 5 | data 6 | gtest_main 7 | ) 8 | 9 | add_executable(per_stripe_bloom_test "${PROJECT_SOURCE_DIR}/per_stripe_bloom_test.cc") 10 | target_link_libraries(per_stripe_bloom_test 11 | per_stripe_bloom 12 | gtest_main 13 | ) 14 | 15 | add_executable(xor_filter_test "${PROJECT_SOURCE_DIR}/xor_filter_test.cc") 16 | target_link_libraries(xor_filter_test 17 | xor_filter 18 | gtest_main 19 | ) 20 | 21 | add_executable(per_stripe_xor_test "${PROJECT_SOURCE_DIR}/per_stripe_xor_test.cc") 22 | target_link_libraries(per_stripe_xor_test 23 | per_stripe_xor 24 | gtest_main 25 | ) 26 | 27 | add_executable(cuckoo_index_test "${PROJECT_SOURCE_DIR}/cuckoo_index_test.cc") 28 | target_link_libraries(cuckoo_index_test 29 | cuckoo_index 30 | gtest_main 31 | ) 32 | 33 | add_executable(cuckoo_kicker_test "${PROJECT_SOURCE_DIR}/cuckoo_kicker_test.cc") 34 | target_link_libraries(cuckoo_kicker_test 35 | cuckoo_kicker 36 | gtest_main 37 | ) 38 | 39 | add_executable(evaluation_utils_test "${PROJECT_SOURCE_DIR}/evaluation_utils_test.cc") 40 | target_link_libraries(evaluation_utils_test 41 | evaluation_utils 42 | gtest_main 43 | ) 44 | 45 | add_executable(cuckoo_utils_test "${PROJECT_SOURCE_DIR}/cuckoo_utils_test.cc") 46 | target_link_libraries(cuckoo_utils_test 47 | cuckoo_utils 48 | evaluation_utils 49 | gtest_main 50 | ) 51 | 52 | add_executable(fingerprint_store_test "${PROJECT_SOURCE_DIR}/fingerprint_store_test.cc") 53 | target_link_libraries(fingerprint_store_test 54 | fingerprint_store 55 | gtest_main 56 | ) 57 | 58 | add_executable(zone_map_test "${PROJECT_SOURCE_DIR}/zone_map_test.cc") 59 | target_link_libraries(zone_map_test 60 | zone_map 61 | gtest_main 62 | ) 63 | 64 | add_executable(bitmap_benchmark_test "${PROJECT_SOURCE_DIR}/bitmap_benchmark_test.cc") 65 | 
target_link_libraries(bitmap_benchmark_test 66 | evaluation_utils 67 | common_bitmap 68 | common_rle_bitmap 69 | croaring 70 | absl::flags 71 | absl::memory 72 | absl::strings 73 | benchmark 74 | gtest_main 75 | ) -------------------------------------------------------------------------------- /cmake/xor_singleheader.cmake: -------------------------------------------------------------------------------- 1 | include(FetchContent) 2 | set(FETCHCONTENT_QUIET ON) 3 | set(FETCHCONTENT_UPDATES_DISCONNECTED ON) 4 | set(BUILD_SHARED_LIBS OFF) 5 | set(CMAKE_POSITION_INDEPENDENT_CODE ON) 6 | find_package(Git REQUIRED) 7 | 8 | FetchContent_Declare( 9 | xor_singleheader 10 | GIT_REPOSITORY "https://github.com/FastFilter/xor_singleheader.git" 11 | GIT_TAG 6cea6a4dcf2f18a0e3b9b9e0b94d6012b804ffa1 12 | PATCH_COMMAND "" 13 | ) 14 | FetchContent_MakeAvailable(xor_singleheader) 15 | FetchContent_GetProperties(xor_singleheader SOURCE_DIR XOR_SINGLEHEADER_INCLUDE_DIR) 16 | add_library(xor_singleheader INTERFACE) 17 | target_include_directories(xor_singleheader INTERFACE ${XOR_SINGLEHEADER_INCLUDE_DIR}) 18 | -------------------------------------------------------------------------------- /common/BUILD.bazel: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | 15 | load("@rules_cc//cc:defs.bzl", "cc_binary", "cc_library") 16 | 17 | package(default_visibility = ["//visibility:public"]) 18 | 19 | licenses(["notice"]) # Apache 2.0 20 | 21 | cc_library( 22 | name = "byte_coding", 23 | hdrs = ["byte_coding.h"], 24 | deps = [ 25 | "@com_google_absl//absl/strings", 26 | "@com_google_absl//absl/types:span", 27 | "@com_google_protobuf//:protobuf_lite", 28 | ], 29 | ) 30 | 31 | cc_test( 32 | name = "byte_coding_test", 33 | srcs = ["byte_coding_test.cc"], 34 | deps = [ 35 | ":byte_coding", 36 | "@com_google_absl//absl/strings", 37 | "@com_google_absl//absl/types:span", 38 | "@com_google_googletest//:gtest_main", 39 | "@com_google_protobuf//:protobuf_lite", 40 | ], 41 | ) 42 | 43 | cc_library( 44 | name = "bit_packing", 45 | hdrs = ["bit_packing.h"], 46 | deps = [ 47 | ":byte_coding", 48 | "@com_google_absl//absl/base:core_headers", 49 | "@com_google_absl//absl/base:endian", 50 | "@com_google_absl//absl/types:span", 51 | ], 52 | ) 53 | 54 | cc_test( 55 | name = "bit_packing_test", 56 | size = "small", 57 | srcs = ["bit_packing_test.cc"], 58 | deps = [ 59 | ":bit_packing", 60 | "@com_google_googletest//:gtest_main", 61 | ], 62 | ) 63 | 64 | cc_binary( 65 | name = "bit_packing_benchmark", 66 | srcs = ["bit_packing_benchmark.cc"], 67 | deps = [ 68 | ":bit_packing", 69 | "@com_google_benchmark//:benchmark_main", 70 | ], 71 | ) 72 | 73 | cc_library( 74 | name = "bitmap", 75 | hdrs = ["bitmap.h"], 76 | deps = [ 77 | "@boost//:dynamic_bitset", 78 | "@com_google_absl//absl/strings", 79 | ], 80 | ) 81 | 82 | cc_library( 83 | name = "profiling", 84 | srcs = ["profiling.cc"], 85 | hdrs = ["profiling.h"], 86 | deps = [ 87 | "@com_google_absl//absl/container:flat_hash_map", 88 | "@com_google_absl//absl/time", 89 | ], 90 | ) 91 | 92 | cc_library( 93 | name = "rle_bitmap", 94 | srcs = ["rle_bitmap.cc"], 95 | hdrs = ["rle_bitmap.h"], 96 | deps = [ 97 | ":bit_packing", 98 | ":bitmap", 99 | "@com_google_absl//absl/strings", 100 | ], 101 | ) 102 
| 103 | cc_test( 104 | name = "rle_bitmap_test", 105 | srcs = ["rle_bitmap_test.cc"], 106 | deps = [ 107 | ":bitmap", 108 | ":rle_bitmap", 109 | "@com_google_googletest//:gtest_main", 110 | ], 111 | ) 112 | -------------------------------------------------------------------------------- /common/bit_packing_benchmark.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: bit_packing_benchmark.cc 17 | // ----------------------------------------------------------------------------- 18 | // 19 | // Benchmarks for our bit-packing methods. 
20 | // 21 | // To run the benchmarks call (turn off dynamic linking): 22 | // bazel run -c opt --dynamic_mode=off common:bit_packing_benchmark 23 | // 24 | // Run on (12 X 4500 MHz CPU s) 25 | // CPU Caches: 26 | // L1 Data 32 KiB (x6) 27 | // L1 Instruction 32 KiB (x6) 28 | // L2 Unified 1024 KiB (x6) 29 | // L3 Unified 8448 KiB (x1) 30 | // Load Average: 1.44, 2.07, 1.93 31 | // -------------------------------------------------------------------- 32 | // Benchmark Time CPU Iterations 33 | // -------------------------------------------------------------------- 34 | // BM_BitPack32_Zeros 0.000 ns 0.000 ns 1000000000 35 | // BM_BitPack32_1Bit 1.05 ns 1.04 ns 655400000 36 | // BM_BitPack32_7Bits 1.04 ns 1.04 ns 672100000 37 | // BM_BitPack32_15Bits 1.04 ns 1.04 ns 673200000 38 | // BM_BitPack32_31Bits 1.05 ns 1.05 ns 665700000 39 | // BM_Read_Zeros 0.722 ns 0.722 ns 987300000 40 | // BM_Read_1Bit 0.752 ns 0.752 ns 948200000 41 | // BM_Read_7Bits 0.748 ns 0.748 ns 945300000 42 | // BM_Read_15Bits 0.742 ns 0.742 ns 956000000 43 | // BM_Read_31Bits 0.764 ns 0.764 ns 953600000 44 | // BM_Read_32Bits 0.723 ns 0.723 ns 960900000 45 | // BM_BatchRead_Zeros 0.250 ns 0.250 ns 1000000000 46 | // BM_BatchRead_1Bit 0.357 ns 0.357 ns 1000000000 47 | // BM_BatchRead_7Bits 0.472 ns 0.472 ns 1000000000 48 | // BM_BatchRead_15Bits 0.513 ns 0.513 ns 1000000000 49 | // BM_BatchRead_31Bits 0.685 ns 0.685 ns 775800000 50 | // BM_BatchRead_32Bits 0.723 ns 0.723 ns 1000000000 51 | // BM_BatchRead_6Bits_64Vals 0.378 ns 0.378 ns 1000000000 52 | // BM_BatchRead_6Bits_31Vals 0.852 ns 0.852 ns 799455187 53 | 54 | #include 55 | #include 56 | 57 | #include "benchmark/benchmark.h" 58 | #include "common/bit_packing.h" 59 | 60 | namespace ci { 61 | namespace { 62 | 63 | // Helper method used in benchmarks to check the time it takes to create 64 | // a bit-packed array with the given number of entries of the given value. 
65 | void CheckStoreBitPacked32(benchmark::State& state, int size, uint32_t value) { 66 | const std::vector<uint32_t> vec(size, value); 67 | const int bw = BitWidth(value); 68 | ByteBuffer buffer; 69 | 70 | while (state.KeepRunningBatch(size)) { 71 | buffer.set_pos(0); 72 | StoreBitPacked<uint32_t>(vec, bw, &buffer); 73 | } 74 | BitPackedReader<uint32_t> reader(bw, buffer.data()); 75 | assert(value == reader.Get(0)); 76 | } 77 | 78 | constexpr int kArraySize = 100 * 1000; 79 | 80 | static void BM_BitPack32_Zeros(benchmark::State& state) { 81 | CheckStoreBitPacked32(state, kArraySize, 0); 82 | } 83 | BENCHMARK(BM_BitPack32_Zeros); 84 | 85 | static void BM_BitPack32_1Bit(benchmark::State& state) { 86 | CheckStoreBitPacked32(state, kArraySize, 1); 87 | } 88 | BENCHMARK(BM_BitPack32_1Bit); 89 | 90 | static void BM_BitPack32_7Bits(benchmark::State& state) { 91 | CheckStoreBitPacked32(state, kArraySize, 127); 92 | } 93 | BENCHMARK(BM_BitPack32_7Bits); 94 | 95 | static void BM_BitPack32_15Bits(benchmark::State& state) { 96 | CheckStoreBitPacked32(state, kArraySize, (1UL << 15) - 1); 97 | } 98 | BENCHMARK(BM_BitPack32_15Bits); 99 | 100 | static void BM_BitPack32_31Bits(benchmark::State& state) { 101 | CheckStoreBitPacked32(state, kArraySize, (1UL << 31) - 1); 102 | } 103 | BENCHMARK(BM_BitPack32_31Bits); 104 | 105 | // Helper method used in benchmarks to check the time it takes to read 106 | // a bit-packed array with the given number of entries of the given value. 
107 | void ReadBitPacked32(benchmark::State& state, int size, uint32_t value) { 108 | const std::vector<uint32_t> vec(size, value); 109 | const int bw = BitWidth(value); 110 | ByteBuffer buffer; 111 | StoreBitPacked<uint32_t>(vec, bw, &buffer); 112 | 113 | while (state.KeepRunningBatch(size)) { 114 | BitPackedReader<uint32_t> reader(bw, buffer.data()); 115 | for (int i = 0; i < size; ++i) benchmark::DoNotOptimize(reader.Get(i)); 116 | } 117 | } 118 | 119 | static void BM_Read_Zeros(benchmark::State& state) { 120 | ReadBitPacked32(state, kArraySize, 0); 121 | } 122 | BENCHMARK(BM_Read_Zeros); 123 | 124 | static void BM_Read_1Bit(benchmark::State& state) { 125 | ReadBitPacked32(state, kArraySize, 1); 126 | } 127 | BENCHMARK(BM_Read_1Bit); 128 | 129 | static void BM_Read_7Bits(benchmark::State& state) { 130 | ReadBitPacked32(state, kArraySize, 127); 131 | } 132 | BENCHMARK(BM_Read_7Bits); 133 | 134 | static void BM_Read_15Bits(benchmark::State& state) { 135 | ReadBitPacked32(state, kArraySize, (1UL << 15) - 1); 136 | } 137 | BENCHMARK(BM_Read_15Bits); 138 | 139 | static void BM_Read_31Bits(benchmark::State& state) { 140 | ReadBitPacked32(state, kArraySize, (1UL << 31) - 1); 141 | } 142 | BENCHMARK(BM_Read_31Bits); 143 | 144 | static void BM_Read_32Bits(benchmark::State& state) { 145 | ReadBitPacked32(state, kArraySize, std::numeric_limits<uint32_t>::max()); 146 | } 147 | BENCHMARK(BM_Read_32Bits); 148 | 149 | // Helper method used in benchmarks to check the time it takes to read 150 | // a bit-packed array with the given number of entries of the given value. 
151 | void BatchReadBitPacked32(benchmark::State& state, int size, uint32_t value) { 152 | const std::vector<uint32_t> vec(size, value); 153 | const int bw = BitWidth(value); 154 | ByteBuffer buffer; 155 | StoreBitPacked<uint32_t>(vec, bw, &buffer); 156 | 157 | std::vector<uint32_t> batch(size); 158 | while (state.KeepRunningBatch(size)) { 159 | BitPackedReader<uint32_t> reader(bw, buffer.data()); 160 | reader.GetBatch(size, [&](size_t i, uint32_t value) { batch[i] = value; }); 161 | } 162 | assert(batch[0] == value); 163 | } 164 | 165 | static void BM_BatchRead_Zeros(benchmark::State& state) { 166 | BatchReadBitPacked32(state, kArraySize, 0); 167 | } 168 | BENCHMARK(BM_BatchRead_Zeros); 169 | 170 | static void BM_BatchRead_1Bit(benchmark::State& state) { 171 | BatchReadBitPacked32(state, kArraySize, 1); 172 | } 173 | BENCHMARK(BM_BatchRead_1Bit); 174 | 175 | static void BM_BatchRead_7Bits(benchmark::State& state) { 176 | BatchReadBitPacked32(state, kArraySize, 127); 177 | } 178 | BENCHMARK(BM_BatchRead_7Bits); 179 | 180 | static void BM_BatchRead_15Bits(benchmark::State& state) { 181 | BatchReadBitPacked32(state, kArraySize, (1UL << 15) - 1); 182 | } 183 | BENCHMARK(BM_BatchRead_15Bits); 184 | 185 | static void BM_BatchRead_31Bits(benchmark::State& state) { 186 | BatchReadBitPacked32(state, kArraySize, (1UL << 31) - 1); 187 | } 188 | BENCHMARK(BM_BatchRead_31Bits); 189 | 190 | static void BM_BatchRead_32Bits(benchmark::State& state) { 191 | BatchReadBitPacked32(state, kArraySize, std::numeric_limits<uint32_t>::max()); 192 | } 193 | BENCHMARK(BM_BatchRead_32Bits); 194 | 195 | static void BM_BatchRead_6Bits_64Vals(benchmark::State& state) { 196 | BatchReadBitPacked32(state, 64, 63); 197 | } 198 | BENCHMARK(BM_BatchRead_6Bits_64Vals); 199 | 200 | static void BM_BatchRead_6Bits_31Vals(benchmark::State& state) { 201 | BatchReadBitPacked32(state, 31, 63); 202 | } 203 | BENCHMARK(BM_BatchRead_6Bits_31Vals); 204 | 205 | } // namespace 206 | } // namespace ci 207 | 
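The benchmarks above time storing and reading arrays in which every entry occupies the same number of bits. As a rough illustration of that idea, here is a minimal, self-contained sketch. It is not the code in `common/bit_packing.h`, which is not shown in this listing: the `sketch` namespace and the `Pack` and `Get` helpers are hypothetical names, the bit-by-bit loops favour clarity over speed, and the least-significant-bit-first layout is an assumption made only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Number of bits needed to represent `value`; 0 for value == 0.
inline int BitWidth(uint32_t value) {
  int width = 0;
  while (value != 0) {
    ++width;
    value >>= 1;
  }
  return width;
}

// Packs every value into exactly `bit_width` bits, least-significant bit
// first (an assumed layout for this sketch only).
inline std::vector<uint8_t> Pack(const std::vector<uint32_t>& values,
                                 int bit_width) {
  const size_t total_bits = values.size() * static_cast<size_t>(bit_width);
  std::vector<uint8_t> bytes((total_bits + 7) / 8, 0);
  size_t bit_pos = 0;
  for (const uint32_t value : values) {
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      if ((value >> b) & 1u) {
        bytes[bit_pos / 8] |= static_cast<uint8_t>(1u << (bit_pos % 8));
      }
    }
  }
  return bytes;
}

// Reads the entry at `index` back out of a buffer produced by Pack().
inline uint32_t Get(const std::vector<uint8_t>& bytes, int bit_width,
                    size_t index) {
  uint32_t value = 0;
  size_t bit_pos = index * static_cast<size_t>(bit_width);
  for (int b = 0; b < bit_width; ++b, ++bit_pos) {
    if ((bytes[bit_pos / 8] >> (bit_pos % 8)) & 1u) value |= (1u << b);
  }
  return value;
}

}  // namespace sketch
```

With this sketch, eight values of at most 3 bits each, such as `{7, 2, 0, 1, 0, 4, 3, 0}`, pack into 24 bits, i.e. 3 bytes, and an all-zero array has bit width 0 and packs into 0 bytes.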
-------------------------------------------------------------------------------- /common/bit_packing_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: bit_packing_test.cc 17 | // ----------------------------------------------------------------------------- 18 | // 19 | // Tests for our bit-packing methods. 20 | 21 | #include "common/bit_packing.h" 22 | 23 | #include 24 | #include 25 | 26 | #include "gtest/gtest.h" 27 | 28 | namespace ci { 29 | namespace { 30 | 31 | TEST(BitPackingTest, BytesRequired) { 32 | EXPECT_EQ(0, BitPackingBytesRequired(0)); 33 | EXPECT_EQ(1, BitPackingBytesRequired(1)); 34 | EXPECT_EQ(1, BitPackingBytesRequired(8)); 35 | EXPECT_EQ(2, BitPackingBytesRequired(9)); 36 | EXPECT_EQ(33, BitPackingBytesRequired(257)); 37 | EXPECT_EQ(2305843009213693951, 38 | BitPackingBytesRequired(std::numeric_limits::max() - 7)); 39 | } 40 | 41 | TEST(BitPackingTest, BitWidth) { 42 | // 32 bit version. 
43 | EXPECT_EQ(0, BitWidth<uint32_t>(0)); 44 | EXPECT_EQ(1, BitWidth<uint32_t>(1)); 45 | EXPECT_EQ(2, BitWidth<uint32_t>(2)); 46 | EXPECT_EQ(2, BitWidth<uint32_t>(3)); 47 | EXPECT_EQ(3, BitWidth<uint32_t>(4)); 48 | EXPECT_EQ(3, BitWidth<uint32_t>(7)); 49 | EXPECT_EQ(8, BitWidth<uint32_t>(255)); 50 | EXPECT_EQ(9, BitWidth<uint32_t>(256)); 51 | EXPECT_EQ(32, BitWidth<uint32_t>(std::numeric_limits<uint32_t>::max())); 52 | 53 | // 64 bit version. 54 | EXPECT_EQ(0, BitWidth<uint64_t>(0)); 55 | EXPECT_EQ(1, BitWidth<uint64_t>(1)); 56 | EXPECT_EQ(2, BitWidth<uint64_t>(2)); 57 | EXPECT_EQ(2, BitWidth<uint64_t>(3)); 58 | EXPECT_EQ(3, BitWidth<uint64_t>(4)); 59 | EXPECT_EQ(3, BitWidth<uint64_t>(7)); 60 | EXPECT_EQ(8, BitWidth<uint64_t>(255)); 61 | EXPECT_EQ(9, BitWidth<uint64_t>(256)); 62 | EXPECT_EQ(32, BitWidth<uint64_t>(std::numeric_limits<uint32_t>::max())); 63 | EXPECT_EQ(33, 64 | BitWidth<uint64_t>(std::numeric_limits<uint32_t>::max() + 1ULL)); 65 | EXPECT_EQ(64, BitWidth<uint64_t>(std::numeric_limits<uint64_t>::max())); 66 | } 67 | 68 | TEST(BitPackingTest, MaxBitWidthOnEmptyArray) { 69 | const std::vector<uint32_t> empty32; 70 | const std::vector<uint64_t> empty64; 71 | 72 | EXPECT_EQ(0, MaxBitWidth<uint32_t>(empty32)); 73 | EXPECT_EQ(0, MaxBitWidth<uint64_t>(empty64)); 74 | } 75 | 76 | TEST(BitPackingTest, MaxBitWidthOnZeros) { 77 | const std::vector<uint32_t> zeros32{0, 0, 0, 0, 0}; 78 | const std::vector<uint64_t> zeros64{0ULL, 0ULL, 0ULL}; 79 | 80 | EXPECT_EQ(0, MaxBitWidth<uint32_t>(zeros32)); 81 | EXPECT_EQ(0, MaxBitWidth<uint64_t>(zeros64)); 82 | } 83 | 84 | TEST(BitPackingTest, MaxBitWidthOnSmallValues) { 85 | const std::vector<uint32_t> vec32{0, 1, 3, 0, 7}; 86 | const std::vector<uint64_t> vec64{0ULL, 127ULL, 0ULL}; 87 | 88 | EXPECT_EQ(3, MaxBitWidth<uint32_t>(vec32)); 89 | EXPECT_EQ(7, MaxBitWidth<uint64_t>(vec64)); 90 | } 91 | 92 | TEST(BitPackingTest, MaxBitWidthOnMaxValues) { 93 | const std::vector<uint32_t> vec32{0, 1, 3, 0, 94 | std::numeric_limits<uint32_t>::max()}; 95 | const std::vector<uint64_t> vec64{0ULL, 127ULL, 96 | std::numeric_limits<uint64_t>::max()}; 97 | 98 | EXPECT_EQ(32, MaxBitWidth<uint32_t>(vec32)); 99 | EXPECT_EQ(64, MaxBitWidth<uint64_t>(vec64)); 100 | } 101 | 102 | // Bit-packs the given array and then compares the result with the given array. 103 | // Also checks the size of the bit-packed array. 
104 | template <typename T> 105 | void CheckBitPack(absl::Span<const T> array, int bit_packed_size) { 106 | const int bw = MaxBitWidth(array); 107 | const size_t num_bytes = BitPackingBytesRequired(bw * array.size()); 108 | EXPECT_EQ(bit_packed_size, num_bytes); 109 | ByteBuffer buffer; 110 | StoreBitPacked(array, bw, &buffer); 111 | EXPECT_EQ(bit_packed_size, buffer.pos()); 112 | PutSlopBytes(&buffer); 113 | EXPECT_EQ(bit_packed_size + 8, buffer.pos()); 114 | 115 | BitPackedReader<T> reader(bw, buffer.data()); 116 | 117 | for (size_t i = 0; i < array.size(); ++i) EXPECT_EQ(array[i], reader.Get(i)); 118 | } 119 | 120 | TEST(BitPackingTest, BitPack32EmptyArray) { 121 | const std::vector<uint32_t> empty; 122 | CheckBitPack<uint32_t>(empty, 0); 123 | } 124 | 125 | TEST(BitPackingTest, BitPack32Zeros) { 126 | const std::vector<uint32_t> zeros{0, 0, 0, 0, 0, 0, 0, 0}; 127 | CheckBitPack<uint32_t>(zeros, 0); 128 | } 129 | 130 | TEST(BitPackingTest, BitPack32Bits) { 131 | const std::vector<uint32_t> bits{0, 1, 0, 1, 0, 0, 1, 1}; 132 | CheckBitPack<uint32_t>(bits, 1); 133 | } 134 | 135 | TEST(BitPackingTest, BitPack32SmallValues) { 136 | const std::vector<uint32_t> values{7, 2, 0, 1, 0, 4, 3, 0}; 137 | CheckBitPack<uint32_t>(values, 3); 138 | } 139 | 140 | TEST(BitPackingTest, BitPack32LargeValues) { 141 | const std::vector<uint32_t> values{0, 42, 142 | std::numeric_limits<uint32_t>::max() / 8, 143 | std::numeric_limits<uint32_t>::max() / 4}; 144 | CheckBitPack<uint32_t>(values, 15); 145 | } 146 | 147 | TEST(BitPackingTest, BitPack32RangeOfValues) { 148 | std::vector<uint32_t> values; 149 | for (uint32_t i = 0; i < (std::numeric_limits<uint32_t>::max() >> 4); 150 | i = (i + 1) * 2) 151 | values.push_back(i); 152 | CheckBitPack<uint32_t>(values, 98); 153 | } 154 | 155 | // Test bit-widths 0-32 with array-lengths 0-1023. 156 | TEST(BitPackingTest, BitPack32GetRange) { 157 | constexpr int kMaxLength = 1024; 158 | 159 | for (int bit_width = 0; bit_width <= 32; ++bit_width) { 160 | // Fill an array with values that take up to bit_width bits. 
161 | std::vector<uint32_t> src(kMaxLength); 162 | if (bit_width > 0) { 163 | for (int i = 0; i < kMaxLength; ++i) src[i] = 1 << (i % bit_width); 164 | } 165 | ASSERT_EQ(MaxBitWidth<uint32_t>(src), bit_width); 166 | ByteBuffer buffer; 167 | StoreBitPacked<uint32_t>(src, bit_width, &buffer); 168 | PutSlopBytes(&buffer); 169 | BitPackedReader<uint32_t> reader(bit_width, buffer.data()); 170 | 171 | // Check all array lengths from 0 to kMaxLength - 1. 172 | for (int length = 0; length < kMaxLength; ++length) { 173 | std::vector<uint32_t> result(length); 174 | reader.GetBatch(length, 175 | [&](size_t i, uint32_t value) { result[i] = value; }); 176 | ASSERT_EQ(result, absl::Span(src.data(), length)) 177 | << "bit_width: " << bit_width << ", length: " << length; 178 | // Also double-check with Get(..). 179 | for (int i = 0; i < length; i++) ASSERT_EQ(reader.Get(i), src[i]); 180 | } 181 | } 182 | } 183 | 184 | TEST(BitPackingTest, BitPack64EmptyArray) { 185 | const std::vector<uint64_t> empty; 186 | CheckBitPack<uint64_t>(empty, 0); 187 | } 188 | 189 | TEST(BitPackingTest, BitPack64Zeros) { 190 | const std::vector<uint64_t> zeros{0, 0, 0, 0, 0, 0, 0, 0}; 191 | CheckBitPack<uint64_t>(zeros, 0); 192 | } 193 | 194 | TEST(BitPackingTest, BitPack64Bits) { 195 | const std::vector<uint64_t> bits{0, 1, 0, 1, 0, 0, 1, 1}; 196 | CheckBitPack<uint64_t>(bits, 1); 197 | } 198 | 199 | TEST(BitPackingTest, BitPack64SmallValues) { 200 | const std::vector<uint64_t> values{7, 2, 0, 1, 0, 4, 3, 0}; 201 | CheckBitPack<uint64_t>(values, 3); 202 | } 203 | 204 | TEST(BitPackingTest, BitPack64LargeValues) { 205 | const std::vector<uint64_t> values{0, 42, 206 | std::numeric_limits<uint64_t>::max() / 8, 207 | std::numeric_limits<uint64_t>::max() / 4}; 208 | CheckBitPack<uint64_t>(values, 31); 209 | } 210 | 211 | TEST(BitPackingTest, BitPack64MaxValue) { 212 | const std::vector<uint64_t> values{0, std::numeric_limits<uint64_t>::max()}; 213 | CheckBitPack<uint64_t>(values, 16); 214 | } 215 | 216 | TEST(BitPackingTest, BitPack64RangeOfValues) { 217 | std::vector<uint64_t> values; 218 | for (uint64_t i = 0; i < (std::numeric_limits<uint64_t>::max() >> 4); 219 | i = (i + 1) * 2) 220 | values.push_back(i); 221 | 
CheckBitPack<uint64_t>(values, 450); 222 | } 223 | 224 | TEST(BitPackingTest, BitPack64ForPowersOf2) { 225 | for (int shift = 0; shift < 64; ++shift) { 226 | std::vector<uint64_t> values; 227 | const uint64_t val = 1ULL << shift; 228 | // Add 8 values: after at most 8 values of (shift + 1) bits each, the 229 | // bit offset at which a value starts within a byte repeats. 230 | for (int i = 0; i < 8; ++i) values.push_back(val); 231 | 232 | CheckBitPack<uint64_t>( 233 | values, static_cast<int>(BitPackingBytesRequired(8 * (shift + 1)))); 234 | } 235 | } 236 | 237 | TEST(BitPackingTest, EmptyBitPackedReaderDebugString) { 238 | constexpr int bit_width = 0; 239 | ByteBuffer buffer; 240 | StoreBitPacked({}, bit_width, &buffer); 241 | BitPackedReader reader(bit_width, buffer.data()); 242 | EXPECT_EQ(reader.DebugString(/*size=*/0), "size: 0, bit-width: 0, bytes: 0"); 243 | } 244 | 245 | TEST(BitPackingTest, NonEmptyBitPackedReaderDebugString) { 246 | constexpr int bit_width = 8; 247 | ByteBuffer buffer; 248 | StoreBitPacked({0, 1, 2, 255}, bit_width, &buffer); 249 | BitPackedReader reader(bit_width, buffer.data()); 250 | EXPECT_EQ(reader.DebugString(/*size=*/4), "size: 4, bit-width: 8, bytes: 4"); 251 | } 252 | 253 | } // namespace 254 | } // namespace ci 255 | -------------------------------------------------------------------------------- /common/bitmap.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: bitmap.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_COMMON_BITMAP_H_ 20 | #define CUCKOO_INDEX_COMMON_BITMAP_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | 28 | #include "absl/strings/string_view.h" 29 | #include "boost/dynamic_bitset.hpp" 30 | 31 | namespace ci { 32 | 33 | // Note: Rank implementation is adapted from SuRF: 34 | // https://github.com/efficient/SuRF/blob/master/include/rank.hpp 35 | // Like SuRF, we precompute the ranks of bit-blocks of size `kRankBlockSize` 36 | // (512 by default), which adds around 6% size overhead (32 bits per block). 37 | 38 | // Number of bits in a rank block. 39 | static constexpr size_t kRankBlockSize = 512; 40 | 41 | class Bitmap64; 42 | using Bitmap64Ptr = std::unique_ptr<Bitmap64>; 43 | 44 | class Bitmap64 { 45 | public: 46 | static Bitmap64 GetGlobalBitmap(const std::vector<Bitmap64Ptr>& bitmaps) { 47 | size_t num_bits = 0; 48 | for (const Bitmap64Ptr& bitmap : bitmaps) { 49 | if (bitmap != nullptr) num_bits += bitmap->bits(); 50 | } 51 | 52 | Bitmap64 global_bitmap(num_bits); 53 | size_t base_index = 0; 54 | for (const Bitmap64Ptr& bitmap : bitmaps) { 55 | if (bitmap == nullptr) continue; 56 | for (const size_t index : bitmap->TrueBitIndices()) { 57 | global_bitmap.Set(base_index + index, true); 58 | } 59 | base_index += bitmap->bits(); 60 | } 61 | return global_bitmap; 62 | } 63 | 64 | static void DenseEncode(const Bitmap64& bitmap, std::string* out) { 65 | using Block = boost::dynamic_bitset<>::block_type; 66 | const size_t 67 | bitmap_size_in_bytes = bitmap.bitset_.num_blocks() * sizeof(Block); 68 | const uint32_t num_rank_blocks = bitmap.rank_lookup_table_.size(); 69 | const size_t rank_size_in_bytes = num_rank_blocks * 
sizeof(uint32_t); 70 | const size_t size_in_bytes = sizeof(uint32_t) // Number of bits. 71 | + bitmap_size_in_bytes 72 | + sizeof(uint32_t) // Number of rank entries. 73 | + rank_size_in_bytes; 74 | 75 | if (out->size() < size_in_bytes) out->resize(size_in_bytes); 76 | 77 | // Encode bitmap. 78 | const uint32_t num_bits = bitmap.bits(); 79 | *reinterpret_cast<uint32_t*>(out->data()) = num_bits; 80 | size_t pos = sizeof(uint32_t); 81 | boost::to_block_range(bitmap.bitset_, 82 | reinterpret_cast<Block*>(out->data() + pos)); 83 | pos += bitmap_size_in_bytes; 84 | 85 | // Encode `rank_lookup_table_`. 86 | *reinterpret_cast<uint32_t*>(out->data() + pos) = num_rank_blocks; 87 | pos += sizeof(uint32_t); 88 | std::memcpy(out->data() + pos, 89 | bitmap.rank_lookup_table_.data(), 90 | rank_size_in_bytes); 91 | } 92 | 93 | static Bitmap64 DenseDecode(absl::string_view encoded) { 94 | using DynamicBitset = boost::dynamic_bitset<>; 95 | using Block = DynamicBitset::block_type; 96 | 97 | // Decode bitmap. 98 | const uint32_t 99 | num_bits = *reinterpret_cast<const uint32_t*>(encoded.data()); 100 | size_t pos = sizeof(uint32_t); 101 | const Block* begin = reinterpret_cast<const Block*>(encoded.data() + pos); 102 | Bitmap64 decoded(num_bits); 103 | size_t num_blocks = std::ceil(static_cast<double>(num_bits) 104 | / DynamicBitset::bits_per_block); 105 | boost::from_block_range(begin, begin + num_blocks, decoded.bitset_); 106 | pos += num_blocks * sizeof(Block); 107 | 108 | // Decode `rank_lookup_table_`. 
109 | const uint32_t 110 | num_rank_blocks = 111 | *reinterpret_cast<const uint32_t*>(encoded.data() + pos); 112 | pos += sizeof(uint32_t); 113 | decoded.rank_lookup_table_.resize(num_rank_blocks); 114 | std::memcpy(decoded.rank_lookup_table_.data(), 115 | encoded.data() + pos, 116 | num_rank_blocks * sizeof(uint32_t)); 117 | 118 | return decoded; 119 | } 120 | 121 | Bitmap64() = default; 122 | 123 | explicit Bitmap64(size_t num_bits) : bitset_(num_bits) {} 124 | 125 | Bitmap64(size_t num_bits, bool fill_value) : Bitmap64(num_bits) { 126 | for (size_t i = 0; i < bits(); ++i) bitset_[i] = fill_value; 127 | } 128 | 129 | // Copies only the underlying boost dynamic bitset; `rank_lookup_table_` is not copied. 130 | Bitmap64(const Bitmap64& other) : bitset_(other.bitset_) {} 131 | 132 | size_t bits() const { return bitset_.size(); } 133 | 134 | bool Get(size_t pos) const { return bitset_[pos]; } 135 | 136 | // Initializes `rank_lookup_table_`. Precomputes the ranks of bit-blocks of 137 | // size `kRankBlockSize`. 138 | void InitRankLookupTable() { 139 | // Do not build lookup table if there is only a single block. 140 | if (bits() <= kRankBlockSize) return; 141 | 142 | const size_t num_rank_blocks = bits() / kRankBlockSize + 1; 143 | rank_lookup_table_.resize(num_rank_blocks); 144 | size_t cumulative_rank = 0; 145 | for (size_t i = 0; i < num_rank_blocks - 1; ++i) { 146 | rank_lookup_table_[i] = cumulative_rank; 147 | // Add number of set bits of current block to `cumulative_rank`. 148 | cumulative_rank += 149 | GetOnesCountInRankBlock(/*rank_block_id=*/i, /*limit_within_block=*/ 150 | kRankBlockSize); 151 | } 152 | rank_lookup_table_[num_rank_blocks - 1] = cumulative_rank; 153 | } 154 | 155 | // Returns rank of `limit`, i.e., the number of set bits in [0, limit). 156 | size_t GetOnesCountBeforeLimit(size_t limit) const { 157 | assert(limit <= bits()); 158 | 159 | if (limit == 0) return 0; 160 | 161 | if (rank_lookup_table_.empty()) { 162 | // No precomputed ranks. Compute rank manually. 
163 | size_t ones_count = 0; 164 | for (size_t i = 0; i < limit; ++i) ones_count += bitset_[i]; 165 | return ones_count; 166 | } 167 | 168 | // Get rank from `rank_lookup_table_` and add rank of last rank block. 169 | const size_t last_pos = limit - 1; 170 | const size_t rank_block_id = last_pos / kRankBlockSize; 171 | const size_t limit_within_block = (last_pos & (kRankBlockSize - 1)) + 1; 172 | return rank_lookup_table_[rank_block_id] 173 | + GetOnesCountInRankBlock(rank_block_id, limit_within_block); 174 | } 175 | 176 | size_t GetOnesCount() const { return GetOnesCountBeforeLimit(bits()); } 177 | 178 | size_t GetZeroesCount() const { return bits() - GetOnesCount(); } 179 | 180 | bool IsAllZeroes() const { return bitset_.none(); } 181 | 182 | void Set(size_t pos, bool value) { bitset_[pos] = value; } 183 | 184 | std::vector<size_t> TrueBitIndices() const { 185 | std::vector<size_t> indices; 186 | 187 | for (size_t i = 0; i < bits(); ++i) { 188 | if (bitset_[i]) indices.push_back(i); 189 | } 190 | 191 | return indices; 192 | } 193 | 194 | std::string ToString() const { 195 | std::string result; 196 | boost::to_string(bitset_, result); 197 | return result; 198 | } 199 | 200 | private: 201 | // Returns the number of set bits in rank block `rank_block_id` before 202 | // `limit_within_block`. 203 | size_t GetOnesCountInRankBlock(const size_t rank_block_id, 204 | const size_t limit_within_block) const { 205 | const size_t start = rank_block_id * kRankBlockSize; 206 | const size_t end = start + limit_within_block; 207 | assert(end <= bits()); 208 | size_t ones_count = 0; 209 | for (size_t i = start; i < end; ++i) { 210 | ones_count += bitset_[i]; 211 | } 212 | return ones_count; 213 | } 214 | 215 | boost::dynamic_bitset<> bitset_; 216 | // Stores precomputed ranks of bit-blocks of size `kRankBlockSize`. 
217 | std::vector<uint32_t> rank_lookup_table_; 218 | }; 219 | 220 | } // namespace ci 221 | 222 | #endif // CUCKOO_INDEX_COMMON_BITMAP_H_ 223 | -------------------------------------------------------------------------------- /common/byte_coding_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: byte_coding_test.cc 17 | // ----------------------------------------------------------------------------- 18 | // 19 | // Tests for the vector-coders. 20 | 21 | #include "common/byte_coding.h" 22 | 23 | #include 24 | #include 25 | #include 26 | #include 27 | 28 | #include "absl/strings/str_cat.h" 29 | #include "absl/strings/string_view.h" 30 | #include "absl/types/span.h" 31 | #include "gtest/gtest.h" 32 | 33 | namespace ci { 34 | namespace { 35 | 36 | // Size to use for vectors to be filled in tests. 37 | constexpr size_t kVectorSize = 1000; 38 | 39 | class CodersTest : public testing::Test { 40 | protected: 41 | void SetUp() override { 42 | vec_.clear(); 43 | buf_.set_pos(0); 44 | } 45 | 46 | // Checks all PutPrimitive and GetPrimitive variants on std::vector and 47 | // ByteBuffer. 
48 | template 49 | void CheckPutAndGetPrimitive(const T& value) { 50 | // *** std::vector *** 51 | const size_t old_size = vec_.size(); 52 | vec_.resize(vec_.size() + sizeof(T)); 53 | size_t pos = old_size; 54 | PutPrimitive(value, absl::Span(vec_), &pos); 55 | EXPECT_EQ(pos, vec_.size()); 56 | pos = old_size; 57 | EXPECT_EQ(value, GetPrimitive(vec_, &pos)); 58 | EXPECT_EQ(pos, vec_.size()); 59 | 60 | // *** ByteBuffer *** 61 | const size_t old_pos = buf_.pos(); 62 | PutPrimitive(value, &buf_); 63 | size_t end_pos = buf_.pos(); 64 | EXPECT_EQ(end_pos - old_pos, sizeof(T)); 65 | buf_.set_pos(old_pos); 66 | EXPECT_EQ(value, GetPrimitive(&buf_)); 67 | EXPECT_EQ(buf_.pos(), end_pos); 68 | } 69 | 70 | template 71 | void CheckPrimitives(absl::Span values) { 72 | for (const T& value : values) CheckPutAndGetPrimitive(value); 73 | } 74 | 75 | std::vector vec_; 76 | ByteBuffer buf_; 77 | }; 78 | 79 | TEST_F(CodersTest, Checkint32) { 80 | CheckPrimitives({-17, -1, 0, 1, 17, 42, 81 | std::numeric_limits::min(), 82 | std::numeric_limits::max()}); 83 | } 84 | 85 | TEST_F(CodersTest, CheckUint32) { 86 | CheckPrimitives( 87 | {0, 1, 17, 42, std::numeric_limits::max()}); 88 | } 89 | 90 | TEST_F(CodersTest, Checkint64) { 91 | const std::vector int64_ts({-17, -1, 0, 1, 17, 42, 92 | std::numeric_limits::min(), 93 | std::numeric_limits::max()}); 94 | CheckPrimitives(int64_ts); 95 | } 96 | 97 | TEST_F(CodersTest, CheckUint64) { 98 | const std::vector uint64_ts( 99 | {0, 1, 17, 42, std::numeric_limits::max()}); 100 | CheckPrimitives(uint64_ts); 101 | } 102 | 103 | TEST_F(CodersTest, CheckFloat) { 104 | using nl = std::numeric_limits; 105 | 106 | CheckPrimitives({-17, -1, 0, 1, 17, 42, nl::min(), -nl::min(), 107 | nl::max(), -nl::max(), nl::infinity(), 108 | -nl::infinity()}); 109 | } 110 | 111 | TEST_F(CodersTest, CheckDouble) { 112 | using nl = std::numeric_limits; 113 | 114 | const std::vector doubles({-17, -1, 0, 1, 17, 42, nl::min(), 115 | -nl::min(), nl::max(), -nl::max(), 116 | 
nl::infinity(), -nl::infinity()}); 117 | CheckPrimitives(doubles); 118 | } 119 | 120 | TEST_F(CodersTest, CheckString) { 121 | const std::string arr[] = { 122 | "", "James", "Dean", "Humphrey Bogart", std::string("\xFF\0\xFF", 3), ""}; 123 | size_t pos = 0; 124 | vec_.resize(kVectorSize); 125 | for (const std::string& value : arr) { 126 | PutString(value, &buf_); 127 | PutString(value, absl::Span(vec_), &pos); 128 | } 129 | pos = 0; 130 | buf_.set_pos(0); 131 | for (const std::string& value : arr) { 132 | EXPECT_EQ(value, GetString(&buf_)); 133 | EXPECT_EQ(value, GetString(vec_, &pos)); 134 | } 135 | } 136 | 137 | } // namespace 138 | } // namespace ci 139 | -------------------------------------------------------------------------------- /common/profiling.cc: -------------------------------------------------------------------------------- 1 | #include "common/profiling.h" 2 | 3 | namespace ci { 4 | 5 | Profiler& Profiler::GetThreadInstance() { 6 | thread_local static Profiler static_profiler; 7 | return static_profiler; 8 | } 9 | 10 | } // namespace ci 11 | -------------------------------------------------------------------------------- /common/profiling.h: -------------------------------------------------------------------------------- 1 | #include 2 | 3 | #include "absl/container/flat_hash_map.h" 4 | #include "absl/time/time.h" 5 | 6 | namespace ci { 7 | 8 | enum class Counter { 9 | ValueToStripeBitmaps, 10 | DistributeValues, 11 | CreateSlots, 12 | CreateFingerprintStore, 13 | GetGlobalBitmap 14 | }; 15 | 16 | // A simple profiler that can collect stats. Use `ScopedProfile` for registering 17 | // counters to profile. 18 | // 19 | // Is thread-safe, because there can be only a single instance per thread. 20 | class Profiler { 21 | public: 22 | // Retrieves a `Profiler` instance for the current thread. 
23 | static Profiler& GetThreadInstance(); 24 | 25 | Profiler(const Profiler&) = delete; 26 | Profiler& operator=(const Profiler&) = delete; 27 | virtual ~Profiler() {} 28 | 29 | int64_t GetValue(Counter counter) const { 30 | const auto value_it = counters_.find(counter); 31 | return value_it != counters_.end() ? value_it->second : 0; 32 | } 33 | 34 | void Reset() { counters_.clear(); } 35 | 36 | private: 37 | friend class ScopedProfile; 38 | 39 | Profiler() {} 40 | 41 | // Starts profiling for the given counter. 42 | void Start(Counter counter) { 43 | counters_[counter] -= absl::GetCurrentTimeNanos(); 44 | } 45 | 46 | // Stops profiling for the given counter. 47 | void Stop(Counter counter) { 48 | counters_[counter] += absl::GetCurrentTimeNanos(); 49 | } 50 | 51 | absl::flat_hash_map<Counter, int64_t> counters_; 52 | }; 53 | 54 | // Instantiate a local variable with this class to profile the local scope. 55 | // Example: 56 | // void MyClass::MyMethod() { 57 | // ScopedProfile t(Counter::MyClass_MyMethod); 58 | // .... // Do expensive stuff. 59 | // } 60 | class ScopedProfile { 61 | public: 62 | // ScopedProfile is neither copyable nor movable. 63 | ScopedProfile(const ScopedProfile&) = delete; 64 | ScopedProfile& operator=(const ScopedProfile&) = delete; 65 | 66 | explicit ScopedProfile(Counter counter) : counter_(counter) { 67 | Profiler::GetThreadInstance().Start(counter); 68 | } 69 | 70 | ~ScopedProfile() { Profiler::GetThreadInstance().Stop(counter_); } 71 | 72 | private: 73 | Counter counter_; 74 | }; 75 | 76 | } // namespace ci 77 | -------------------------------------------------------------------------------- /common/rle_bitmap.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 
5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: rle_bitmap.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_COMMON_RLE_BITMAP_H_ 20 | #define CUCKOO_INDEX_COMMON_RLE_BITMAP_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | 26 | #include "absl/strings/string_view.h" 27 | #include "common/bit_packing.h" 28 | #include "common/bitmap.h" 29 | 30 | namespace ci { 31 | 32 | class RleBitmap; 33 | using RleBitmapPtr = std::unique_ptr<RleBitmap>; 34 | 35 | class RleBitmap { 36 | public: 37 | explicit RleBitmap(const Bitmap64& bitmap); 38 | 39 | // Forbid copying and moving. 40 | RleBitmap(const RleBitmap&) = delete; 41 | RleBitmap& operator=(const RleBitmap&) = delete; 42 | RleBitmap(RleBitmap&&) = delete; 43 | RleBitmap& operator=(RleBitmap&&) = delete; 44 | 45 | absl::string_view data() const { return data_; } 46 | 47 | // Returns the slice of the bitmap of the given `size` starting at `offset`. 48 | Bitmap64 Extract(size_t offset, size_t size) const; 49 | 50 | bool Get(size_t pos) const { return Extract(pos, 1).Get(0); } 51 | 52 | private: 53 | // Extract(..) implementations for the dense and the sparse encoding. 
54 | Bitmap64 ExtractDense(size_t offset, size_t size) const; 55 | Bitmap64 ExtractSparse(size_t offset, size_t size) const; 56 | 57 | bool is_sparse_; 58 | size_t size_; 59 | uint32_t skip_offsets_step_; 60 | size_t skip_offsets_size_; 61 | size_t run_lengths_size_; 62 | size_t bits_size_; 63 | std::string data_; 64 | 65 | BitPackedReader skip_offsets_; 66 | BitPackedReader run_lengths_; 67 | BitPackedReader bits_; 68 | }; 69 | 70 | } // namespace ci 71 | 72 | #endif // CUCKOO_INDEX_COMMON_RLE_BITMAP_H_ 73 | -------------------------------------------------------------------------------- /common/rle_bitmap_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: rle_bitmap_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "common/rle_bitmap.h" 20 | 21 | #include "common/bitmap.h" 22 | #include "gtest/gtest.h" 23 | 24 | namespace ci { 25 | 26 | void CheckBitmap(const Bitmap64& bitmap) { 27 | const RleBitmap rle_bitmap(bitmap); 28 | 29 | // For a host of slices, check that Extract(..) fetches the expected bitmap. 
30 | for (size_t offset = 0; offset < bitmap.bits(); ++offset) { 31 | for (size_t size = 0; size < bitmap.bits() - offset; size = size * 2 + 1) { 32 | const Bitmap64 extracted = rle_bitmap.Extract(offset, size); 33 | for (size_t i = 0; i < size; ++i) 34 | ASSERT_EQ(extracted.Get(i), bitmap.Get(i + offset)); 35 | } 36 | } 37 | } 38 | 39 | TEST(RleBitmapTest, EmptyBitmap) { CheckBitmap(Bitmap64()); } 40 | 41 | TEST(RleBitmapTest, SingleValueBitmaps) { 42 | CheckBitmap(Bitmap64(1, false)); 43 | CheckBitmap(Bitmap64(1, true)); 44 | 45 | CheckBitmap(Bitmap64(2, false)); 46 | CheckBitmap(Bitmap64(2, true)); 47 | 48 | CheckBitmap(Bitmap64(100, false)); 49 | CheckBitmap(Bitmap64(100, true)); 50 | 51 | CheckBitmap(Bitmap64(2000, false)); 52 | CheckBitmap(Bitmap64(2000, true)); 53 | } 54 | 55 | TEST(RleBitmapTest, SparseBitmaps) { 56 | Bitmap64 bitmap(4000); 57 | 58 | bitmap.Set(2018, true); 59 | CheckBitmap(bitmap); 60 | bitmap.Set(2019, true); 61 | CheckBitmap(bitmap); 62 | bitmap.Set(3025, true); 63 | CheckBitmap(bitmap); 64 | bitmap.Set(3999, true); 65 | CheckBitmap(bitmap); 66 | } 67 | 68 | TEST(RleBitmapTest, InterleavedBitmap) { 69 | Bitmap64 bitmap(4000); 70 | size_t step = 0; 71 | bool bit = true; 72 | for (size_t i = 0; i < bitmap.bits(); i += step) { 73 | ++step; 74 | for (size_t j = 0; (j < step) && (i + j < bitmap.bits()); ++j) 75 | bitmap.Set(i + j, bit); 76 | bit ^= true; 77 | } 78 | CheckBitmap(bitmap); 79 | } 80 | 81 | } // namespace ci 82 | -------------------------------------------------------------------------------- /croaring.BUILD: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 
5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # 15 | # ----------------------------------------------------------------------------- 16 | # File: croaring.BUILD 17 | # ----------------------------------------------------------------------------- 18 | 19 | load("@rules_cc//cc:defs.bzl", "cc_library") 20 | 21 | package(default_visibility = ["//visibility:public"]) 22 | 23 | cc_library( 24 | name = "roaring_cpp", 25 | hdrs = ["roaring.h", "roaring.hh"], 26 | srcs = ["roaring.c"], 27 | ) 28 | -------------------------------------------------------------------------------- /csv-parser.BUILD: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 
14 | # 15 | # ----------------------------------------------------------------------------- 16 | # File: csv-parser.BUILD 17 | # ----------------------------------------------------------------------------- 18 | 19 | load("@rules_cc//cc:defs.bzl", "cc_library") 20 | 21 | package(default_visibility = ["//visibility:public"]) 22 | 23 | cc_library( 24 | name = "csv-parser", 25 | hdrs = ["single_include/csv.hpp"], 26 | ) 27 | -------------------------------------------------------------------------------- /cuckoo_index.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_index.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_CUCKOO_INDEX_H_ 20 | #define CUCKOO_INDEX_CUCKOO_INDEX_H_ 21 | 22 | #include "absl/container/flat_hash_map.h" 23 | #include "common/rle_bitmap.h" 24 | #include "cuckoo_utils.h" 25 | #include "fingerprint_store.h" 26 | #include "index_structure.h" 27 | 28 | namespace ci { 29 | 30 | class CuckooIndex : public IndexStructure { 31 | public: 32 | bool StripeContains(size_t stripe_id, int value) const override; 33 | 34 | Bitmap64 GetQualifyingStripes(int value, size_t num_stripes) const override; 35 | 36 | std::string name() const override { return name_; } 37 | 38 | // Returns the in-memory size of the index structure. 39 | size_t byte_size() const override { return byte_size_; } 40 | 41 | // Returns the in-memory size of the compressed index structure. 42 | size_t compressed_byte_size() const override { return compressed_byte_size_; } 43 | 44 | size_t active_slots() const { 45 | size_t active_slots = 0; 46 | for (size_t i = 0; i < fingerprint_store_->num_slots(); ++i) { 47 | if (fingerprint_store_->GetFingerprint(i).active) ++active_slots; 48 | } 49 | return active_slots; 50 | } 51 | 52 | private: 53 | friend class CuckooIndexFactory; 54 | 55 | CuckooIndex(std::string name, size_t num_stripes, size_t slots_per_bucket, 56 | std::unique_ptr fingerprint_store, 57 | Bitmap64Ptr use_prefix_bits_bitmap, 58 | RleBitmapPtr global_slot_bitmap, size_t byte_size, 59 | size_t compressed_byte_size) 60 | : name_(name), 61 | num_stripes_(num_stripes), 62 | num_buckets_(fingerprint_store->num_slots() / slots_per_bucket), 63 | slots_per_bucket_(slots_per_bucket), 64 | fingerprint_store_(std::move(fingerprint_store)), 65 | use_prefix_bits_bitmap_(std::move(use_prefix_bits_bitmap)), 66 | global_slot_bitmap_(std::move(global_slot_bitmap)), 67 | 
byte_size_(byte_size), 68 | compressed_byte_size_(compressed_byte_size) { 69 | assert(fingerprint_store_->num_slots() % slots_per_bucket_ == 0); 70 | } 71 | 72 | // Returns true if the given bucket contains the fingerprint (taking only 73 | // the relevant bits into account). In case it does, `slot` is set to the 74 | // slot which contains it (one of the `slots_per_bucket_` possible ones). 75 | bool BucketContains(size_t bucket, uint64_t fingerprint, size_t* slot) const; 76 | 77 | size_t GetNthNonEmptyBitmapSlot(size_t n) const { 78 | // Inactive slots are empty and their corresponding bitmaps are skipped in 79 | // the `global_slot_bitmap_`, so we need to compute the actual slot by 80 | // subtracting the number of skipped (empty) slots before slot `n`. 81 | return n - 82 | fingerprint_store_->EmptySlotsBitmap().GetOnesCountBeforeLimit(n); 83 | } 84 | 85 | const std::string name_; 86 | const size_t num_stripes_; 87 | const size_t num_buckets_; 88 | const size_t slots_per_bucket_; 89 | 90 | const std::unique_ptr fingerprint_store_; 91 | // Indicates for every bucket whether prefix or suffix bits of hash 92 | // fingerprints were used. 93 | const Bitmap64Ptr use_prefix_bits_bitmap_; 94 | // Concatenated slot bitmaps for *active* slots. 95 | const RleBitmapPtr global_slot_bitmap_; 96 | 97 | // The sizes of the encoded data-structures. 98 | // TODO: after fine-tuning the encodings, actually store the encoded 99 | // data-structures and add methods which serialize / deserialize for testing 100 | // purposes. As follow-up also do the lookups based on these data-structures 101 | // instead of the expanded Fingerprints and Bitmaps above. 
102 | const size_t byte_size_; 103 | const size_t compressed_byte_size_; 104 | }; 105 | 106 | // How the distribution of values to their primary / secondary bucket is chosen: 107 | // "Classically" by kicking out existing values (KICKING), using a biased coin 108 | // toss during the kicking procedure to increase the ratio of primary-bucket 109 | // placements (SKEWED_KICKING), or by finding an optimal solution via a 110 | // weighted-matching algorithm (MATCHING). 111 | enum class CuckooAlgorithm { KICKING, SKEWED_KICKING, MATCHING }; 112 | 113 | class CuckooIndexFactory : public IndexStructureFactory { 114 | public: 115 | explicit CuckooIndexFactory(CuckooAlgorithm cuckoo_alg, 116 | double max_load_factor, double scan_rate, 117 | size_t slots_per_bucket, 118 | bool prefix_bits_optimization) 119 | : cuckoo_alg_(cuckoo_alg), 120 | max_load_factor_(max_load_factor), 121 | scan_rate_(scan_rate), 122 | slots_per_bucket_(slots_per_bucket), 123 | prefix_bits_optimization_(prefix_bits_optimization) {} 124 | 125 | std::unique_ptr Create( 126 | const Column& column, size_t num_rows_per_stripe) const override; 127 | 128 | std::string index_name() const override; 129 | 130 | private: 131 | const CuckooAlgorithm cuckoo_alg_; 132 | const double max_load_factor_; 133 | const double scan_rate_; 134 | const size_t slots_per_bucket_; 135 | // If set, dynamically uses either prefix or suffix bits of hash fingerprints 136 | // on a bucket basis (depending on which of the two requires fewer bits to 137 | // make fingerprints collision free). 
138 | const bool prefix_bits_optimization_; 139 | }; 140 | 141 | } // namespace ci 142 | 143 | #endif // CUCKOO_INDEX_CUCKOO_INDEX_H_ 144 | -------------------------------------------------------------------------------- /cuckoo_index_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_index_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "cuckoo_index.h" 20 | 21 | #include 22 | #include 23 | 24 | #include "cuckoo_utils.h" 25 | #include "data.h" 26 | #include "gmock/gmock.h" 27 | #include "gtest/gtest.h" 28 | #include "index_structure.h" 29 | 30 | namespace ci { 31 | 32 | // For all tests we create CuckooIndexes with 300 rows and 3 rows per stripe. 33 | constexpr size_t kNumRows = 300; 34 | constexpr size_t kNumRowsPerStripe = 3; 35 | // Number of negative lookups to perform. 36 | constexpr int kNumNegativeLookups = 10 * 1000; 37 | 38 | // Returns a column with `num_rows` entries and `num_values` distinct values, 39 | // laid out in increasing order as runs of `num_rows / num_values` equal values. 
40 | ColumnPtr FillColumn(size_t num_rows, size_t num_values) { 41 | assert(num_values > 0); 42 | assert(num_rows >= num_values); 43 | assert(num_rows % num_values == 0); 44 | const size_t factor = num_rows / num_values; 45 | std::vector data(num_rows, 0); 46 | for (size_t i = 0; i < num_rows; ++i) data[i] = i / factor; 47 | return Column::IntColumn("int-column", std::move(data)); 48 | } 49 | 50 | // Checks that positive lookups are exact (for all values in the column). 51 | void CheckPositiveLookups(const Column& column, const IndexStructure* index) { 52 | const size_t num_stripes = column.num_rows() / kNumRowsPerStripe; 53 | for (const int value : column.distinct_values()) { 54 | const Bitmap64 result = index->GetQualifyingStripes(value, num_stripes); 55 | for (size_t stripe_id = 0; stripe_id < num_stripes; ++stripe_id) { 56 | EXPECT_EQ(column.StripeContains(kNumRowsPerStripe, stripe_id, value), 57 | result.Get(stripe_id)); 58 | } 59 | } 60 | } 61 | 62 | // Returns the average scan-rate of `kNumNegativeLookups` negative lookups. 63 | double ScanRateNegativeLookups(const Column& column, 64 | const IndexStructure* index) { 65 | const size_t num_stripes = column.num_rows() / kNumRowsPerStripe; 66 | const int start = column.max() + 1; 67 | assert(start + kNumNegativeLookups <= std::numeric_limits::max()); 68 | size_t num_false_positive_stripes = 0; 69 | for (int value = start; value < start + kNumNegativeLookups; ++value) { 70 | const Bitmap64 result = index->GetQualifyingStripes(value, num_stripes); 71 | num_false_positive_stripes += result.GetOnesCount(); 72 | } 73 | return static_cast(num_false_positive_stripes) / 74 | (num_stripes * kNumNegativeLookups); 75 | } 76 | 77 | // Helper for the PositiveLookups* tests below: checks lookups of all existing 78 | // values are exact. 
79 | void PositiveLookups(const size_t num_values, 80 | const bool prefix_bits_optimization) { 81 | const ColumnPtr column = FillColumn(kNumRows, num_values); 82 | 83 | for (const CuckooAlgorithm alg : 84 | std::vector{CuckooAlgorithm::KICKING}) { 85 | const IndexStructurePtr index = 86 | CuckooIndexFactory(alg, kMaxLoadFactor2SlotsPerBucket, 87 | /*scan_rate=*/0.05, 88 | /*slots_per_bucket=*/2, prefix_bits_optimization) 89 | .Create(*column, kNumRowsPerStripe); 90 | CheckPositiveLookups(*column, index.get()); 91 | } 92 | } 93 | 94 | // Creates two different cuckoo-indexes with scan-rate 0.1 and 0.01 and checks 95 | // that the scan-rate is bounded as expected. 96 | void NegativeLookups(const size_t num_values, 97 | const bool prefix_bits_optimization) { 98 | const ColumnPtr column = FillColumn(kNumRows, num_values); 99 | 100 | for (const CuckooAlgorithm alg : 101 | std::vector{CuckooAlgorithm::KICKING}) { 102 | const IndexStructurePtr index1 = 103 | CuckooIndexFactory(alg, kMaxLoadFactor2SlotsPerBucket, 104 | /*scan_rate=*/0.1, /*slots_per_bucket=*/2, 105 | prefix_bits_optimization) 106 | .Create(*column, kNumRowsPerStripe); 107 | const double scan_rate1 = ScanRateNegativeLookups(*column, index1.get()); 108 | EXPECT_LE(scan_rate1, 0.101); 109 | EXPECT_GT(scan_rate1, 0.0); 110 | 111 | // TODO: The following test is very time-consuming, causing the suite to 112 | // time out after 5 minutes. Find a way to parallelize it or move to 113 | // a separate test suite. 
114 | // const IndexStructurePtr index2 = 115 | // CuckooIndexFactory(alg, kMaxLoadFactor2SlotsPerBucket, 116 | // [>scan_rate=*/0.01, /*slots_per_bucket=<]2, 117 | // prefix_bits_optimization) 118 | // .Create(*column, kNumRowsPerStripe); 119 | // const double scan_rate2 = ScanRateNegativeLookups(*column, index2.get()); 120 | // EXPECT_LE(scan_rate2, 0.0101); 121 | // EXPECT_GT(scan_rate2, 0.0); 122 | } 123 | } 124 | 125 | // *** The actual tests: *** 126 | 127 | TEST(CuckooIndexTest, PositiveLookupsSingleValue) { 128 | PositiveLookups(/*num_values=*/1, /*prefix_bits_optimization=*/false); 129 | } 130 | 131 | TEST(CuckooIndexTest, PositiveLookupsSingleValueWithPrefixBitsOptimization) { 132 | PositiveLookups(/*num_values=*/1, /*prefix_bits_optimization=*/true); 133 | } 134 | 135 | TEST(CuckooIndexTest, NegativeLookupsSingleValue) { 136 | NegativeLookups(/*num_values=*/1, /*prefix_bits_optimization=*/false); 137 | } 138 | 139 | TEST(CuckooIndexTest, NegativeLookupsSingleValueWithPrefixBitsOptimization) { 140 | NegativeLookups(/*num_values=*/1, /*prefix_bits_optimization=*/true); 141 | } 142 | 143 | TEST(CuckooIndexTest, PositiveLookupsFewValues) { 144 | PositiveLookups(/*num_values=*/30, /*prefix_bits_optimization=*/false); 145 | } 146 | 147 | TEST(CuckooIndexTest, PositiveLookupsFewValuesWithPrefixBitsOptimization) { 148 | PositiveLookups(/*num_values=*/30, /*prefix_bits_optimization=*/true); 149 | } 150 | 151 | TEST(CuckooIndexTest, NegativeLookupsFewValues) { 152 | NegativeLookups(/*num_values=*/30, /*prefix_bits_optimization=*/false); 153 | } 154 | 155 | TEST(CuckooIndexTest, NegativeLookupsFewValuesWithPrefixBitsOptimization) { 156 | NegativeLookups(/*num_values=*/30, /*prefix_bits_optimization=*/true); 157 | } 158 | 159 | TEST(CuckooIndexTest, PositiveLookupsAllUniques) { 160 | PositiveLookups(/*num_values=*/kNumRows, /*prefix_bits_optimization=*/false); 161 | } 162 | 163 | TEST(CuckooIndexTest, PositiveLookupsAllUniquesWithPrefixBitsOptimization) { 164 | 
PositiveLookups(/*num_values=*/kNumRows, /*prefix_bits_optimization=*/true); 165 | } 166 | 167 | TEST(CuckooIndexTest, NegativeLookupsAllUniques) { 168 | NegativeLookups(/*num_values=*/kNumRows, /*prefix_bits_optimization=*/false); 169 | } 170 | 171 | // TEST(CuckooIndexTest, NegativeLookupsAllUniquesWithPrefixBitsOptimization) { 172 | // NegativeLookups([>num_values=<]kNumRows, [>prefix_bits_optimization=<]true); 173 | // } 174 | 175 | TEST(CuckooIndexTest, LastRowDropped) { 176 | // The last row will be dropped, since only stripes with `kNumRowsPerStripe` 177 | // (= 3) rows are created. 178 | const ColumnPtr column = FillColumn(/*num_rows=*/4, /*num_values=*/4); 179 | const IndexStructurePtr index = 180 | CuckooIndexFactory(CuckooAlgorithm::KICKING, 181 | kMaxLoadFactor2SlotsPerBucket, 182 | /*scan_rate=*/0.1, /*slots_per_bucket=*/2, 183 | /*prefix_bits_optimization=*/false) 184 | .Create(*column, kNumRowsPerStripe); 185 | EXPECT_EQ(reinterpret_cast(*index).active_slots(), 3); 186 | } 187 | 188 | } // namespace ci 189 | -------------------------------------------------------------------------------- /cuckoo_kicker.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_kicker.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "cuckoo_kicker.h" 20 | 21 | #include 22 | 23 | #include "cuckoo_utils.h" 24 | 25 | namespace ci { 26 | 27 | constexpr size_t CuckooKicker::kDefaultMaxKicks; 28 | 29 | size_t CuckooKicker::GetNumSecondaryItems(const size_t bucket_idx) const { 30 | const Bucket& bucket = buckets_[bucket_idx]; 31 | return absl::c_count_if(bucket.slots_, 32 | [bucket_idx](const CuckooValue& value) { 33 | return value.secondary_bucket == bucket_idx; 34 | }); 35 | } 36 | 37 | void CuckooKicker::FindVictim(const size_t victim_idx, 38 | const size_t primary_bucket_idx, 39 | const size_t secondary_bucket_idx, 40 | const bool kick_secondary, 41 | size_t* victim_bucket_idx, 42 | size_t* idx_within_victim_bucket) const { 43 | // Iterate through potential victims in primary and secondary bucket until 44 | // we've found `victim_idx`. 45 | size_t curr_victim_idx = 0; 46 | 47 | auto search_bucket = [&](size_t bucket_idx) { 48 | const Bucket& bucket = buckets_[bucket_idx]; 49 | for (size_t i = 0; i < bucket.slots_.size(); ++i) { 50 | const CuckooValue& curr_val = bucket.slots_[i]; 51 | const size_t bucket_idx_to_compare = 52 | kick_secondary ? curr_val.secondary_bucket : curr_val.primary_bucket; 53 | if (bucket_idx_to_compare == bucket_idx) { 54 | // Slot `i` contains a potential victim. 
55 | if (curr_victim_idx == victim_idx) { 56 | *victim_bucket_idx = bucket_idx; 57 | *idx_within_victim_bucket = i; 58 | return true; 59 | } 60 | ++curr_victim_idx; 61 | } 62 | } 63 | return false; 64 | }; 65 | 66 | if (search_bucket(primary_bucket_idx)) return; 67 | if (search_bucket(secondary_bucket_idx)) return; 68 | 69 | assert(false); // "Couldn't find victim with idx " << victim_idx; 70 | } 71 | 72 | CuckooValue CuckooKicker::SwapWithValue(Bucket* bucket, const size_t victim_idx, 73 | const CuckooValue& value) { 74 | const CuckooValue victim = bucket->slots_[victim_idx]; 75 | bucket->slots_[victim_idx] = value; 76 | return victim; 77 | } 78 | 79 | CuckooValue CuckooKicker::SwapWithRandomValue(const CuckooValue& value, 80 | size_t* victim_bucket_idx) { 81 | Bucket* primary_bucket = &buckets_[value.primary_bucket]; 82 | Bucket* secondary_bucket = &buckets_[value.secondary_bucket]; 83 | 84 | // Method may only be called when both buckets are full. 85 | assert(primary_bucket->slots_.size() == slots_per_bucket_); 86 | assert(secondary_bucket->slots_.size() == slots_per_bucket_); 87 | 88 | if (!skew_kicking_) { 89 | // Select victim bucket. 90 | *victim_bucket_idx = 91 | GetRandomBool() ? value.primary_bucket : value.secondary_bucket; 92 | Bucket* victim_bucket = &buckets_[*victim_bucket_idx]; 93 | // Choose any value as victim (irrespective of whether it resides in its 94 | // primary or secondary bucket). 95 | return SwapWithValue(victim_bucket, GetRandomVictimIndex(), value); 96 | } 97 | 98 | // Skew kicking. 99 | 100 | const size_t num_slots_both_buckets = 2 * slots_per_bucket_; 101 | 102 | // Count number of items that reside in their secondary bucket. 103 | const size_t num_in_secondary = GetNumSecondaryItems(value.primary_bucket) + 104 | GetNumSecondaryItems(value.secondary_bucket); 105 | 106 | if (num_in_secondary == 0 || num_in_secondary == num_slots_both_buckets) { 107 | // Can't perform skewed kick. Just kick any item. 
108 | *victim_bucket_idx = 109 | GetRandomBool() ? value.primary_bucket : value.secondary_bucket; 110 | Bucket* victim_bucket = &buckets_[*victim_bucket_idx]; 111 | return SwapWithValue(victim_bucket, GetRandomVictimIndex(), value); 112 | } 113 | const size_t num_in_primary = num_slots_both_buckets - num_in_secondary; 114 | 115 | // "Weigh" probability according to the ratio of items in secondary buckets 116 | // vs. items in primary buckets. For example, if the ratio is 3:1, we give 117 | // three times as much weight to the set of items in secondary buckets to have 118 | // equal probabilities for all items under a `kick_skew_factor_` of 1.0. The 119 | // reason is that, below, we first decide whether to kick a primary or a 120 | // secondary item and only then choose a victim within the respective set. 121 | double secondary_weight_factor = 122 | static_cast<double>(num_in_secondary) / num_in_primary; 123 | 124 | // Apply `kick_skew_factor_`. 125 | secondary_weight_factor *= kick_skew_factor_; 126 | 127 | // Set probability such that it is `secondary_weight_factor` times more likely 128 | // to kick an item from the secondary vs. one from the primary set. 129 | const double weighted_probability = 130 | secondary_weight_factor / (secondary_weight_factor + 1); 131 | assert(weighted_probability > 0.0 && weighted_probability < 1.0); 132 | 133 | // Roll a (skewed) die to decide whether we want to kick a primary or a 134 | // secondary item. 135 | const bool kick_secondary = GetRandomBool(weighted_probability); 136 | 137 | // Choose victim index (within set of primary/secondary items). 138 | const size_t num_potential_victims = 139 | kick_secondary ? num_in_secondary : num_in_primary; 140 | const size_t victim_idx = GetRandomVictimIndex(num_potential_victims); 141 | 142 | // Find and kick victim. 
143 | size_t idx_within_victim_bucket; 144 | FindVictim(victim_idx, value.primary_bucket, value.secondary_bucket, 145 | kick_secondary, victim_bucket_idx, &idx_within_victim_bucket); 146 | Bucket* victim_bucket = &buckets_[*victim_bucket_idx]; 147 | return SwapWithValue(victim_bucket, idx_within_victim_bucket, value); 148 | } 149 | 150 | bool CuckooKicker::InsertValueWithKick(CuckooValue* value) { 151 | // Swap `value` with random value inside its primary or secondary bucket. 152 | size_t victim_bucket_idx; 153 | const CuckooValue victim = SwapWithRandomValue(*value, &victim_bucket_idx); 154 | 155 | // Try to insert `victim` into its alternative bucket. 156 | const size_t alternative_bucket_idx = 157 | victim_bucket_idx == victim.primary_bucket ? victim.secondary_bucket 158 | : victim.primary_bucket; 159 | Bucket* alternative_bucket = &buckets_[alternative_bucket_idx]; 160 | if (alternative_bucket->InsertValue(victim)) return true; 161 | 162 | // Alternative bucket is full. Victim becomes new in-flight value. 163 | *value = victim; 164 | return false; 165 | } 166 | 167 | bool CuckooKicker::InsertValueWithKicking(const CuckooValue& value) { 168 | Bucket* primary_bucket = &buckets_[value.primary_bucket]; 169 | Bucket* secondary_bucket = &buckets_[value.secondary_bucket]; 170 | 171 | if (primary_bucket->InsertValue(value)) return true; 172 | if (secondary_bucket->InsertValue(value)) return true; 173 | 174 | // Both buckets are full. Try to insert with kicking. 175 | CuckooValue in_flight_value = value; 176 | for (size_t num_kicks = 0; num_kicks <= max_kicks_; ++num_kicks) { 177 | if (InsertValueWithKick(&in_flight_value)) { 178 | if (num_kicks > max_kicks_observed_) max_kicks_observed_ = num_kicks; 179 | return true; 180 | } 181 | } 182 | 183 | // Exceeded `max_kicks_` kicks. Insertion failed. 
184 | return false; 185 | } 186 | 187 | } // namespace ci 188 | -------------------------------------------------------------------------------- /cuckoo_kicker.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_kicker.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_CUCKOO_KICKER_H_ 20 | #define CUCKOO_INDEX_CUCKOO_KICKER_H_ 21 | 22 | #include "absl/container/flat_hash_map.h" 23 | #include "absl/random/random.h" 24 | #include "cuckoo_utils.h" 25 | 26 | namespace ci { 27 | 28 | // Used to skew the kicking procedure towards items that reside in their 29 | // secondary bucket (by using a value greater than 1.0). Note that skewed 30 | // kicking affects build performance and may lead to build failures. Below 31 | // constants were obtained empirically using a random test set of 1M items. 
32 | constexpr double kKickSkewFactor1SlotsPerBucket = 1.1; 33 | constexpr double kKickSkewFactor2SlotsPerBucket = 16.0; 34 | constexpr double kKickSkewFactor4SlotsPerBucket = 128.0; 35 | constexpr double kKickSkewFactor8SlotsPerBucket = 1024.0; 36 | static const absl::flat_hash_map GetSkewFactorMap() { 37 | return absl::flat_hash_map{ 38 | {1, kKickSkewFactor1SlotsPerBucket}, 39 | {2, kKickSkewFactor2SlotsPerBucket}, 40 | {4, kKickSkewFactor4SlotsPerBucket}, 41 | {8, kKickSkewFactor8SlotsPerBucket}}; 42 | } 43 | 44 | // Distributes values to `buckets` using the kicking algorithm. 45 | class CuckooKicker { 46 | public: 47 | static constexpr size_t kDefaultMaxKicks = 50000; 48 | 49 | // Setting `skew_kicking` may lead to a smaller index (since items in 50 | // secondary buckets may affect the minimum fingerprint lengths of the 51 | // corresponding primary buckets). Another positive effect is that positive 52 | // lookups are more likely to find a match in their primary bucket. On the 53 | // contrary, users should be aware that this increases build time and may lead 54 | // to build failures. 55 | CuckooKicker(size_t slots_per_bucket, absl::Span buckets, 56 | bool skew_kicking = false, size_t max_kicks = kDefaultMaxKicks) 57 | : gen_(absl::SeedSeq({42})), 58 | slots_per_bucket_(slots_per_bucket), 59 | buckets_(buckets), 60 | skew_kicking_(skew_kicking), 61 | kick_skew_factor_(GetSkewFactorMap().at(slots_per_bucket)), 62 | max_kicks_(max_kicks), 63 | max_kicks_observed_(0), 64 | successful_inserts_(0) {} 65 | 66 | // Returns false if `values` couldn't be distributed to `buckets` with 67 | // kicking. 
68 | bool InsertValues(absl::Span<const CuckooValue> values) { 69 | for (const CuckooValue& value : values) { 70 | if (!InsertValueWithKicking(value)) return false; 71 | ++successful_inserts_; 72 | } 73 | return true; 74 | } 75 | 76 | void PrintStats() { 77 | std::cout << "slots per bucket: " << slots_per_bucket_ << std::endl; 78 | std::cout << "max kicks observed: " << max_kicks_observed_ << std::endl; 79 | std::cout << "successful inserts: " << successful_inserts_ << std::endl; 80 | std::cout << "load factor: " 81 | << static_cast<double>(successful_inserts_) / 82 | (buckets_.size() * slots_per_bucket_) 83 | << std::endl; 84 | } 85 | 86 | private: 87 | // Returns a "random" bool with the probability of drawing true being 88 | // `true_probability`. By default, performs an unbiased coin toss. 89 | bool GetRandomBool(const double true_probability = 0.5) { 90 | return absl::Bernoulli(gen_, true_probability); 91 | } 92 | 93 | // Returns a uniformly chosen victim index between 0 and `size` (exclusive). 94 | size_t GetRandomVictimIndex(const size_t size) { 95 | return absl::Uniform(gen_, 0u, size); 96 | } 97 | 98 | // Returns a uniformly chosen victim index between 0 and `slots_per_bucket_` 99 | // (exclusive). 100 | size_t GetRandomVictimIndex() { 101 | return GetRandomVictimIndex(slots_per_bucket_); 102 | } 103 | 104 | // Returns the number of items in bucket `bucket_idx` for which this bucket is 105 | // their secondary bucket. 106 | size_t GetNumSecondaryItems(const size_t bucket_idx) const; 107 | 108 | // Finds `victim_idx` in the set of primary or secondary items (depending on 109 | // whether `kick_secondary` is set) in both buckets (`primary_bucket_idx` and 110 | // `secondary_bucket_idx`) and sets the output params `victim_bucket_idx` & 111 | // `idx_within_victim_bucket` accordingly. 
112 | void FindVictim(const size_t victim_idx, const size_t primary_bucket_idx, 113 | const size_t secondary_bucket_idx, const bool kick_secondary, 114 | size_t* victim_bucket_idx, 115 | size_t* idx_within_victim_bucket) const; 116 | 117 | // Swaps `value` with value at `bucket->slots_[victim_idx]` and returns the 118 | // victim. 119 | CuckooValue SwapWithValue(Bucket* bucket, const size_t victim_idx, 120 | const CuckooValue& value); 121 | 122 | // Swaps `value` with a random value inside its primary or secondary bucket. 123 | // May only be used when both buckets are full (i.e. all `slots_per_bucket_` 124 | // slots are occupied). 125 | CuckooValue SwapWithRandomValue(const CuckooValue& value, 126 | size_t* victim_bucket_idx); 127 | 128 | // Performs a single "kick". Returns true if the kicked value could be 129 | // inserted into its alternative bucket. Otherwise returns false and sets 130 | // `*value` to the victim, which becomes the new in-flight value. 130 | bool InsertValueWithKick(CuckooValue* value); 131 | 132 | // Tries to insert `value` into `buckets_`. Does not check for duplicates 133 | // (i.e., would insert duplicate fingerprints in the unlikely event of a 134 | // 64-bit hash collision). Duplicates either need to be removed prior to 135 | // calling this method or when determining minimal per-bucket fingerprint 136 | // lengths. Returns false if insertion failed (i.e., we exceeded 137 | // `max_kicks_` kicks). 138 | bool InsertValueWithKicking(const CuckooValue& value); 139 | 140 | absl::BitGen gen_; 141 | const size_t slots_per_bucket_; 142 | absl::Span<Bucket> buckets_; 143 | 144 | // Used to skew kicking towards items that reside in their secondary bucket. 145 | const bool skew_kicking_; 146 | // A factor that defines how much more likely it is for a value that resides 147 | // in its secondary bucket to be kicked (compared to a value that resides in 148 | // its primary bucket). Only used if `skew_kicking_` is set. 149 | const double kick_skew_factor_; 150 | // Maximum number of kicks allowed before an insertion fails. 
151 | const size_t max_kicks_; 152 | 153 | // ** Statistics. 154 | // Maximum number of kicks observed. 155 | size_t max_kicks_observed_; 156 | // Number of successfully inserted items. 157 | size_t successful_inserts_; 158 | }; 159 | 160 | } // namespace ci 161 | 162 | #endif // CUCKOO_INDEX_CUCKOO_KICKER_H_ 163 | -------------------------------------------------------------------------------- /cuckoo_kicker_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_kicker_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "cuckoo_kicker.h" 20 | 21 | #include 22 | 23 | #include "absl/types/span.h" 24 | #include "cuckoo_utils.h" 25 | #include "gtest/gtest.h" 26 | 27 | namespace ci { 28 | 29 | constexpr size_t kNumValues = 1e5; 30 | constexpr size_t kSlotsPerBucket = 2; 31 | 32 | constexpr size_t kMaxNumRetries = 10; 33 | 34 | // **** Helper methods **** 35 | 36 | std::vector CreateValues(const size_t num_values) { 37 | std::vector values(num_values); 38 | for (size_t i = 0; i < num_values; ++i) values[i] = i; 39 | return values; 40 | } 41 | 42 | // Returns true if all `values` could be found in `buckets`. 
Sets 43 | // `in_primary_ratio` according to the ratio of items residing in primary 44 | // buckets. 45 | bool LookupValuesInBuckets(const std::vector& buckets, 46 | const std::vector& values, 47 | double* in_primary_ratio) { 48 | size_t num_in_primary = 0; 49 | for (const int value : values) { 50 | CuckooValue cuckoo_value(value, buckets.size()); 51 | bool in_primary_flag; 52 | if (!LookupValueInBuckets(buckets, cuckoo_value, &in_primary_flag)) 53 | return false; 54 | num_in_primary += in_primary_flag; 55 | } 56 | *in_primary_ratio = static_cast(num_in_primary) / values.size(); 57 | return true; 58 | } 59 | 60 | // **** Test cases **** 61 | 62 | // Starts with the minimum number of buckets required for `kSlotsPerBucket` 63 | // slots and `values.size()`. If construction fails, increases the number of 64 | // buckets and retries (one additional bucket at a time). 65 | std::vector DistributeValuesByKicking(const std::vector& values, 66 | const bool skew_kicking) { 67 | size_t num_buckets = GetMinNumBuckets(kNumValues, kSlotsPerBucket); 68 | 69 | for (size_t i = 0; i < kMaxNumRetries; ++i) { 70 | std::vector buckets(num_buckets, Bucket(kSlotsPerBucket)); 71 | std::vector cuckoo_values; 72 | cuckoo_values.reserve(values.size()); 73 | for (const int value : values) 74 | cuckoo_values.push_back(CuckooValue(value, num_buckets)); 75 | CuckooKicker kicker(kSlotsPerBucket, absl::MakeSpan(buckets), skew_kicking); 76 | if (kicker.InsertValues(cuckoo_values)) return buckets; 77 | ++num_buckets; 78 | } 79 | 80 | std::cerr << "Exceeded kMaxNumRetries: " << kMaxNumRetries << std::endl; 81 | exit(EXIT_FAILURE); 82 | } 83 | 84 | TEST(CuckooKickerTest, InsertValues) { 85 | const std::vector values = CreateValues(kNumValues); 86 | 87 | // Distribute values by kicking. 88 | const std::vector buckets = 89 | DistributeValuesByKicking(values, /*skew_kicking=*/false); 90 | 91 | // Lookup values. 
92 | double in_primary_ratio; 93 | ASSERT_TRUE(LookupValuesInBuckets(buckets, values, &in_primary_ratio)); 94 | ASSERT_GT(in_primary_ratio, 0.0); 95 | } 96 | 97 | TEST(CuckooKickerTest, InsertValuesWithSkewedKicking) { 98 | const std::vector values = CreateValues(kNumValues); 99 | 100 | // Distribute values by kicking. 101 | const std::vector buckets = 102 | DistributeValuesByKicking(values, /*skew_kicking=*/true); 103 | 104 | // Lookup values. 105 | double in_primary_ratio; 106 | ASSERT_TRUE(LookupValuesInBuckets(buckets, values, &in_primary_ratio)); 107 | ASSERT_GT(in_primary_ratio, 0.6); 108 | } 109 | 110 | TEST(CuckooKickerTest, CheckForDeterministicBehavior) { 111 | const std::vector values = CreateValues(kNumValues); 112 | 113 | // Distribute values twice using skewed kicker. 114 | const std::vector buckets = 115 | DistributeValuesByKicking(values, /*skew_kicking=*/true); 116 | const std::vector buckets2 = 117 | DistributeValuesByKicking(values, /*skew_kicking=*/true); 118 | 119 | // Check that both bucket vectors contain the same CuckooValues in `slots_` 120 | // and `kicked_`. 
121 | ASSERT_EQ(buckets.size(), buckets2.size()); 122 | for (size_t i = 0; i < buckets.size(); ++i) { 123 | ASSERT_EQ(buckets[i].slots_.size(), buckets2[i].slots_.size()); 124 | ASSERT_EQ(buckets[i].kicked_.size(), buckets2[i].kicked_.size()); 125 | for (size_t j = 0; j < buckets[i].slots_.size(); j++) { 126 | EXPECT_EQ(buckets[i].slots_[j].ToString(), 127 | buckets2[i].slots_[j].ToString()); 128 | } 129 | for (size_t j = 0; j < buckets[i].kicked_.size(); j++) { 130 | EXPECT_EQ(buckets[i].kicked_[j].ToString(), 131 | buckets2[i].kicked_[j].ToString()); 132 | } 133 | } 134 | } 135 | 136 | } // namespace ci 137 | -------------------------------------------------------------------------------- /cuckoo_utils.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_utils.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "cuckoo_utils.h" 20 | 21 | #include 22 | #include 23 | #include 24 | #include 25 | #include 26 | 27 | #include "absl/container/flat_hash_set.h" 28 | #include "common/bit_packing.h" 29 | #include "common/byte_coding.h" 30 | 31 | namespace ci { 32 | 33 | size_t GetMinNumBuckets(const size_t num_values, const size_t slots_per_bucket, 34 | double max_load_factor) { 35 | assert(max_load_factor > 0.0 && max_load_factor < 1.0); 36 | const size_t min_num_buckets = static_cast(std::ceil( 37 | (static_cast(num_values) / max_load_factor) / slots_per_bucket)); 38 | return min_num_buckets; 39 | } 40 | 41 | size_t GetMinNumBuckets(const size_t num_values, 42 | const size_t slots_per_bucket) { 43 | if (slots_per_bucket == 1) { 44 | return GetMinNumBuckets(num_values, slots_per_bucket, 45 | kMaxLoadFactor1SlotsPerBucket); 46 | } else if (slots_per_bucket == 2) { 47 | return GetMinNumBuckets(num_values, slots_per_bucket, 48 | kMaxLoadFactor2SlotsPerBucket); 49 | } else if (slots_per_bucket == 4) { 50 | return GetMinNumBuckets(num_values, slots_per_bucket, 51 | kMaxLoadFactor4SlotsPerBucket); 52 | } else if (slots_per_bucket == 8) { 53 | return GetMinNumBuckets(num_values, slots_per_bucket, 54 | kMaxLoadFactor8SlotsPerBucket); 55 | } 56 | std::cerr << "No default max load factor for " << slots_per_bucket 57 | << " slots per bucket." 
<< std::endl; 58 | exit(EXIT_FAILURE); 59 | } 60 | 61 | size_t GetMinCollisionFreeFingerprintLength( 62 | const std::vector<uint64_t>& fingerprints, const bool use_prefix_bits) { 63 | if (fingerprints.size() < 2) return 0; 64 | int num_bits = 1; 65 | for (; num_bits <= 64; ++num_bits) { 66 | absl::flat_hash_set<uint64_t> unique_fingerprints; 67 | bool success = true; 68 | for (const uint64_t fp : fingerprints) { 69 | // Extract `num_bits` prefix or suffix bits. 70 | const uint64_t fp_bits = use_prefix_bits 71 | ? GetFingerprintPrefix(fp, num_bits) 72 | : GetFingerprintSuffix(fp, num_bits); 73 | 74 | if (auto [_, was_inserted] = unique_fingerprints.insert(fp_bits); 75 | !was_inserted) { 76 | // Detected conflict. Try next higher number of bits. 77 | success = false; 78 | break; 79 | } 80 | } 81 | if (success) break; 82 | } 83 | if (num_bits == 65) { 84 | std::cerr << "Exhausted all 64 bits and still have collisions." 85 | << std::endl; 86 | exit(EXIT_FAILURE); 87 | } 88 | return num_bits; 89 | } 90 | 91 | size_t GetMinCollisionFreeFingerprintPrefixOrSuffix( 92 | const std::vector<uint64_t>& fingerprints, bool* use_prefix_bits) { 93 | // Determine minimum number of bits starting with lowest bit. 94 | const int num_suffix_bits = GetMinCollisionFreeFingerprintLength( 95 | fingerprints, /*use_prefix_bits=*/false); 96 | 97 | if (num_suffix_bits <= 1) { 98 | // No need to check prefix bits. 99 | *use_prefix_bits = false; 100 | return num_suffix_bits; 101 | } 102 | 103 | // Determine minimum number of bits starting with highest bit. 104 | const int num_prefix_bits = GetMinCollisionFreeFingerprintLength( 105 | fingerprints, /*use_prefix_bits=*/true); 106 | 107 | if (num_suffix_bits <= num_prefix_bits) { 108 | // Prefer using suffix bits. 
109 | *use_prefix_bits = false; 110 | return num_suffix_bits; 111 | } 112 | 113 | *use_prefix_bits = true; 114 | return num_prefix_bits; 115 | } 116 | 117 | bool CheckWhetherAllBucketsOnlyContainSameSizeFingerprints( 118 | const std::vector& fingerprints, 119 | const size_t slots_per_bucket) { 120 | for (size_t i = 0; i < fingerprints.size(); i += slots_per_bucket) { 121 | bool found_active = false; 122 | size_t num_bits; 123 | for (size_t j = 0; j < slots_per_bucket; ++j) { 124 | const ci::Fingerprint& fp = fingerprints[i + j]; 125 | if (!fp.active) continue; 126 | if (!found_active) { 127 | num_bits = fp.num_bits; 128 | found_active = true; 129 | continue; 130 | } 131 | if (fp.num_bits != num_bits) return false; 132 | } 133 | } 134 | return true; 135 | } 136 | 137 | bool Bucket::InsertValue(const CuckooValue& value) { 138 | if (slots_.size() < num_slots_) { 139 | slots_.push_back(value); 140 | return true; 141 | } 142 | return false; 143 | } 144 | 145 | namespace { 146 | 147 | bool ContainsValue(absl::Span values, 148 | const CuckooValue& value) { 149 | for (const CuckooValue& val : values) { 150 | if (val.orig_value == value.orig_value) return true; 151 | } 152 | return false; 153 | } 154 | 155 | } // namespace 156 | 157 | bool Bucket::ContainsValue(const CuckooValue& value) const { 158 | return ci::ContainsValue(slots_, value); 159 | } 160 | 161 | bool LookupValueInBuckets(absl::Span buckets, 162 | const CuckooValue value, bool* in_primary) { 163 | if (buckets[value.primary_bucket].ContainsValue(value)) { 164 | *in_primary = true; 165 | return true; 166 | } 167 | if (buckets[value.secondary_bucket].ContainsValue(value)) { 168 | *in_primary = false; 169 | return true; 170 | } 171 | return false; 172 | } 173 | 174 | void FillKicked(absl::Span values, 175 | absl::Span buckets) { 176 | for (const CuckooValue& value : values) { 177 | bool in_primary; 178 | LookupValueInBuckets(buckets, value, &in_primary); 179 | std::vector& kicked = 
buckets[value.primary_bucket].kicked_; 180 | // If the value isn't in its primary bucket and not in `kicked`, add it. 181 | if (!in_primary && !ContainsValue(kicked, value)) kicked.push_back(value); 182 | } 183 | } 184 | 185 | size_t GetRank(const Bitmap64& bitmap, const size_t idx) { 186 | assert(idx <= bitmap.bits()); 187 | return bitmap.GetOnesCountBeforeLimit(/*limit=*/idx); 188 | } 189 | 190 | namespace { 191 | // Helper function that implements SelectOne() or SelectZero() depending on 192 | // whether `count_ones` is set. 193 | bool Select(const Bitmap64& bitmap, const size_t ith, const bool count_ones, 194 | size_t* pos) { 195 | size_t count = 0; 196 | for (size_t i = 0; i < bitmap.bits(); ++i) { 197 | if (bitmap.Get(i) == count_ones) { 198 | if (count == ith) { 199 | *pos = i; 200 | return true; 201 | } 202 | ++count; 203 | } 204 | } 205 | return false; 206 | } 207 | } // namespace 208 | 209 | bool SelectOne(const Bitmap64& bitmap, const size_t ith, size_t* pos) { 210 | return Select(bitmap, ith, /*count_ones=*/true, pos); 211 | } 212 | 213 | bool SelectZero(const Bitmap64& bitmap, const size_t ith, size_t* pos) { 214 | return Select(bitmap, ith, /*count_ones=*/false, pos); 215 | } 216 | 217 | Bitmap64Ptr GetEmptyBucketsBitmap(const Bitmap64& empty_slots_bitmap, 218 | const size_t slots_per_bucket) { 219 | assert(empty_slots_bitmap.bits() % slots_per_bucket == 0); 220 | Bitmap64Ptr empty_buckets_bitmap = absl::make_unique( 221 | /*size=*/empty_slots_bitmap.bits() / slots_per_bucket); 222 | for (size_t i = 0; i < empty_slots_bitmap.bits(); i += slots_per_bucket) { 223 | bool empty_bucket = true; 224 | for (size_t j = 0; j < slots_per_bucket; ++j) { 225 | if (!empty_slots_bitmap.Get(i + j)) { 226 | empty_bucket = false; 227 | break; 228 | } 229 | } 230 | if (empty_bucket) empty_buckets_bitmap->Set(i / slots_per_bucket, true); 231 | } 232 | return empty_buckets_bitmap; 233 | } 234 | 235 | } // namespace ci 236 | 
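The fingerprint-length search in `cuckoo_utils.cc` above can be tried out in isolation. The following is a minimal stand-alone sketch, not the repository's code: the names `SuffixBits`, `PrefixBits`, `MinCollisionFreeBits` and `MinCollisionFreePrefixOrSuffix` are hypothetical, `std::unordered_set` is swapped in for `absl::flat_hash_set` so that nothing beyond the standard library is needed, and a sentinel of 65 is returned where the original prints an error and exits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keeps the lowest `num_bits` bits of `fp`.
inline uint64_t SuffixBits(uint64_t fp, size_t num_bits) {
  return num_bits >= 64 ? fp : fp & ((1ULL << num_bits) - 1ULL);
}

// Hypothetical helper: keeps the highest `num_bits` bits of `fp`.
inline uint64_t PrefixBits(uint64_t fp, size_t num_bits) {
  if (num_bits == 0) return 0;
  return num_bits >= 64 ? fp : fp >> (64 - num_bits);
}

// Returns the smallest bit count in [1, 64] under which all fingerprints are
// distinct, 0 if there are fewer than two fingerprints, and 65 (an assumed
// sentinel for this sketch) if two fingerprints are fully identical.
inline size_t MinCollisionFreeBits(const std::vector<uint64_t>& fps,
                                   bool use_prefix_bits) {
  if (fps.size() < 2) return 0;
  for (size_t num_bits = 1; num_bits <= 64; ++num_bits) {
    std::unordered_set<uint64_t> seen;
    bool collision = false;
    for (const uint64_t fp : fps) {
      const uint64_t bits = use_prefix_bits ? PrefixBits(fp, num_bits)
                                            : SuffixBits(fp, num_bits);
      // insert().second is false if `bits` was already present.
      if (!seen.insert(bits).second) {
        collision = true;
        break;
      }
    }
    if (!collision) return num_bits;
  }
  return 65;
}

// Tries suffix bits first and switches to prefix bits only if that yields
// strictly fewer bits, mirroring the preference described in the source.
inline size_t MinCollisionFreePrefixOrSuffix(const std::vector<uint64_t>& fps,
                                             bool* use_prefix_bits) {
  assert(use_prefix_bits != nullptr);
  const size_t suffix = MinCollisionFreeBits(fps, /*use_prefix_bits=*/false);
  *use_prefix_bits = false;
  if (suffix <= 1) return suffix;  // Cannot be improved upon.
  const size_t prefix = MinCollisionFreeBits(fps, /*use_prefix_bits=*/true);
  if (suffix <= prefix) return suffix;
  *use_prefix_bits = true;
  return prefix;
}
```

For the fingerprints `{1, 3, 5}` the lowest three bits already tell them apart, so suffix bits win. For `{0x8000000000000000, 0}` the two values differ only in the top bit, so one prefix bit suffices where 64 suffix bits would be needed. The cost is one hash-set pass per candidate length, which is acceptable here because each call covers only the handful of fingerprints in a single bucket.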
-------------------------------------------------------------------------------- /cuckoo_utils.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: cuckoo_utils.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_CUCKOO_UTILS_H_ 20 | #define CUCKOO_INDEX_CUCKOO_UTILS_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | #include 26 | 27 | #include "absl/hash/internal/city.h" 28 | #include "absl/memory/memory.h" 29 | #include "absl/strings/str_format.h" 30 | #include "absl/strings/string_view.h" 31 | #include "common/bit_packing.h" 32 | #include "common/bitmap.h" 33 | 34 | namespace ci { 35 | 36 | // The seeds for the primary & secondary buckets and the fingerprint. 37 | constexpr uint64_t kSeedPrimaryBucket = 17; 38 | constexpr uint64_t kSeedSecondaryBucket = 23; 39 | constexpr uint64_t kSeedFingerprint = 42; 40 | 41 | // Maximum load factors (in terms of occupied vs. all slots). Obtained from the 42 | // Cuckoo filter paper: 43 | // https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf. 44 | // Since we don't use partial-key Cuckoo hashing, we could theoretically achieve 45 | // slightly higher load factors. 
46 | // However, empirical testing showed that this is not the case (at least, an 47 | // extra percentage point is not possible with the current kicking 48 | // implementation, see `cuckoo_kicker_test.cc` for test code). Using a victim 49 | // stash might allow for slightly higher load factors. 50 | constexpr inline double kMaxLoadFactor1SlotsPerBucket = 0.49; 51 | constexpr inline double kMaxLoadFactor2SlotsPerBucket = 0.84; 52 | constexpr inline double kMaxLoadFactor4SlotsPerBucket = 0.95; 53 | constexpr inline double kMaxLoadFactor8SlotsPerBucket = 0.98; 54 | 55 | // Returns the minimum number of buckets required to accommodate `num_values` 56 | // values with `slots_per_bucket` slots per bucket under a load factor of 57 | // `max_load_factor`. 58 | size_t GetMinNumBuckets(const size_t num_values, const size_t slots_per_bucket, 59 | double max_load_factor); 60 | 61 | // Uses empirically obtained max load factors. 62 | size_t GetMinNumBuckets(const size_t num_values, const size_t slots_per_bucket); 63 | 64 | struct Fingerprint { 65 | // Indicates whether the corresponding slot in the Cuckoo table is active 66 | // (i.e., filled with a fingerprint). 67 | bool active; 68 | // Number of significant bits (counting from least significant). 69 | size_t num_bits; 70 | // Variable-sized fingerprint. May use up to 64 bits. Non-significant bits 71 | // have to be cleared. 72 | uint64_t fingerprint; 73 | }; 74 | 75 | // Returns a mask which has the lowest `num_bits` set. 76 | inline uint64_t FingerprintSuffixMask(const size_t num_bits) { 77 | return num_bits >= 64 ? std::numeric_limits<uint64_t>::max() 78 | : (1ULL << num_bits) - 1ULL; 79 | } 80 | 81 | // Returns `num_bits` suffix (lowest) bits. 82 | inline uint64_t GetFingerprintSuffix(const uint64_t fingerprint, 83 | const size_t num_bits) { 84 | return fingerprint & FingerprintSuffixMask(num_bits); 85 | } 86 | 87 | // Returns `num_bits` prefix (highest) bits. 
88 | inline uint64_t GetFingerprintPrefix(const uint64_t fingerprint, 89 | const size_t num_bits) { 90 | if (num_bits == 0) return 0; 91 | return num_bits >= 64 ? fingerprint : fingerprint >> (64 - num_bits); 92 | } 93 | 94 | // Determines the minimum number of bits to make `fingerprints` collision free. 95 | // Uses either prefix or suffix bits (depending on `use_prefix_bits`). 96 | size_t GetMinCollisionFreeFingerprintLength( 97 | const std::vector& fingerprints, const bool use_prefix_bits); 98 | 99 | // Convenience method that tries both prefix and suffix bits to determine the 100 | // minimum number of bits to make `fingerprints` collision free. Sets 101 | // `use_prefix_bits` depending on whether it used prefix or suffix bits (prefers 102 | // suffix bits and only chooses prefix bits if it would result in fewer bits). 103 | size_t GetMinCollisionFreeFingerprintPrefixOrSuffix( 104 | const std::vector& fingerprints, bool* use_prefix_bits); 105 | 106 | // Returns true if all buckets only contain equally long fingerprints (i.e., all 107 | // non-empty slots of a bucket contain fingerprints that have the same length). 108 | bool CheckWhetherAllBucketsOnlyContainSameSizeFingerprints( 109 | const std::vector& fingerprints, 110 | const size_t slots_per_bucket); 111 | 112 | // Representation of a value as its two buckets and fingerprint. 
113 | struct CuckooValue { 114 | CuckooValue(int value, size_t num_buckets) { 115 | orig_value = value; 116 | 117 | // *** POSSIBLY CHOOSE ANOTHER HASHING ALGORITHM *** 118 | auto value_data = reinterpret_cast<const char*>(&value); 119 | primary_bucket = absl::hash_internal::CityHash64WithSeed( 120 | value_data, sizeof(value), kSeedPrimaryBucket) % 121 | num_buckets; 122 | secondary_bucket = absl::hash_internal::CityHash64WithSeed( 123 | value_data, sizeof(value), kSeedSecondaryBucket) % 124 | num_buckets; 125 | fingerprint = absl::hash_internal::CityHash64WithSeed( 126 | value_data, sizeof(value), kSeedFingerprint); 127 | } 128 | 129 | std::string ToString() const { 130 | return absl::StrFormat("{v=%d fp=%llx (%llu | %llu)}", orig_value, 131 | fingerprint, primary_bucket, secondary_bucket); 132 | } 133 | 134 | int orig_value; 135 | size_t primary_bucket; 136 | size_t secondary_bucket; 137 | uint64_t fingerprint; 138 | }; 139 | 140 | // Class for temporary usage when assigning values to buckets. In particular, 141 | // it keeps a list of values which could *not* be assigned to this bucket even 142 | // though it was their primary choice. 143 | class Bucket { 144 | public: 145 | explicit Bucket(size_t num_slots) : num_slots_(num_slots) {} 146 | 147 | // Returns false if bucket is full (i.e. all `num_slots_` slots are occupied). 148 | bool InsertValue(const CuckooValue& value); 149 | 150 | // Checks whether bucket contains `value`. 151 | bool ContainsValue(const CuckooValue& value) const; 152 | 153 | size_t num_slots() const { return num_slots_; } 154 | 155 | // The actually assigned values -- up to `num_slots_` entries. 156 | std::vector<CuckooValue> slots_; 157 | // The values which were kicked out of the bucket even though it was their 158 | // primary choice. 159 | std::vector<CuckooValue> kicked_; 160 | 161 | private: 162 | size_t num_slots_; 163 | }; 164 | 165 | // Searches for `value` in its primary and secondary bucket.
Returns true if 166 | // value was found and sets `in_primary` to true if `value` resides in its 167 | // primary bucket, and to false otherwise. 168 | bool LookupValueInBuckets(absl::Span buckets, 169 | const CuckooValue value, bool* in_primary); 170 | 171 | // Goes over the given `values` and CHECKs for each value that it has been added 172 | // to its primary or secondary bucket. If it was added to the secondary bucket, 173 | // makes sure that the `kicked_` vector in its primary bucket contains the value 174 | // as expected. 175 | void FillKicked(absl::Span values, 176 | absl::Span buckets); 177 | 178 | // Returns the rank of `idx` in `bitmap`. 179 | size_t GetRank(const Bitmap64& bitmap, const size_t idx); 180 | 181 | // Sets `pos` according to the `ith` one-bit in `bitmap`. For example, 182 | // SelectOne(bitmap=0010, ith=1, &pos) would set `pos` to the zero-based position 183 | // of the 1st one-bit (2 in this case). Returns true if `ith` one-bit was 184 | // found and false otherwise. 185 | bool SelectOne(const Bitmap64& bitmap, const size_t ith, size_t* pos); 186 | 187 | // Sets `pos` according to the `ith` zero-bit in `bitmap`. Returns true if `ith` 188 | // zero-bit was found and false otherwise. 189 | bool SelectZero(const Bitmap64& bitmap, const size_t ith, size_t* pos); 190 | 191 | // Creates an empty buckets bitmap from an `empty_slots_bitmap`. 192 | Bitmap64Ptr GetEmptyBucketsBitmap(const Bitmap64& empty_slots_bitmap, 193 | const size_t slots_per_bucket); 194 | 195 | } // namespace ci 196 | 197 | #endif // CUCKOO_INDEX_CUCKOO_UTILS_H_ 198 | -------------------------------------------------------------------------------- /data.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License.
5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: data.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include 20 | 21 | #include "data.h" 22 | 23 | namespace ci { 24 | 25 | const int Column::kIntNullSentinel; 26 | const char* const Column::kStringNullSentinel; 27 | 28 | std::unique_ptr GenerateUniformData(const size_t generate_num_values, 29 | const size_t num_unique_values) { 30 | std::mt19937 gen(42); 31 | 32 | // Generate unique values. 33 | std::unordered_set set; 34 | set.reserve(num_unique_values); 35 | std::uniform_int_distribution 36 | d_int(std::numeric_limits::min(), std::numeric_limits::max()); 37 | while (set.size() < num_unique_values) { 38 | set.insert(d_int(gen)); 39 | } 40 | // Copy to vector. 41 | std::vector unique_values(set.begin(), set.end()); 42 | 43 | // Draw each unique value once to ensure `num_unique_values`. Without 44 | // this, we might miss out on certain unique values. 45 | std::vector values(unique_values.begin(), unique_values.end()); 46 | 47 | // Fill up remaining values by drawing random `unique_values`. 48 | values.reserve(generate_num_values); 49 | std::uniform_int_distribution d_unique(0, unique_values.size() - 1); 50 | while (values.size() < generate_num_values) { 51 | values.push_back(unique_values[d_unique(gen)]); 52 | } 53 | 54 | // Shuffle resulting vector to avoid skew. 55 | std::shuffle(values.begin(), values.end(), gen); 56 | 57 | // Create column & return table. 
58 | ColumnPtr column = ci::Column::IntColumn(absl::StrCat("uni_", 59 | generate_num_values 60 | / 1000, 61 | "K_val_", 62 | num_unique_values, 63 | "_uniq"), values); 64 | std::vector> columns; 65 | columns.push_back(std::move(column)); 66 | return ci::Table::Create(/*name=*/"", std::move(columns)); 67 | } 68 | 69 | } // namespace ci 70 | -------------------------------------------------------------------------------- /data_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: data_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "data.h" 20 | 21 | #include 22 | #include 23 | 24 | #include "gmock/gmock.h" 25 | #include "gtest/gtest.h" 26 | 27 | namespace ci { 28 | namespace { 29 | 30 | using ::testing::ElementsAreArray; 31 | 32 | // Checks that `ValueAt()` method of a column created with the given data type 33 | // and values produces expected results. 
34 | template 35 | void CheckValueAt(DataType data_type, const std::vector expected_values) {} 36 | 37 | TEST(ColumnTest, ValueAtReturnsValuesForIntColumns) { 38 | std::vector expected_values = {"42", "13", "1", "42"}; 39 | auto column = 40 | absl::make_unique("column_name", DataType::INT, expected_values); 41 | 42 | std::vector actual_values; 43 | actual_values.reserve(column->num_rows()); 44 | for (size_t i = 0; i < column->num_rows(); ++i) { 45 | actual_values.push_back(column->ValueAt(i)); 46 | } 47 | 48 | EXPECT_THAT(actual_values, ElementsAreArray(expected_values)); 49 | } 50 | 51 | TEST(ColumnTest, ValueAtReturnsValuesForStringColumns) { 52 | std::vector expected_values = {"US", "CH", "US", "CH"}; 53 | auto column = absl::make_unique("column_name", DataType::STRING, 54 | expected_values); 55 | 56 | std::vector actual_values; 57 | actual_values.reserve(column->num_rows()); 58 | for (size_t i = 0; i < column->num_rows(); ++i) { 59 | actual_values.push_back(column->ValueAt(i)); 60 | } 61 | 62 | EXPECT_THAT(actual_values, ElementsAreArray(expected_values)); 63 | } 64 | 65 | TEST(ColumnTest, CompressInts) { 66 | auto column = 67 | absl::make_unique("column_name", DataType::INT, 68 | std::vector{"0", "1", "2", "3"}); 69 | 70 | const size_t compressed_size_1_stripe = 71 | column->compressed_size_bytes(/*num_rows_per_stripe=*/4); 72 | EXPECT_GT(compressed_size_1_stripe, 20); 73 | EXPECT_LT(compressed_size_1_stripe, 30); 74 | 75 | const size_t compressed_size_4_stripes = 76 | column->compressed_size_bytes(/*num_rows_per_stripe=*/1); 77 | EXPECT_GT(compressed_size_4_stripes, 50); 78 | EXPECT_LT(compressed_size_4_stripes, 60); 79 | } 80 | 81 | TEST(ColumnTest, CompressStrings) { 82 | auto column = absl::make_unique( 83 | "column_name", DataType::STRING, 84 | std::vector{"DE", "US", "IT", "FR"}); 85 | 86 | const size_t compressed_size_1_stripe = 87 | column->compressed_size_bytes(/*num_rows_per_stripe=*/4); 88 | EXPECT_GT(compressed_size_1_stripe, 20); 89 | 
EXPECT_LT(compressed_size_1_stripe, 30); 90 | 91 | const size_t compressed_size_4_stripes = 92 | column->compressed_size_bytes(/*num_rows_per_stripe=*/1); 93 | EXPECT_GT(compressed_size_4_stripes, 40); 94 | EXPECT_LT(compressed_size_4_stripes, 50); 95 | } 96 | 97 | TEST(ColumnTest, Reorder) { 98 | auto column = 99 | absl::make_unique("column_name", DataType::INT, 100 | std::vector{"0", "1", "2", "3"}); 101 | 102 | const std::vector indexes = {2, 1, 3, 0}; 103 | column->Reorder(indexes); 104 | for (size_t i = 0; i < column->num_rows(); ++i) { 105 | EXPECT_EQ((*column)[i], indexes[i]); 106 | } 107 | } 108 | 109 | TEST(DataTest, SortWithCardinalityKey) { 110 | std::vector> columns; 111 | columns.push_back(absl::make_unique( 112 | "customer_id", DataType::INT, 113 | std::vector({"42", "13", "1", "42"}))); 114 | columns.push_back(absl::make_unique( 115 | "country", DataType::STRING, 116 | std::vector({"US", "CH", "US", "CH"}))); 117 | std::unique_ptr
<Table> table = Table::Create("test", std::move(columns)); 118 | 119 | EXPECT_EQ(table->ToCsvString(), 120 | "42,US\n" 121 | "13,CH\n" 122 | "1,US\n" 123 | "42,CH\n"); 124 | 125 | table->SortWithCardinalityKey(); 126 | 127 | // Expect that the table is sorted by "country" first as it has lower 128 | // cardinality. 129 | EXPECT_EQ(table->ToCsvString(), 130 | "13,CH\n" 131 | "42,CH\n" 132 | "1,US\n" 133 | "42,US\n"); 134 | } 135 | 136 | } // namespace 137 | } // namespace ci 138 | -------------------------------------------------------------------------------- /docs/contributing.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We're currently not accepting code contributions. 4 | 5 | ## Contributor License Agreement 6 | 7 | Contributions to this project must be accompanied by a Contributor License 8 | Agreement. You (or your employer) retain the copyright to your contribution; 9 | this simply gives us permission to use and redistribute your contributions as 10 | part of the project. Head over to to see 11 | your current agreements on file or to sign a new one. 12 | 13 | You generally only need to submit a CLA once, so if you've already submitted one 14 | (even if it was for a different project), you probably don't need to do it 15 | again. 16 | 17 | ## Code reviews 18 | 19 | All submissions, including submissions by project members, require review. We 20 | use GitHub pull requests for this purpose. Consult 21 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more 22 | information on using pull requests. 23 | 24 | ## Community Guidelines 25 | 26 | This project follows 27 | [Google's Open Source Community Guidelines](https://opensource.google/conduct/).
28 | -------------------------------------------------------------------------------- /evaluate.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: evaluate.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include 20 | #include 21 | #include 22 | #include 23 | 24 | #include "absl/flags/flag.h" 25 | #include "absl/flags/parse.h" 26 | #include "absl/memory/memory.h" 27 | #include "absl/strings/str_cat.h" 28 | #include "absl/strings/str_format.h" 29 | #include "absl/strings/string_view.h" 30 | #include "cuckoo_index.h" 31 | #include "cuckoo_utils.h" 32 | #include "data.h" 33 | #include "evaluation.pb.h" 34 | #include "evaluation_utils.h" 35 | #include "evaluator.h" 36 | #include "index_structure.h" 37 | #include "per_stripe_bloom.h" 38 | #include "per_stripe_xor.h" 39 | #include "zone_map.h" 40 | 41 | ABSL_FLAG(int, generate_num_values, 100000, 42 | "Number of values to generate (number of rows)."); 43 | ABSL_FLAG(int, num_unique_values, 1000, 44 | "Number of unique values to generate (cardinality)."); 45 | ABSL_FLAG(std::string, input_csv_path, "", "Path to the input CSV file."); 46 | ABSL_FLAG(std::string, output_csv_path, "", 47 | "Path to write the 
output CSV file to."); 48 | ABSL_FLAG(std::vector<std::string>, columns_to_test, {"company_name"}, 49 | "Comma-separated list of columns to test, e.g. " 50 | "'company_name,country_code'."); 51 | ABSL_FLAG(std::vector<std::string>, num_rows_per_stripe_to_test, {"10000"}, 52 | "Number of rows per stripe. Defaults to 10,000."); 53 | ABSL_FLAG(int, num_lookups, 1000, "Number of lookups. Defaults to 1,000."); 54 | ABSL_FLAG(std::vector<std::string>, test_cases, {"positive_uniform"}, 55 | "Comma-separated list of test cases, e.g. " 56 | "'positive_uniform,positive_distinct'."); 57 | ABSL_FLAG(std::string, sorting, "NONE", 58 | "Sorting to apply to the data. Supported values: 'NONE', " 59 | "'BY_CARDINALITY' (sorts lexicographically, starting with columns " 60 | "with the lowest cardinality), 'RANDOM'"); 61 | 62 | namespace { 63 | static constexpr absl::string_view kNoSorting = "NONE"; 64 | static constexpr absl::string_view kByCardinalitySorting = "BY_CARDINALITY"; 65 | static constexpr absl::string_view kRandomSorting = "RANDOM"; 66 | static bool IsValidSorting(absl::string_view sorting) { 67 | static const auto* values = new absl::flat_hash_set<absl::string_view>( 68 | {kNoSorting, kByCardinalitySorting, kRandomSorting}); 69 | 70 | return values->contains(sorting); 71 | } 72 | } // namespace 73 | 74 | int main(int argc, char* argv[]) { 75 | absl::ParseCommandLine(argc, argv); 76 | 77 | const size_t generate_num_values = absl::GetFlag(FLAGS_generate_num_values); 78 | const size_t num_unique_values = absl::GetFlag(FLAGS_num_unique_values); 79 | const std::string input_csv_path = absl::GetFlag(FLAGS_input_csv_path); 80 | const std::string output_csv_path = absl::GetFlag(FLAGS_output_csv_path); 81 | if (output_csv_path.empty()) { 82 | std::cerr << "You must specify --output_csv_path" << std::endl; 83 | std::exit(EXIT_FAILURE); 84 | } 85 | const std::vector<std::string> columns_to_test = 86 | absl::GetFlag(FLAGS_columns_to_test); 87 | std::vector num_rows_per_stripe_to_test; 88 | for (const std::string& num_rows : 89 |
absl::GetFlag(FLAGS_num_rows_per_stripe_to_test)) { 90 | num_rows_per_stripe_to_test.push_back(std::stoull(num_rows)); 91 | } 92 | const size_t num_lookups = absl::GetFlag(FLAGS_num_lookups); 93 | const std::vector test_cases = absl::GetFlag(FLAGS_test_cases); 94 | const std::string sorting = absl::GetFlag(FLAGS_sorting); 95 | 96 | // Define data. 97 | std::unique_ptr table; 98 | if (input_csv_path.empty() || columns_to_test.empty()) { 99 | std::cerr 100 | << "[WARNING] --input_csv_path or --columns_to_test not specified, " 101 | "generating synthetic data." 102 | << std::endl; 103 | std::cout << "Generating " << generate_num_values << " values (" 104 | << static_cast(num_unique_values) / generate_num_values * 105 | 100 106 | << "% unique)..." << std::endl; 107 | table = ci::GenerateUniformData(generate_num_values, num_unique_values); 108 | } else { 109 | std::cout << "Loading data from file " << input_csv_path << "..." 110 | << std::endl; 111 | table = ci::Table::FromCsv(input_csv_path, columns_to_test); 112 | } 113 | 114 | // Potentially sort the data. 115 | if (!IsValidSorting(sorting)) { 116 | std::cerr << "Invalid sorting method: " << sorting << std::endl; 117 | std::exit(EXIT_FAILURE); 118 | } 119 | if (sorting == kByCardinalitySorting) { 120 | std::cerr << "Sorting the table according to column cardinality..." 121 | << std::endl; 122 | table->SortWithCardinalityKey(); 123 | } else if (sorting == kRandomSorting) { 124 | std::cerr << "Randomly shuffling the table..." << std::endl; 125 | table->Shuffle(); 126 | } 127 | 128 | // Define competitors. 
129 | std::vector> index_factories; 130 | index_factories.push_back(absl::make_unique( 131 | ci::CuckooAlgorithm::SKEWED_KICKING, ci::kMaxLoadFactor1SlotsPerBucket, 132 | /*scan_rate=*/0.001, /*slots_per_bucket=*/1, 133 | /*prefix_bits_optimization=*/false)); 134 | index_factories.push_back( 135 | absl::make_unique( 136 | absl::make_unique( 137 | ci::CuckooAlgorithm::SKEWED_KICKING, 138 | ci::kMaxLoadFactor1SlotsPerBucket, 139 | /*scan_rate=*/0.001, /*slots_per_bucket=*/1, 140 | /*prefix_bits_optimization=*/false))); 141 | index_factories.push_back(absl::make_unique( 142 | ci::CuckooAlgorithm::SKEWED_KICKING, ci::kMaxLoadFactor1SlotsPerBucket, 143 | /*scan_rate=*/0.01, /*slots_per_bucket=*/1, 144 | /*prefix_bits_optimization=*/false)); 145 | index_factories.push_back( 146 | absl::make_unique( 147 | absl::make_unique( 148 | ci::CuckooAlgorithm::SKEWED_KICKING, 149 | ci::kMaxLoadFactor1SlotsPerBucket, 150 | /*scan_rate=*/0.01, /*slots_per_bucket=*/1, 151 | /*prefix_bits_optimization=*/false))); 152 | index_factories.push_back(absl::make_unique()); 153 | index_factories.push_back(absl::make_unique()); 154 | 155 | // Evaluate competitors. 
156 | ci::Evaluator evaluator; 157 | std::vector results = evaluator.RunExperiments( 158 | std::move(index_factories), table, num_rows_per_stripe_to_test, 159 | num_lookups, test_cases); 160 | 161 | ci::WriteToCsv(output_csv_path, results); 162 | 163 | std::cout << std::endl 164 | << "** Result summary **" << std::endl 165 | << absl::StrFormat("%-50s %10s %10s %11s %11s", 166 | "field & index-type", "column", "index", 167 | "relative", "scan-rate") 168 | << std::endl; 169 | for (const auto& result : results) { 170 | const std::string column_compr_size = 171 | absl::StrCat(result.column_compressed_size_bytes()); 172 | const std::string index_compr_size = 173 | absl::StrCat(result.index_compressed_size_bytes()); 174 | double scan_rate = -1.0; 175 | for (const auto& test_case : result.test_cases()) { 176 | if (test_case.name() == "negative") 177 | scan_rate = 100.0 * test_case.num_false_positives() / 178 | (test_case.num_lookups() * result.num_stripes()); 179 | } 180 | std::cout << absl::StrFormat("%-50s %10s %10s %10.2f%% %10.2f%%", 181 | absl::StrCat(result.column_name(), ", ", 182 | result.index_structure(), ":"), 183 | column_compr_size, index_compr_size, 184 | 100.0 * result.index_compressed_size_bytes() / 185 | result.column_compressed_size_bytes(), 186 | scan_rate) 187 | << std::endl; 188 | } 189 | 190 | return 0; 191 | } 192 | -------------------------------------------------------------------------------- /evaluation.proto: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 
5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: evaluation.proto 17 | // ----------------------------------------------------------------------------- 18 | 19 | syntax = "proto2"; 20 | 21 | package ci; 22 | 23 | // Statistics about various possible bitmaps created for a CLT. Should not be 24 | // populated for other index structures. 25 | // 26 | // Next tag: 11 27 | message BitmapStats { 28 | optional double density = 1; 29 | optional double clustering = 2; 30 | 31 | // Bitmap stats when using `clt::Bitmap64`. 32 | optional int64 bitpacked_size = 3; 33 | optional int64 bitpacked_compressed_size = 4; 34 | 35 | // Bitmap stats when using Roaring bitmaps to encode all CLT bitmaps 36 | // back-to-back. 37 | optional int64 roaring_size = 5; 38 | optional int64 roaring_compressed_size = 6; 39 | 40 | // Bitmap stats when using Roaring bitmaps to encode CLT bitmaps 41 | // individually. 42 | optional int64 roaring_individual_size = 7; 43 | optional int64 roaring_individual_compressed_size = 8; 44 | // Bitmap stats when using `clt::RleBitmap` to encode all CLT bitmaps 45 | // back-to-back. 46 | optional int64 rle_size = 9; 47 | optional int64 rle_compressed_size = 10; 48 | } 49 | 50 | message EvaluationResults { 51 | // A message describing an evaluation test case. 52 | // 53 | // Note: We assume there can be no false negatives, as that would lead to 54 | // incorrect results.
55 | message TestCase { 56 | optional string name = 1; 57 | // Number of lookups performed as a part of this test case. 58 | optional int64 num_lookups = 2; 59 | // Number of cases where we incorrectly marked a data slice (e.g. a stripe) 60 | // as active. 61 | optional int64 num_false_positives = 3; 62 | // Number of cases where we marked a data slice (e.g. a stripe) as inactive. 63 | optional int64 num_true_negatives = 4; 64 | } 65 | 66 | optional string index_structure = 1; 67 | 68 | optional int64 num_rows_per_stripe = 2; 69 | 70 | optional int64 num_stripes = 3; 71 | 72 | optional string column_name = 4; 73 | 74 | optional string column_type = 5; 75 | 76 | optional int64 column_cardinality = 6; 77 | 78 | optional int64 column_compressed_size_bytes = 10; 79 | 80 | optional int64 index_size_bytes = 7; 81 | 82 | optional int64 index_compressed_size_bytes = 9; 83 | 84 | optional BitmapStats bitmap_stats = 11; 85 | 86 | repeated TestCase test_cases = 8; 87 | } 88 | -------------------------------------------------------------------------------- /evaluation_utils.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: evaluation_utils.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_EVALUATION_UTILS_H_ 20 | #define CUCKOO_INDEX_EVALUATION_UTILS_H_ 21 | 22 | #include 23 | #include 24 | 25 | #include "absl/memory/memory.h" 26 | #include "absl/strings/string_view.h" 27 | #include "absl/types/span.h" 28 | #include "common/bitmap.h" 29 | #include "evaluation.pb.h" 30 | #include "roaring.hh" 31 | 32 | namespace ci { 33 | 34 | // Writes `evaluation_results` to a CSV file at the given `path`. 35 | void WriteToCsv(const std::string path, 36 | const std::vector& evaluation_results); 37 | 38 | // Compresses the given bytes `in`. 39 | std::string Compress(absl::string_view in); 40 | 41 | // Uncompresses the given bytes `in` (which were compressed with ZSTD). 42 | std::string Uncompress(absl::string_view in); 43 | 44 | // Serializes the given `bitmap` to a string. 45 | std::string SerializeBitmap(const Bitmap64& bitmap); 46 | 47 | // Returns a bitmap with the given `bits`. bits[i] can be 0 or 1 and determines 48 | // whether the i-th bit is set. For example, CreateBitmap({1,0}) would return a 49 | // two-bit bitmap with the first one being set. 50 | Bitmap64 CreateBitmap(absl::Span bits); 51 | 52 | // Returns the density d of the given `bitmap`. d is the share of 1 bits (e.g., 53 | // 0.1 means that 10% of the bits are set). 54 | double GetBitmapDensity(const Bitmap64& bitmap); 55 | 56 | // Returns the clustering factor f of the given `bitmap`. f is the average 57 | // length of all 1-fills (i.e., consecutive 1s). 58 | double GetBitmapClustering(const Bitmap64& bitmap); 59 | 60 | // Returns a Roaring bitmap for the given `bitmap`. 61 | Roaring ToRoaring(const Bitmap64& bitmap); 62 | 63 | // Returns the number of Bitmap64Ptrs that are not set to nullptr. 
64 | size_t GetNumBitmaps(const std::vector& bitmaps); 65 | 66 | // Returns byte size assuming bitpacked `bitmaps`. 67 | size_t GetBitmapsByteSize(const std::vector& bitmaps, 68 | const size_t num_stripes); 69 | 70 | // Returns a bitmap that encompasses all individual `bitmaps` (back-to-back). 71 | Bitmap64 GetGlobalBitmap(const std::vector& bitmaps); 72 | 73 | // Returns stats for the given `bitmaps`. Individual entries may be empty (i.e., 74 | // set to nullptr) and will be skipped. 75 | ::ci::BitmapStats GetBitmapStats(const std::vector& bitmaps, 76 | const size_t num_stripes); 77 | 78 | // Prints stats for the given `bitmaps`. Individual entries may be empty (i.e., 79 | // set to nullptr) and will be skipped. 80 | void PrintBitmapStats(const std::vector& bitmaps, 81 | const size_t num_stripes); 82 | 83 | // Writes `bitmap` to a file at the given `path`. 84 | void WriteBitmapToFile(const std::string& path, const Bitmap64& bitmap); 85 | 86 | // Returns a bitmap read from a file at the given `path`. 87 | Bitmap64 ReadBitmapFromFile(const std::string& path); 88 | 89 | } // namespace ci 90 | 91 | #endif // CUCKOO_INDEX_EVALUATION_UTILS_H_ 92 | -------------------------------------------------------------------------------- /evaluation_utils_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: evaluation_utils_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "evaluation_utils.h" 20 | 21 | #include 22 | 23 | #include "gmock/gmock.h" 24 | #include "gtest/gtest.h" 25 | 26 | namespace ci { 27 | 28 | using ::testing::EndsWith; 29 | 30 | TEST(EvaluationUtilsTest, CompressAndUncompress) { 31 | const std::string orig = "Alain Delon and Jean Paul Belmondo"; 32 | const std::string compressed = Compress(orig); 33 | const std::string uncompressed = Uncompress(compressed); 34 | 35 | EXPECT_EQ(uncompressed, orig); 36 | } 37 | 38 | TEST(EvaluationUtilsTest, GetBitmapDensity) { 39 | const Bitmap64 none = CreateBitmap({0, 0, 0, 0}); 40 | EXPECT_EQ(GetBitmapDensity(none), 0.0); 41 | 42 | const Bitmap64 quarter = CreateBitmap({0, 0, 0, 1}); 43 | EXPECT_EQ(GetBitmapDensity(quarter), 0.25); 44 | 45 | const Bitmap64 half = CreateBitmap({0, 0, 1, 1}); 46 | EXPECT_EQ(GetBitmapDensity(half), 0.5); 47 | 48 | const Bitmap64 all = CreateBitmap({1, 1, 1, 1}); 49 | EXPECT_EQ(GetBitmapDensity(all), 1.0); 50 | } 51 | 52 | TEST(EvaluationUtilsTest, GetBitmapClustering) { 53 | const Bitmap64 none = CreateBitmap({0, 0, 0, 0}); 54 | EXPECT_EQ(GetBitmapClustering(none), 0.0); 55 | 56 | const Bitmap64 all = CreateBitmap({1, 1, 1, 1}); 57 | EXPECT_EQ(GetBitmapClustering(all), 4.0); 58 | 59 | const Bitmap64 one_fill_1 = CreateBitmap({0, 0, 0, 1}); 60 | EXPECT_EQ(GetBitmapClustering(one_fill_1), 1.0); 61 | 62 | const Bitmap64 one_fill_2 = CreateBitmap({0, 0, 1, 1}); 63 | EXPECT_EQ(GetBitmapClustering(one_fill_2), 2.0); 64 | 65 | const Bitmap64 two_fills_1 = CreateBitmap({1, 0, 1, 0}); 66 | EXPECT_EQ(GetBitmapClustering(two_fills_1), 1.0); 67 | 68 | const Bitmap64 two_fills_1_5 = CreateBitmap({1, 0, 1, 1}); 69 | EXPECT_EQ(GetBitmapClustering(two_fills_1_5), 1.5); 70 | } 71 | 72 | TEST(EvaluationUtilsTest, ToRoaring) { 73 | 
const Bitmap64 bitmap = CreateBitmap({0, 1, 0, 1}); 74 | const Roaring roaring = ToRoaring(bitmap); 75 | 76 | EXPECT_EQ(bitmap.GetOnesCount(), roaring.cardinality()); 77 | for (size_t i = 0; i < bitmap.bits(); ++i) { 78 | EXPECT_EQ(bitmap.Get(i), roaring.contains(i)); 79 | } 80 | } 81 | 82 | TEST(EvaluationUtilsTest, GetGlobalBitmap) { 83 | std::vector bitmaps; 84 | bitmaps.push_back(nullptr); // Bitmap64Ptrs are allowed to be nullptr. 85 | bitmaps.push_back(absl::make_unique(CreateBitmap({0, 1}))); 86 | bitmaps.push_back(absl::make_unique(CreateBitmap({1, 0}))); 87 | 88 | const size_t num_stripes = 2; 89 | const Bitmap64 global_bitmap = GetGlobalBitmap(bitmaps); 90 | ASSERT_EQ(global_bitmap.bits(), GetNumBitmaps(bitmaps) * num_stripes); 91 | size_t curr_bitmap_rank = 0; 92 | for (size_t i = 0; i < bitmaps.size(); ++i) { 93 | if (bitmaps[i] != nullptr) { 94 | for (size_t bit_idx = 0; bit_idx < bitmaps[i]->bits(); ++bit_idx) { 95 | const size_t global_idx = (curr_bitmap_rank * num_stripes) + bit_idx; 96 | EXPECT_EQ(global_bitmap.Get(global_idx), bitmaps[i]->Get(bit_idx)); 97 | } 98 | ++curr_bitmap_rank; 99 | } 100 | } 101 | } 102 | 103 | TEST(EvaluationUtilsTest, BitmapToAndFromFile) { 104 | const std::string path = "/tmp/bitmap"; 105 | const Bitmap64 bitmap = CreateBitmap({0, 1}); 106 | WriteBitmapToFile(path, bitmap); 107 | const Bitmap64 decoded = ReadBitmapFromFile(path); 108 | EXPECT_THAT(decoded.ToString(), EndsWith(bitmap.ToString())); 109 | } 110 | 111 | } // namespace ci 112 | -------------------------------------------------------------------------------- /evaluator.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 
5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: evaluator.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_EVALUATOR_H 20 | #define CUCKOO_INDEX_EVALUATOR_H 21 | 22 | #include 23 | 24 | #include "data.h" 25 | #include "evaluation.pb.h" 26 | #include "index_structure.h" 27 | 28 | namespace ci { 29 | 30 | class Evaluator { 31 | public: 32 | // Runs experiments for the given parameters and returns their results. 33 | // 34 | // Selected experiments `test_cases` (e.g. positive uniform look-ups) are run 35 | // for a pair of (column, num_rows_per_stripe). `index_structure_factories` 36 | // are used to create an index structure per experiment. 37 | std::vector RunExperiments( 38 | std::vector> 39 | index_structure_factories, 40 | const std::unique_ptr
& table, 41 | const std::vector& num_rows_per_stripe_to_test, 42 | size_t num_lookups, const std::vector& test_cases); 43 | 44 | private: 45 | // Performs positive lookups with values drawn from random row offsets. This 46 | // assumes that positive lookup values follow the same distribution as 47 | // stored values, i.e., frequent values are queried frequently. 48 | ci::EvaluationResults::TestCase DoPositiveUniformLookups( 49 | const Column& column, const IndexStructure& index_structure, 50 | std::size_t num_rows_per_stripe, std::size_t num_lookups); 51 | 52 | // Performs positive lookups with a subset of all distinct values (chosen 53 | // uniformly at random). This means, e.g., that lookup values that only occur 54 | // in a single stripe can cause up to N-1 false positive stripes. Values that 55 | // occur in all stripes, on the other hand, cannot cause any false positive 56 | // stripes. 57 | ci::EvaluationResults::TestCase DoPositiveDistinctLookups( 58 | const Column& column, const IndexStructure& index, 59 | std::size_t num_rows_per_stripe, std::size_t num_lookups); 60 | 61 | // Performs positive lookups with a subset of all distinct values (chosen 62 | // based on a Zipf distribution). 63 | ci::EvaluationResults::TestCase DoPositiveZipfLookups( 64 | const Column& column, const IndexStructure& index, 65 | std::size_t num_rows_per_stripe, std::size_t num_lookups); 66 | 67 | // Performs negative lookups with random values not present in the `column`. 68 | // Note that in this test case, ZoneMaps will be 100% effective for 69 | // dict-encoded string columns as lookup keys will have values >= number of 70 | // distinct strings (i.e., they'll be outside of the domain of the 71 | // dict-encoded column). Alternatively, we could exclude one stripe and use 72 | // distinct values that only occur in that stripe as lookup values. 
However, 73 | // this wouldn't necessarily produce the same results as if we didn't 74 | // dict-encode string columns in the first place (in particular, it depends on 75 | // which stripe we exclude). Also, this strategy wouldn't work for low 76 | // cardinality columns. Therefore, we've decided not to evaluate ZoneMaps on 77 | // string columns with negative lookup keys and use a rather simple lookup key 78 | // generation here that only ensures that a lookup key doesn't occur in any 79 | // stripe. 80 | ci::EvaluationResults::TestCase DoNegativeLookups( 81 | const Column& column, const IndexStructure& index, 82 | std::size_t num_rows_per_stripe, std::size_t num_lookups); 83 | 84 | // Performs lookups with a mix of positive (chosen from distinct 85 | // values as in DoPositiveDistinctLookups) and negative lookup keys. 86 | // `hit_rate` determines the share of positive lookups (e.g., 0.1 means that 87 | // 10% of the lookup keys are positive, i.e., are present in at least one 88 | // stripe). 89 | ci::EvaluationResults::TestCase DoMixedLookups( 90 | const Column& column, const IndexStructure& index, 91 | std::size_t num_rows_per_stripe, std::size_t num_lookups, 92 | double hit_rate); 93 | 94 | // Probes all stripes for the given `column` and `index`, and updates 95 | // `num_true_negative_stripes` (ground truth true negatives) and 96 | // `num_false_positive_stripes` (number of times the `index` did not prune 97 | // a stripe even though it could have). 
98 | void ProbeAllStripes(const Column& column, const IndexStructure& index, 99 | int value, std::size_t num_rows_per_stripe, 100 | std::size_t num_stripes, 101 | std::size_t* num_true_negative_stripes, 102 | std::size_t* num_false_positive_stripes); 103 | }; 104 | 105 | } // namespace ci 106 | 107 | #endif // CUCKOO_INDEX_EVALUATOR_H 108 | -------------------------------------------------------------------------------- /fingerprint_store.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: fingerprint_store.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_FINGERPRINT_STORE_H_ 20 | #define CUCKOO_INDEX_FINGERPRINT_STORE_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | #include 26 | 27 | #include "absl/container/flat_hash_map.h" 28 | #include "cuckoo_utils.h" 29 | #include "evaluation_utils.h" 30 | 31 | namespace ci { 32 | 33 | // Stores fingerprints with a fixed number of bits (`num_bits`). The maximum bit 34 | // width of `fingerprints` has to be at most `num_bits` bits. 35 | class Block { 36 | public: 37 | explicit Block(const size_t num_bits, 38 | const std::vector& fingerprints); 39 | 40 | // Forbid copying and moving. 
41 | Block(const Block&) = delete; 42 | Block& operator=(const Block&) = delete; 43 | Block(Block&&) = delete; 44 | Block& operator=(Block&&) = delete; 45 | 46 | size_t num_bits() const { return num_bits_; } 47 | 48 | // Returns the fingerprint bits stored at `idx`. 49 | uint64_t Get(const size_t idx) const { 50 | assert(idx < num_fingerprints_); 51 | return fingerprints_.Get(idx); 52 | } 53 | 54 | const std::string& GetData() const { return data_; } 55 | 56 | private: 57 | // The number of bits of fingerprints stored in this block. 58 | const size_t num_bits_; 59 | const size_t num_fingerprints_; 60 | 61 | std::string data_; 62 | BitPackedReader fingerprints_; 63 | }; 64 | 65 | // Stores variable-sized fingerprints in different blocks, each block storing 66 | // fingerprints of a fixed length. For each block, we maintain a bitmap 67 | // indicating which buckets are stored in this block. The individual blocks 68 | // allow for random access (i.e., they do not need to be decompressed to 69 | // reconstruct individual fingerprints). 70 | // 71 | // As an optimization, we compact consecutive block bitmaps such that a bitmap 72 | // only contains the zero-bits of its predecessor. In other words, a bitmap only 73 | // has a bit for each remaining bucket (i.e., all buckets that are NOT stored in 74 | // previous blocks). 75 | // 76 | // To increase the effect of this optimization, we order the individual blocks 77 | // based on decreasing cardinality (i.e., number of buckets stored in a block). 78 | // 79 | // Example encoding for the fingerprints {1, 101, 01, 0, 001} with one slot per 80 | // bucket: 81 | // 82 | // Block 0: 101001 -- bitpacked fingerprints 101 and 001 83 | // Block 1: 10 -- bitpacked fingerprints 1 and 0 84 | // Block 2: 01 -- bitpacked fingerprint 85 | // 86 | // Block bitmap 0: 01001 -- fingerprints no. 1 and 4 are stored in this block 87 | // Block bitmap 1: 101 -- of the 3 remaining fingerprints no. 
0 and 2 are here 88 | // Block bitmap 2: 1 -- only one remaining fingerprint 89 | // 90 | // Specifically, we encode all block bitmaps back-to-back as a single RLE 91 | // bitmap. 92 | // 93 | // Individual blocks are stored as follows: 94 | // 95 | // uint32_t num_bits -- number of bits of fingerprints in this block 96 | // uint32_t bit_width -- actual bit width of fingerprints in this block 97 | // (could theoretically be lower) 98 | // .. bitpacked fingerprints .. 99 | // 8 'slop' bytes -- to be able to read bit-packed 64-bit values, we need 100 | // to ensure that a whole uword_t can be read from the position of the last 101 | // encoded diff, hence we need to be able to read at most 7 bytes past it 102 | class FingerprintStore { 103 | using BlockPtr = std::unique_ptr; 104 | 105 | // A helper struct used during block creation. 106 | struct BlockContent { 107 | Bitmap64Ptr block_bitmap; 108 | std::vector fingerprints; 109 | }; 110 | 111 | public: 112 | // Decodes FingerprintStore from bytes. 113 | static void Decode(const std::string& data); 114 | 115 | // The fingerprints passed here have a 1:1 correspondence to the slots in the 116 | // Cuckoo table. Individual fingerprints can be `inactive`, which means that 117 | // the corresponding slot is empty (i.e., doesn't contain a fingerprint). 118 | explicit FingerprintStore(const std::vector& fingerprints, 119 | const size_t slots_per_bucket, 120 | const bool use_rle_to_encode_block_bitmaps); 121 | 122 | // Returns fingerprint stored in slot `slot_idx`. 123 | Fingerprint GetFingerprint(const size_t slot_idx) const; 124 | 125 | // Encodes FingerprintStore as bytes. For `bitmaps_only` = true, only the 126 | // bitmaps will be encoded. This is only used for printing stats. 
127 | std::string Encode(bool bitmaps_only = false) const; 128 | 129 | size_t num_slots() const { return num_slots_; } 130 | 131 | // Returns the bitmap indicating empty slots. 132 | const Bitmap64& EmptySlotsBitmap() const { return *empty_slots_bitmap_; } 133 | 134 | size_t GetSizeInBytes(bool bitmaps_only) const { 135 | return Encode(bitmaps_only).size(); 136 | } 137 | 138 | size_t GetZstdCompressedSizeInBytes(bool bitmaps_only) const { 139 | return Compress(Encode(bitmaps_only)).size(); 140 | } 141 | 142 | double GetBitsPerFingerprint(bool bitmaps_only) const { 143 | return static_cast<double>(GetSizeInBytes(bitmaps_only) * CHAR_BIT) / 144 | num_stored_fingerprints_; 145 | } 146 | 147 | double GetBitsPerFingerprintZstdCompressed(bool bitmaps_only) const { 148 | return static_cast<double>(GetZstdCompressedSizeInBytes(bitmaps_only) * 149 | CHAR_BIT) / 150 | num_stored_fingerprints_; 151 | } 152 | 153 | size_t GetNumBlocks() const { return blocks_.size(); } 154 | 155 | void PrintStats() const; 156 | 157 | private: 158 | // Returns the bucket index that the bit `bit_idx` in block bitmap `block_idx` 159 | // corresponds to. 160 | size_t GetBucketIndex(const size_t block_idx, const size_t bit_idx) const; 161 | 162 | // Returns the number of non-empty slots in bucket `bucket_idx`. 163 | size_t GetNumItemsInBucket(const size_t bucket_idx) const; 164 | 165 | // Returns the index of fingerprint `slot_idx` in block `block_idx` (the 166 | // offset to the fingerprint bits in the bitpacked storage). 167 | // `idx_in_compacted_bitmap` is the index of the fingerprint in the compacted 168 | // bitmap `block_idx`. 169 | size_t GetIndexOfFingerprintInBlock(const size_t block_idx, 170 | const size_t idx_in_compacted_bitmap, 171 | const size_t slot_idx) const; 172 | 173 | // Maps `bucket_idx` to its corresponding index (bit) in the block bitmap 174 | // `block_bitmap_idx`. 
175 | size_t MapBucketIndexToBitInBlockBitmap(const size_t bucket_idx, 176 | const size_t block_bitmap_idx) const; 177 | 178 | // Creates and compacts block bitmaps in `lengths` order. The idea is to 179 | // "leave out" bits in subsequent block bitmaps, specifically those that are 180 | // set in the previous (already compacted) block bitmap. 181 | void CreateAndCompactBlockBitmaps( 182 | const std::vector& lengths, 183 | absl::flat_hash_map* blocks); 184 | 185 | // A bitmap indicating empty slots. 186 | Bitmap64Ptr empty_slots_bitmap_; 187 | 188 | // Bitmaps indicating which slot is stored in which block. A subsequent bitmap 189 | // has `prev.GetOnesCount()` fewer bits than its predecessor. 190 | // TODO: Replace with a single RLE bitmap. 191 | std::vector block_bitmaps_; 192 | std::vector blocks_; 193 | 194 | const size_t num_slots_; 195 | size_t num_stored_fingerprints_; 196 | 197 | size_t slots_per_bucket_; 198 | bool use_rle_to_encode_block_bitmaps_; 199 | }; 200 | 201 | } // namespace ci 202 | 203 | #endif // CUCKOO_INDEX_FINGERPRINT_STORE_H_ 204 | -------------------------------------------------------------------------------- /fingerprint_store_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: fingerprint_store_test.cc 17 | // ----------------------------------------------------------------------------- 18 | 19 | #include "fingerprint_store.h" 20 | 21 | #include 22 | 23 | #include "absl/types/span.h" 24 | #include "cuckoo_utils.h" 25 | #include "gmock/gmock.h" 26 | #include "gtest/gtest.h" 27 | 28 | namespace ci { 29 | 30 | constexpr uint32_t kMurmurHashConstant = 0x5bd1e995; 31 | constexpr size_t kNumFingerprints = 1e3; 32 | 33 | // **** Helper methods **** 34 | 35 | // Creates `n` random fingerprints with different `lengths` such that all 36 | // `slots_per_bucket` fingerprints in a bucket share the same length. 37 | std::vector CreateRandomFingerprints(const size_t n, 38 | const size_t slots_per_bucket, 39 | std::vector lengths) { 40 | assert(lengths.size() > 0); 41 | std::sort(lengths.begin(), lengths.end()); 42 | 43 | // Insert `lengths[i]` lengths.size()-i times such that shorter lengths are 44 | // more likely to be chosen. 45 | std::vector lengths_to_draw_from; 46 | for (size_t i = 0; i < lengths.size(); ++i) { 47 | for (size_t j = 0; j < lengths.size() - i; ++j) 48 | lengths_to_draw_from.push_back(lengths[i]); 49 | } 50 | 51 | std::vector fingerprints; 52 | fingerprints.reserve(n); 53 | for (size_t i = 0; i < n; i += slots_per_bucket) { 54 | const size_t hash_bucket = i * kMurmurHashConstant; 55 | const size_t num_bits = 56 | lengths_to_draw_from[hash_bucket % lengths_to_draw_from.size()]; 57 | 58 | // Fill all slots. 59 | for (size_t j = 0; j < slots_per_bucket; ++j) { 60 | const size_t hash_slot = (i + j) * kMurmurHashConstant; 61 | Fingerprint fp{/*active=*/(i + j) % 10 == 0 ? 
false : true, num_bits, 62 | /*fingerprint=*/hash_slot % (1 << num_bits)}; 63 | fingerprints.push_back(fp); 64 | } 65 | } 66 | return fingerprints; 67 | } 68 | 69 | // **** Test cases **** 70 | 71 | // Creates fingerprints with different `lengths`, stores them in a 72 | // FingerprintStore, and calls GetFingerprint(..) on each of them. 73 | void CreateStoreAndGetFingerprints(const std::vector& lengths, 74 | const size_t slots_per_bucket, 75 | const bool use_rle_to_encode_block_bitmaps) { 76 | const std::vector fingerprints = 77 | CreateRandomFingerprints(kNumFingerprints, slots_per_bucket, lengths); 78 | const FingerprintStore store(fingerprints, slots_per_bucket, 79 | use_rle_to_encode_block_bitmaps); 80 | for (size_t i = 0; i < fingerprints.size(); ++i) { 81 | const Fingerprint fp = store.GetFingerprint(i); 82 | ASSERT_EQ(fp.active, fingerprints[i].active); 83 | if (fp.active) { 84 | ASSERT_EQ(fp.num_bits, fingerprints[i].num_bits); 85 | ASSERT_EQ(fp.fingerprint, fingerprints[i].fingerprint); 86 | } 87 | } 88 | } 89 | 90 | TEST(FingerprintStore, GetFingerprintReturnsCorrectFingerprintSingleBlock) { 91 | CreateStoreAndGetFingerprints(/*lengths=*/{8}, /*slots_per_bucket=*/1, 92 | /*use_rle_to_encode_block_bitmaps=*/false); 93 | } 94 | 95 | TEST(FingerprintStore, GetFingerprintReturnsCorrectFingerprintSingleBlockRLE) { 96 | CreateStoreAndGetFingerprints(/*lengths=*/{8}, /*slots_per_bucket=*/1, 97 | /*use_rle_to_encode_block_bitmaps=*/true); 98 | } 99 | 100 | TEST(FingerprintStore, GetFingerprintReturnsCorrectFingerprintFiveBlocks) { 101 | CreateStoreAndGetFingerprints(/*lengths=*/{1, 2, 4, 8, 16}, 102 | /*slots_per_bucket=*/1, 103 | /*use_rle_to_encode_block_bitmaps=*/false); 104 | } 105 | 106 | TEST(FingerprintStore, GetFingerprintReturnsCorrectFingerprintFiveBlocksRLE) { 107 | CreateStoreAndGetFingerprints(/*lengths=*/{1, 2, 4, 8, 16}, 108 | /*slots_per_bucket=*/1, 109 | /*use_rle_to_encode_block_bitmaps=*/true); 110 | } 111 | 112 | TEST(FingerprintStore, 
GetFingerprintReturnsCorrectFingerprintZeroBits) { 113 | CreateStoreAndGetFingerprints(/*lengths=*/{0}, 114 | /*slots_per_bucket=*/1, 115 | /*use_rle_to_encode_block_bitmaps=*/false); 116 | } 117 | 118 | TEST(FingerprintStore, GetFingerprintReturnsCorrectFingerprintZeroAndOneBits) { 119 | CreateStoreAndGetFingerprints(/*lengths=*/{0, 1}, 120 | /*slots_per_bucket=*/1, 121 | /*use_rle_to_encode_block_bitmaps=*/false); 122 | } 123 | 124 | TEST(FingerprintStore, 125 | GetFingerprintReturnsCorrectFingerprintTwoSlotsPerBucket) { 126 | CreateStoreAndGetFingerprints(/*lengths=*/{1, 2, 4, 8, 16}, 127 | /*slots_per_bucket=*/2, 128 | /*use_rle_to_encode_block_bitmaps=*/false); 129 | } 130 | 131 | } // namespace ci 132 | -------------------------------------------------------------------------------- /index_structure.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: index_structure.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CUCKOO_INDEX_INDEX_STRUCTURE_H_ 20 | #define CUCKOO_INDEX_INDEX_STRUCTURE_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | 26 | #include "data.h" 27 | #include "evaluation.pb.h" 28 | 29 | namespace ci { 30 | 31 | // Class representing an index structure, e.g. based on a Bloom filter. 32 | class IndexStructure { 33 | public: 34 | IndexStructure() {} 35 | virtual ~IndexStructure() {} 36 | 37 | // Returns true if the stripe with `stripe_id` contains the given `value`. 38 | virtual bool StripeContains(size_t stripe_id, int value) const = 0; 39 | 40 | // Returns a bitmap indicating possibly qualifying stripes for the given 41 | // `value`. Probes up to `num_stripes` stripes. 42 | // Note: classes extending IndexStructure can override this method when they 43 | // can provide an optimized approach here (see CuckooIndex for an example). 44 | virtual Bitmap64 GetQualifyingStripes(int value, size_t num_stripes) const { 45 | // Default implementation for per-stripe index structures. 46 | Bitmap64 result(/*size=*/num_stripes); 47 | for (size_t stripe_id = 0; stripe_id < static_cast(num_stripes); 48 | ++stripe_id) { 49 | if (StripeContains(stripe_id, value)) 50 | result.Set(stripe_id, true); 51 | } 52 | return result; 53 | } 54 | 55 | // Returns the name of the index structure. 56 | virtual std::string name() const = 0; 57 | 58 | // Returns the in-memory size of the index structure. 59 | virtual size_t byte_size() const = 0; 60 | 61 | // Returns the in-memory size of the compressed index structure. 62 | virtual size_t compressed_byte_size() const = 0; 63 | 64 | // Returns statistics about internal data structures using bitmaps. Should 65 | // be implemented only by CLT-based index structures. 
66 | virtual ci::BitmapStats bitmap_stats() { return ci::BitmapStats(); } 67 | }; 68 | 69 | using IndexStructurePtr = std::unique_ptr; 70 | 71 | class IndexStructureFactory { 72 | public: 73 | IndexStructureFactory() {} 74 | virtual ~IndexStructureFactory() {} 75 | 76 | // Creates an index structure for the given `column`. 77 | virtual IndexStructurePtr Create(const Column& column, 78 | size_t num_rows_per_stripe) const = 0; 79 | 80 | // Returns the name of the index that can be created using the factory. 81 | virtual std::string index_name() const = 0; 82 | }; 83 | 84 | } // namespace ci 85 | 86 | #endif // CUCKOO_INDEX_INDEX_STRUCTURE_H_ 87 | -------------------------------------------------------------------------------- /leveldb.BUILD: -------------------------------------------------------------------------------- 1 | # Copyright 2020 Google LLC 2 | # 3 | # Licensed under the Apache License, Version 2.0 (the "License"); 4 | # you may not use this file except in compliance with the License. 5 | # You may obtain a copy of the License at 6 | # 7 | # https://www.apache.org/licenses/LICENSE-2.0 8 | # 9 | # Unless required by applicable law or agreed to in writing, software 10 | # distributed under the License is distributed on an "AS IS" BASIS, 11 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | # See the License for the specific language governing permissions and 13 | # limitations under the License. 14 | # 15 | # ----------------------------------------------------------------------------- 16 | # File: leveldb.BUILD 17 | # ----------------------------------------------------------------------------- 18 | 19 | load("@rules_cc//cc:defs.bzl", "cc_library") 20 | 21 | package(default_visibility = ["//visibility:public"]) 22 | 23 | # Original headers at include/leveldb/... 24 | original_headers = glob(["include/leveldb/*.h"]) 25 | 26 | # Headers exported to include/... and to leveldb/... 
27 | headers_in_include = ["include/" + x.split("/")[-1] for x in original_headers] 28 | headers_in_leveldb = ["leveldb/" + x.split("/")[-1] for x in original_headers] 29 | 30 | genrule( 31 | name = "relocate_headers_to_include", 32 | srcs = original_headers, 33 | outs = headers_in_include, 34 | cmd = "cp $(SRCS) -t $(@D)/include", 35 | ) 36 | 37 | genrule( 38 | name = "relocate_headers_to_leveldb", 39 | srcs = original_headers, 40 | outs = headers_in_leveldb, 41 | cmd = "cp $(SRCS) -t $(@D)/leveldb", 42 | ) 43 | 44 | filegroup( 45 | name = "util_sources_group", 46 | srcs = [ 47 | "util/arena.cc", 48 | "util/arena.h", 49 | "util/bloom.cc", 50 | "util/coding.cc", 51 | "util/coding.h", 52 | "util/crc32c.cc", 53 | "util/crc32c.h", 54 | "util/env.cc", 55 | "util/filter_policy.cc", 56 | "util/hash.cc", 57 | "util/hash.h", 58 | "util/logging.cc", 59 | "util/logging.h", 60 | "util/status.cc", 61 | "util/env_posix_test_helper.h", 62 | "util/posix_logger.h", 63 | ], 64 | ) 65 | 66 | filegroup( 67 | name = "util_headers_group", 68 | srcs = [ 69 | "port/port.h", 70 | "port/port_stdcxx.h", 71 | "port/thread_annotations.h", 72 | "include/export.h", 73 | "include/filter_policy.h", 74 | "include/slice.h", 75 | "leveldb/env.h", 76 | "leveldb/export.h", 77 | "leveldb/filter_policy.h", 78 | "leveldb/slice.h", 79 | "leveldb/status.h", 80 | ], 81 | ) 82 | 83 | cc_library( 84 | name = "util", 85 | # TODO: Change to env_windows.cc for windows. 86 | srcs = [":util_sources_group", "util/env_posix.cc"], 87 | hdrs = [":util_headers_group"], 88 | defines = ["LEVELDB_PLATFORM_POSIX", "LEVELDB_IS_BIG_ENDIAN=false"], 89 | ) 90 | -------------------------------------------------------------------------------- /per_stripe_bloom.h: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 
5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 14 | // 15 | // ----------------------------------------------------------------------------- 16 | // File: per_stripe_bloom.h 17 | // ----------------------------------------------------------------------------- 18 | 19 | #ifndef CI_PER_STRIPE_BLOOM_H_ 20 | #define CI_PER_STRIPE_BLOOM_H_ 21 | 22 | #include 23 | #include 24 | #include 25 | #include 26 | #include 27 | 28 | #include "absl/strings/str_cat.h" 29 | #include "data.h" 30 | #include "evaluation_utils.h" 31 | #include "index_structure.h" 32 | #include "leveldb/filter_policy.h" 33 | #include "leveldb/slice.h" 34 | 35 | namespace ci { 36 | 37 | class PerStripeBloom : public IndexStructure { 38 | public: 39 | PerStripeBloom(const std::vector& data, std::size_t num_rows_per_stripe, 40 | std::size_t num_bits_per_key) 41 | : num_bits_per_key_(num_bits_per_key), 42 | policy_(leveldb::NewBloomFilterPolicy(num_bits_per_key_)) { 43 | num_stripes_ = data.size() / num_rows_per_stripe; 44 | // Pre-allocate `num_stripes_` strings to store filters. 45 | filters_.resize(num_stripes_); 46 | for (size_t stripe_id = 0; stripe_id < num_stripes_; ++stripe_id) { 47 | const std::size_t stripe_begin = num_rows_per_stripe * stripe_id; 48 | const std::size_t stripe_end = stripe_begin + num_rows_per_stripe; 49 | const std::size_t num_values = stripe_end - stripe_begin; 50 | // Collect values of current stripe. (LevelDB's Bloom filter only accepts 51 | // leveldb::Slice.) 
52 | absl::flat_hash_set string_values; 53 | string_values.reserve(num_values); 54 | for (size_t i = stripe_begin; i < stripe_end; ++i) { 55 | string_values.insert(std::to_string(data[i])); 56 | } 57 | 58 | // Create the filter for the current stripe. 59 | const std::vector slices(string_values.begin(), 60 | string_values.end()); 61 | policy_->CreateFilter(slices.data(), static_cast(slices.size()), 62 | &filters_[stripe_id]); 63 | } 64 | } 65 | 66 | bool StripeContains(std::size_t stripe_id, int value) const override { 67 | if (stripe_id >= num_stripes_) { 68 | std::cerr << "`stripe_id` is out of bounds." << std::endl; 69 | exit(EXIT_FAILURE); 70 | } 71 | // Probe corresponding filter. 72 | return policy_->KeyMayMatch(leveldb::Slice(std::to_string(value)), 73 | filters_[stripe_id]); 74 | } 75 | 76 | std::string name() const override { 77 | return std::string("PerStripeBloom/") + std::to_string(num_bits_per_key_); 78 | } 79 | 80 | size_t byte_size() const override { 81 | std::size_t result = 0; 82 | for (const std::string& filter : filters_) { 83 | result += filter.size(); 84 | } 85 | return result; 86 | } 87 | 88 | size_t compressed_byte_size() const override { 89 | std::string data; 90 | for (const std::string& filter : filters_) { 91 | absl::StrAppend(&data, filter); 92 | } 93 | 94 | return Compress(data).size(); 95 | } 96 | 97 | std::size_t num_stripes() const { return num_stripes_; } 98 | 99 | private: 100 | std::size_t num_stripes_; 101 | std::size_t num_bits_per_key_; 102 | std::unique_ptr policy_; 103 | std::vector filters_; 104 | }; 105 | 106 | class PerStripeBloomFactory : public IndexStructureFactory { 107 | public: 108 | explicit PerStripeBloomFactory(size_t num_bits_per_key) 109 | : num_bits_per_key_(num_bits_per_key) {} 110 | std::unique_ptr Create( 111 | const Column& column, size_t num_rows_per_stripe) const override { 112 | return absl::make_unique(column.data(), num_rows_per_stripe, 113 | num_bits_per_key_); 114 | } 115 | 116 | std::string index_name() 
const override { 117 | return std::string("PerStripeBloom/") + std::to_string(num_bits_per_key_); 118 | } 119 | 120 | const size_t num_bits_per_key_; 121 | }; 122 | 123 | // A variant of the `PerStripeBloom` factory that takes another filter's 124 | // factory in order to build a Bloom filter of a comparable compressed size. 125 | // This is useful, e.g., when trying to compare scan rates of filters with 126 | // roughly the same size. 127 | class PerStripeBloomComparableSizeFactory : public IndexStructureFactory { 128 | public: 129 | explicit PerStripeBloomComparableSizeFactory( 130 | std::unique_ptr other_index_factory) 131 | : other_index_factory_(std::move(other_index_factory)) {} 132 | std::unique_ptr Create( 133 | const Column& column, size_t num_rows_per_stripe) const override { 134 | constexpr size_t kMaxBitsPerKey = 30; 135 | 136 | IndexStructurePtr other_index = 137 | other_index_factory_->Create(column, num_rows_per_stripe); 138 | const size_t target_size = other_index->compressed_byte_size(); 139 | 140 | // Binary search over [1, kMaxBitsPerKey] for the number of bits per key 141 | // that minimizes the difference in size between the filters. 142 | size_t argmin_num_bits_per_key = 1; 143 | size_t min_size_diff = std::numeric_limits::max(); 144 | size_t min_bits = 1; 145 | size_t max_bits = kMaxBitsPerKey; 146 | while (min_bits <= max_bits) { 147 | const size_t num_bits_per_key = (min_bits + max_bits) / 2; 148 | 149 | auto bloom_index = absl::make_unique( 150 | column.data(), num_rows_per_stripe, num_bits_per_key); 151 | size_t abs_size_diff = target_size < bloom_index->compressed_byte_size() 152 | ? 
bloom_index->compressed_byte_size() - target_size 153 | : target_size - bloom_index->compressed_byte_size(); 154 | 155 | if (abs_size_diff < min_size_diff) { 156 | min_size_diff = abs_size_diff; 157 | argmin_num_bits_per_key = num_bits_per_key; 158 | } 159 | 160 | if (target_size < bloom_index->compressed_byte_size()) { 161 | max_bits = num_bits_per_key - 1; 162 | } else { 163 | min_bits = num_bits_per_key + 1; 164 | } 165 | } 166 | 167 | return absl::make_unique(column.data(), num_rows_per_stripe, 168 | argmin_num_bits_per_key); 169 | } 170 | 171 | std::string index_name() const override { 172 | return std::string("PerStripeBloomComparableSize/" + 173 | other_index_factory_->index_name()); 174 | } 175 | 176 | std::unique_ptr other_index_factory_; 177 | }; 178 | 179 | } // namespace ci 180 | 181 | #endif // CI_PER_STRIPE_BLOOM_H_ 182 | -------------------------------------------------------------------------------- /per_stripe_bloom_test.cc: -------------------------------------------------------------------------------- 1 | // Copyright 2020 Google LLC 2 | // 3 | // Licensed under the Apache License, Version 2.0 (the "License"); 4 | // you may not use this file except in compliance with the License. 5 | // You may obtain a copy of the License at 6 | // 7 | // https://www.apache.org/licenses/LICENSE-2.0 8 | // 9 | // Unless required by applicable law or agreed to in writing, software 10 | // distributed under the License is distributed on an "AS IS" BASIS, 11 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 12 | // See the License for the specific language governing permissions and 13 | // limitations under the License. 
//
// -----------------------------------------------------------------------------
// File: per_stripe_bloom_test.cc
// -----------------------------------------------------------------------------

#include "per_stripe_bloom.h"

#include "gtest/gtest.h"

namespace ci {

TEST(PerStripeBloomTest, StripeContains) {
  PerStripeBloom per_stripe_bloom(
      /*data=*/{1, 2, 3, 4}, /*num_rows_per_stripe=*/2,
      /*num_bits_per_key=*/10);

  EXPECT_TRUE(per_stripe_bloom.StripeContains(/*stripe_id=*/0, /*value=*/1));
  EXPECT_TRUE(per_stripe_bloom.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_TRUE(per_stripe_bloom.StripeContains(/*stripe_id=*/1, /*value=*/3));
  EXPECT_TRUE(per_stripe_bloom.StripeContains(/*stripe_id=*/1, /*value=*/4));
}

}  // namespace ci

--------------------------------------------------------------------------------
/per_stripe_xor.h:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: per_stripe_xor.h
// -----------------------------------------------------------------------------

#ifndef CI_PER_STRIPE_XOR_H_
#define CI_PER_STRIPE_XOR_H_

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "absl/container/flat_hash_set.h"
#include "absl/memory/memory.h"
#include "absl/strings/str_cat.h"
#include "data.h"
#include "evaluation_utils.h"
#include "index_structure.h"
#include "xor_filter.h"

namespace ci {

// Creates one Xor8 filter per stripe.
class PerStripeXor : public IndexStructure {
 public:
  // Rows beyond the last full stripe are ignored.
  PerStripeXor(const std::vector<int>& data, std::size_t num_rows_per_stripe) {
    num_stripes_ = data.size() / num_rows_per_stripe;
    filters_.reserve(num_stripes_);
    for (size_t stripe_id = 0; stripe_id < num_stripes_; ++stripe_id) {
      const std::size_t stripe_begin = num_rows_per_stripe * stripe_id;
      const std::size_t stripe_end = stripe_begin + num_rows_per_stripe;
      const std::size_t num_values = stripe_end - stripe_begin;
      // Collect the distinct values of the current stripe (`Xor8` requires
      // duplicate-free keys).
      absl::flat_hash_set<uint64_t> values;
      values.reserve(num_values);
      for (size_t i = stripe_begin; i < stripe_end; ++i) values.insert(data[i]);

      // Create the filter for the current stripe.
      const std::vector<uint64_t> keys(values.begin(), values.end());
      filters_.push_back(absl::make_unique<Xor8>(keys));
    }
  }

  bool StripeContains(std::size_t stripe_id, int value) const override {
    if (stripe_id >= num_stripes_) {
      std::cerr << "`stripe_id` is out of bounds." << std::endl;
      exit(EXIT_FAILURE);
    }
    // Probe the corresponding filter.
    return filters_[stripe_id]->Contains(value);
  }

  std::string name() const override { return std::string("PerStripeXor"); }

  size_t byte_size() const override {
    size_t result = 0;
    for (size_t i = 0; i < filters_.size(); ++i)
      result += filters_[i]->SizeInBytes();
    return result;
  }

  size_t compressed_byte_size() const override {
    std::string data;
    for (size_t i = 0; i < filters_.size(); ++i)
      absl::StrAppend(&data, filters_[i]->Data());
    return Compress(data).size();
  }

  std::size_t num_stripes() { return num_stripes_; }

 private:
  std::size_t num_stripes_;
  std::vector<std::unique_ptr<Xor8>> filters_;
};

class PerStripeXorFactory : public IndexStructureFactory {
 public:
  explicit PerStripeXorFactory() {}
  std::unique_ptr<IndexStructure> Create(
      const Column& column, size_t num_rows_per_stripe) const override {
    return absl::make_unique<PerStripeXor>(column.data(), num_rows_per_stripe);
  }

  std::string index_name() const override {
    return std::string("PerStripeXor");
  }
};

}  // namespace ci

#endif  // CI_PER_STRIPE_XOR_H_

--------------------------------------------------------------------------------
/per_stripe_xor_test.cc:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: per_stripe_xor_test.cc
// -----------------------------------------------------------------------------

#include "per_stripe_xor.h"

#include "gtest/gtest.h"

namespace ci {

TEST(PerStripeXorTest, StripeContains) {
  PerStripeXor per_stripe_xor(/*data=*/{1, 2, 3, 4}, /*num_rows_per_stripe=*/2);

  EXPECT_TRUE(per_stripe_xor.StripeContains(/*stripe_id=*/0, /*value=*/1));
  EXPECT_TRUE(per_stripe_xor.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_TRUE(per_stripe_xor.StripeContains(/*stripe_id=*/1, /*value=*/3));
  EXPECT_TRUE(per_stripe_xor.StripeContains(/*stripe_id=*/1, /*value=*/4));
}

}  // namespace ci

--------------------------------------------------------------------------------
/xor_filter.h:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: xor_filter.h
// -----------------------------------------------------------------------------

#ifndef CI_XOR_FILTER_H_
#define CI_XOR_FILTER_H_

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

#include "absl/strings/str_cat.h"
#include "include/xorfilter.h"

namespace ci {

// A wrapper for the Xor filter with 8-bit fingerprints. This filter uses close
// to 10 bits per element and achieves a false positive probability of about
// 0.4% (2^-8 for 8-bit fingerprints).
//
// Adapted from https://github.com/FastFilter/xor_singleheader.
class Xor8 {
 public:
  // `keys` must be duplicate-free.
  explicit Xor8(const std::vector<uint64_t>& keys) {
    if (!xor8_allocate(keys.size(), &filter_)) {
      std::cerr << "Couldn't allocate Xor filter." << std::endl;
      exit(EXIT_FAILURE);
    }
    if (!xor8_buffered_populate(reinterpret_cast<const uint64_t*>(keys.data()),
                                keys.size(), &filter_)) {
      std::cerr << "Couldn't populate Xor filter." << std::endl;
      exit(EXIT_FAILURE);
    }
  }

  // Forbid copying and moving.
  Xor8(const Xor8&) = delete;
  Xor8& operator=(const Xor8&) = delete;
  Xor8(Xor8&&) = delete;
  Xor8& operator=(Xor8&&) = delete;

  ~Xor8() { xor8_free(&filter_); }

  inline bool Contains(const uint64_t key) const {
    return xor8_contain(key, &filter_);
  }

  std::string Data() const {
    std::string data;
    absl::StrAppend(&data, filter_.seed);
    absl::StrAppend(&data, filter_.blockLength);
    const size_t num_fingerprints = 3 * filter_.blockLength;
    for (size_t i = 0; i < num_fingerprints; ++i)
      absl::StrAppend(&data, filter_.fingerprints[i]);
    return data;
  }

  size_t SizeInBytes() const { return xor8_size_in_bytes(&filter_); }

 private:
  xor8_s filter_;
};

}  // namespace ci

#endif  // CI_XOR_FILTER_H_

--------------------------------------------------------------------------------
/xor_filter_test.cc:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: xor_filter_test.cc
// -----------------------------------------------------------------------------

#include "xor_filter.h"

#include <cstdint>
#include <vector>

#include "gmock/gmock.h"
#include "gtest/gtest.h"

namespace ci {

static constexpr size_t kNumKeys = 1000;

TEST(Xor8Test, Contains) {
  Xor8 xor8(/*keys=*/{1, 2, 3, 4});

  EXPECT_TRUE(xor8.Contains(/*key=*/1));
  EXPECT_TRUE(xor8.Contains(/*key=*/2));
  EXPECT_TRUE(xor8.Contains(/*key=*/3));
  EXPECT_TRUE(xor8.Contains(/*key=*/4));
}

TEST(Xor8Test, CheckForLowFalsePositiveProbability) {
  std::vector<uint64_t> keys;
  keys.reserve(kNumKeys);
  for (uint64_t i = 0; i < kNumKeys; ++i) keys.push_back(i);
  Xor8 xor8(keys);
  // Inserted keys must always be found (no false negatives).
  for (const uint64_t key : keys) EXPECT_TRUE(xor8.Contains(key));

  // Probe kNumKeys keys that were never inserted and count the hits.
  size_t num_false_positives = 0;
  for (uint64_t i = kNumKeys; i < 2 * kNumKeys; ++i)
    num_false_positives += xor8.Contains(i);
  const double false_positive_probability =
      static_cast<double>(num_false_positives) / kNumKeys;
  EXPECT_LT(false_positive_probability, 0.01);
}

}  // namespace ci

--------------------------------------------------------------------------------
/xor_singleheader.BUILD:
--------------------------------------------------------------------------------
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# -----------------------------------------------------------------------------
# File: xor_singleheader.BUILD
# -----------------------------------------------------------------------------

load("@rules_cc//cc:defs.bzl", "cc_library")

package(default_visibility = ["//visibility:public"])

cc_library(
    name = "xorfilter",
    hdrs = ["include/xorfilter.h"],
)

--------------------------------------------------------------------------------
/zone_map.h:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: zone_map.h
// -----------------------------------------------------------------------------

#ifndef CUCKOO_INDEX_ZONE_MAP_H_
#define CUCKOO_INDEX_ZONE_MAP_H_

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <memory>
#include <string>
#include <vector>

#include "absl/memory/memory.h"
#include "absl/strings/str_cat.h"
#include "absl/strings/string_view.h"
#include "data.h"
#include "evaluation_utils.h"
#include "index_structure.h"

namespace ci {

// Stores the minimum and maximum value of each stripe.
class ZoneMap : public IndexStructure {
 public:
  ZoneMap(const std::vector<int>& data, std::size_t num_rows_per_stripe) {
    if (data.size() % num_rows_per_stripe != 0) {
      std::cout << "WARNING: Number of values is not a multiple of "
                   "`num_rows_per_stripe`. Ignoring last stripe."
                << std::endl;
    }
    num_stripes_ = data.size() / num_rows_per_stripe;
    minimums_.reserve(num_stripes_);
    maximums_.reserve(num_stripes_);
    for (size_t stripe_id = 0; stripe_id < num_stripes_; ++stripe_id) {
      const std::size_t stripe_begin = num_rows_per_stripe * stripe_id;
      const std::size_t stripe_end = stripe_begin + num_rows_per_stripe;

      int per_stripe_min = std::numeric_limits<int>::max();
      int per_stripe_max = std::numeric_limits<int>::min();

      for (size_t row_id = stripe_begin; row_id < stripe_end; ++row_id) {
        // NULL values do not contribute to the stripe's range.
        if (data[row_id] == Column::kIntNullSentinel) continue;

        per_stripe_min = std::min(per_stripe_min, data[row_id]);
        per_stripe_max = std::max(per_stripe_max, data[row_id]);
      }

      minimums_.push_back(per_stripe_min);
      maximums_.push_back(per_stripe_max);
    }
  }

  ZoneMap(const Column& column, std::size_t num_rows_per_stripe)
      : ZoneMap(column.data(), num_rows_per_stripe) {}

  void PrintZones() {
    for (size_t i = 0; i < num_stripes_; ++i) {
      std::cout << "Stripe " << i << ": " << minimums_[i] << ", "
                << maximums_[i] << std::endl;
    }
  }

  bool StripeContains(std::size_t stripe_id, int value) const override {
    if (stripe_id >= num_stripes_) {
      std::cerr << "`stripe_id` is out of bounds." << std::endl;
      exit(EXIT_FAILURE);
    }
    return value >= minimums_[stripe_id] && value <= maximums_[stripe_id];
  }

  std::string name() const override { return "ZoneMap"; }

  size_t byte_size() const override {
    return sizeof(int) * minimums_.size() + sizeof(int) * maximums_.size();
  }

  size_t compressed_byte_size() const override {
    std::string data;
    absl::StrAppend(&data, absl::string_view(
                               reinterpret_cast<const char*>(minimums_.data()),
                               sizeof(minimums_[0]) * minimums_.size()));
    absl::StrAppend(&data, absl::string_view(
                               reinterpret_cast<const char*>(maximums_.data()),
                               sizeof(maximums_[0]) * maximums_.size()));

    return Compress(data).size();
  }

  std::size_t num_stripes() { return num_stripes_; }

 private:
  std::size_t num_stripes_;
  std::vector<int> minimums_, maximums_;
};

class ZoneMapFactory : public IndexStructureFactory {
 public:
  std::unique_ptr<IndexStructure> Create(
      const Column& column, size_t num_rows_per_stripe) const override {
    return absl::make_unique<ZoneMap>(column.data(), num_rows_per_stripe);
  }

  std::string index_name() const override { return "ZoneMap"; }
};

}  // namespace ci

#endif  // CUCKOO_INDEX_ZONE_MAP_H_

--------------------------------------------------------------------------------
/zone_map_test.cc:
--------------------------------------------------------------------------------
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//
// -----------------------------------------------------------------------------
// File: zone_map_test.cc
// -----------------------------------------------------------------------------

#include "zone_map.h"

#include "gtest/gtest.h"

namespace ci {

TEST(ZoneMapTest, StripeContainsSequentialValues) {
  ZoneMap zone_map(/*data=*/{1, 2, 3, 4}, /*num_rows_per_stripe=*/2);

  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/1));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/3));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/4));

  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/0));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/3));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/2));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/5));
}

TEST(ZoneMapTest, StripeContainsShuffledValues) {
  ZoneMap zone_map(/*data=*/{2, 1, 4, 3}, /*num_rows_per_stripe=*/2);

  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/1));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/3));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/4));

  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/0));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/3));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/2));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/5));
}

TEST(ZoneMapTest, StripeContainsDuplicateValues) {
  ZoneMap zone_map(/*data=*/{1, 1, 2, 2}, /*num_rows_per_stripe=*/2);

  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/1));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/2));

  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/0));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/1));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/3));
}

TEST(ZoneMapTest, NullValuesAreIgnored) {
  ZoneMap zone_map(
      /*data=*/{1, Column::kIntNullSentinel, 3, 4, Column::kIntNullSentinel, 6},
      /*num_rows_per_stripe=*/3);

  // Check the first stripe: the range is [1, 3], so 2 is reported as possibly
  // present even though it was never inserted.
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/0));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/2));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/0, /*value=*/4));

  // Check the second stripe: the range is [4, 6].
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/2));
  EXPECT_TRUE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/5));
  EXPECT_FALSE(zone_map.StripeContains(/*stripe_id=*/1, /*value=*/7));
}

}  // namespace ci

--------------------------------------------------------------------------------
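The zone-map behaviour exercised by the tests above (per-stripe minimum and maximum, with NULLs skipped) can be sketched without the repository's dependencies. This is a minimal illustration under stated assumptions, not the repository's `ZoneMap`: `ZoneSketch`, `BuildZones`, and `MayContain` are hypothetical names, and `null_sentinel` stands in for `Column::kIntNullSentinel`, whose actual value is not shown in this section.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for ci::ZoneMap: one [min, max] range per stripe.
struct ZoneSketch {
  std::vector<int> mins;
  std::vector<int> maxs;
};

// Builds per-stripe ranges. `null_sentinel` is an assumed placeholder for
// Column::kIntNullSentinel. Rows beyond the last full stripe are dropped.
inline ZoneSketch BuildZones(const std::vector<int>& data,
                             std::size_t rows_per_stripe, int null_sentinel) {
  ZoneSketch zones;
  if (rows_per_stripe == 0) return zones;  // Avoid dividing by zero.
  const std::size_t num_stripes = data.size() / rows_per_stripe;
  for (std::size_t s = 0; s < num_stripes; ++s) {
    int lo = std::numeric_limits<int>::max();
    int hi = std::numeric_limits<int>::min();
    for (std::size_t i = s * rows_per_stripe; i < (s + 1) * rows_per_stripe;
         ++i) {
      if (data[i] == null_sentinel) continue;  // NULLs do not widen the range.
      lo = std::min(lo, data[i]);
      hi = std::max(hi, data[i]);
    }
    zones.mins.push_back(lo);
    zones.maxs.push_back(hi);
  }
  return zones;
}

// Returns true if `value` lies inside the stripe's range. A true result only
// means the stripe may contain the value; out-of-range stripes return false.
inline bool MayContain(const ZoneSketch& zones, std::size_t stripe, int value) {
  return stripe < zones.mins.size() && value >= zones.mins[stripe] &&
         value <= zones.maxs[stripe];
}
```

Unlike `ZoneMap::StripeContains`, which exits on an out-of-bounds `stripe_id`, this sketch returns `false` for it; that is a simplification chosen for the example, not the repository's behaviour.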