├── .bazelrc
├── .clang-format
├── .gitignore
├── README.md
├── WORKSPACE
├── examples
│   ├── BUILD
│   ├── cache-lines.cc
│   ├── circular-buffer-test.cc
│   ├── circular-buffer.h
│   ├── hello-world.cc
│   ├── lock-striping.cc
│   ├── pointer-tagging.cc
│   └── power-of-two.cc
└── graphs
    ├── sysprog-false-sharing.png
    └── sysprog-lock-striping.png

--------------------------------------------------------------------------------
/.bazelrc:
--------------------------------------------------------------------------------
build --cxxopt='-std=c++17'

--------------------------------------------------------------------------------
/.clang-format:
--------------------------------------------------------------------------------
---
Language: Cpp
BasedOnStyle: Google
...

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# gitignore template for Bazel build system
# website: https://bazel.build/

# Ignore all bazel-* symlinks. There is no full list since this can change
# based on the name of the directory bazel is cloned into.
/bazel-*

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Common Systems Programming Optimizations & Tricks

A listing of some common optimization techniques and tricks for doing "systems
programming" to make your code run faster, be more efficient, and look cooler
to other people.


## Cache Lines & False Sharing

CPUs move memory between cores and caches in fixed-size chunks called cache
lines (64 bytes on the machine benchmarked here). When two threads on
different cores repeatedly write to distinct variables that happen to share a
cache line, each write invalidates the other core's copy of the line, and both
threads pay for cache coherence traffic they don't logically need. This is
*false sharing*.

To show the effect in practice, we benchmark two different structs of
std::atomic counters, `NormalCounters` and `CacheLineAwareCounters`.

```
// NormalCounters is a straightforward, naive implementation of a struct of
// counters.
struct ABSL_CACHELINE_ALIGNED NormalCounters {
  std::atomic<int64> success{0};
  std::atomic<int64> failure{0};
  std::atomic<int64> okay{0};
  std::atomic<int64> meh{0};
};

// CacheLineAwareCounters forces each counter onto a separate cache line to
// avoid any false sharing between the counters.
struct ABSL_CACHELINE_ALIGNED CacheLineAwareCounters {
  ABSL_CACHELINE_ALIGNED std::atomic<int64> success{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> failure{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> okay{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> meh{0};
};
```

The benchmark uses either 1, 2, 3, or 4 threads, each bumping a separate atomic
counter inside the struct 64K times.
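If you are not using Abseil, the same padding can be written in portable C++
with `alignas`. A minimal sketch, assuming a 64-byte cache line; the
`kCacheLineSize` constant and the `PaddedCounters` name are illustrative, not
part of this repo's examples:

```
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; matches the 64 bytes the benchmark binary below
// reports.
constexpr std::size_t kCacheLineSize = 64;

// Aligning each member to a cache line also pads the struct so every counter
// sits alone on its own line, eliminating false sharing between them.
struct PaddedCounters {
  alignas(kCacheLineSize) std::atomic<std::int64_t> success{0};
  alignas(kCacheLineSize) std::atomic<std::int64_t> failure{0};
  alignas(kCacheLineSize) std::atomic<std::int64_t> okay{0};
  alignas(kCacheLineSize) std::atomic<std::int64_t> meh{0};
};

static_assert(sizeof(PaddedCounters) == 4 * kCacheLineSize,
              "each counter should occupy exactly one cache line");
```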
Here are the results on a 2013 MacBook Pro:

```
$ bazel run -c opt //examples:cache-lines
Executing tests from //examples:cache-lines
-----------------------------------------------------------------------------
Cache Line Size: 64
sizeof(NormalCounters) = 64
sizeof(CacheLineAwareCounters) = 256
2019-08-13 01:16:18
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
---------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
---------------------------------------------------------------------------
BM_NormalCounters/threads:1           389 us          387 us         1812
BM_NormalCounters/threads:2          1264 us         2517 us          234
BM_NormalCounters/threads:3          1286 us         3718 us          225
BM_NormalCounters/threads:4          1073 us         3660 us          204
BM_CacheLineAwareCounters/threads:1   386 us          385 us         1799
BM_CacheLineAwareCounters/threads:2   200 us          400 us         1658
BM_CacheLineAwareCounters/threads:3   208 us          581 us         1152
BM_CacheLineAwareCounters/threads:4   193 us          721 us         1008
```

![Graph of CPU Time vs. # of Threads](/graphs/sysprog-false-sharing.png)

## The Magic Power of 2

On current hardware, division and modulo are among the most expensive
instructions, where "expensive" here means "longest latency".
[Agner Fog's listing of instruction latencies](https://www.agner.org/optimize/instruction_tables.pdf)
gives Intel Skylake's `DIV` instruction on two 64-bit registers a latency of
35-88 cycles, compared to a latency of 1 cycle for the `ADD` instruction on
the same two 64-bit registers.

The trick is that when the divisor is a power of two, modulo reduces to a
bitwise AND: for any power of two `d`, `n % d == n & (d - 1)`.

To show how much faster using a bitmask is than using division, we measure
executing 1M modulo operations versus executing 1M bitmasks.

```
$ bazel run -c opt //examples:power-of-two
Executing tests from //examples:power-of-two
-----------------------------------------------------------------------------
2019-08-25 20:17:22
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------
Benchmark              Time             CPU   Iterations
--------------------------------------------------------
BM_Mod              9347 us         9298 us           74
BM_BitMask           331 us          329 us         2123
BM_RightShift        338 us          328 us         2089
BM_RightShiftBy1     337 us          331 us         2132
BM_DivideBy2         363 us          341 us         2111
BM_DivideBy3         662 us          650 us         1055
```

## Lock Striping

Locks can be used for mutual exclusion when you want multiple threads to
access shared data exclusively. The downside, though, is that if the shared
data is frequently accessed and the critical section is non-trivial, your
threads can spend most of their time contending on the lock instead of
actually doing work.

One solution is *lock striping*: chunk up the data and use a different lock
for each chunk of data.

To show the performance improvement, we measure a single-lock hash set against
a lock-striped hash set with a varying number of chunks, inserting 1M items.
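The core of the lock-striped set, condensed from `examples/lock-striping.cc`
below: the key selects a chunk, and each chunk is guarded by its own mutex, so
two threads only block each other when they touch the same chunk.

```
template <size_t kNumChunks>
class LockStripedHashSet {
 public:
  void Insert(uint64 i) {
    // Pick the chunk for this key, then lock only that chunk's mutex.
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    hash_set_[idx].insert(i);
  }

 private:
  std::array<absl::Mutex, kNumChunks> mu_;
  std::array<absl::flat_hash_set<uint64>, kNumChunks> hash_set_;
};
```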
```
$ bazel run -c opt //examples:lock-striping
Executing tests from //examples:lock-striping
-----------------------------------------------------------------------------
2019-08-24 22:24:37
Run on (4 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
--------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
--------------------------------------------------------------------------
BM_SingleLock/threads:1                65 ms           65 ms           11
BM_SingleLock/threads:2               140 ms          254 ms            2
BM_SingleLock/threads:3               140 ms          332 ms            3
BM_SingleLock/threads:4               142 ms          405 ms            4
BM_LockStriping_4_Chunks/threads:1     71 ms           69 ms            9
BM_LockStriping_4_Chunks/threads:2     90 ms          178 ms            4
BM_LockStriping_4_Chunks/threads:3     89 ms          248 ms            3
BM_LockStriping_4_Chunks/threads:4     82 ms          299 ms            4
BM_LockStriping_8_Chunks/threads:1     70 ms           69 ms           10
BM_LockStriping_8_Chunks/threads:2     74 ms          143 ms            4
BM_LockStriping_8_Chunks/threads:3     71 ms          198 ms            3
BM_LockStriping_8_Chunks/threads:4     60 ms          200 ms            4
```

Graph:

![Graph of CPU Time vs. # of Threads](/graphs/sysprog-lock-striping.png)

--------------------------------------------------------------------------------
/WORKSPACE:
--------------------------------------------------------------------------------
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

# Abseil LTS archive version 20210324.2
http_archive(
    name = "com_google_absl",
    urls = ["https://github.com/abseil/abseil-cpp/archive/refs/tags/20210324.2.zip"],
    strip_prefix = "abseil-cpp-20210324.2",
    sha256 = "1a7edda1ff56967e33bc938a4f0a68bb9efc6ba73d62bb4a5f5662463698056c",
)

# glog archive.
http_archive(
    name = "com_github_google_glog",
    urls = ["https://github.com/google/glog/archive/refs/tags/v0.5.0.zip"],
    strip_prefix = "glog-0.5.0",
    sha256 = "21bc744fb7f2fa701ee8db339ded7dce4f975d0d55837a97be7d46e8382dea5a",
)

# gflags archive.
http_archive(
    name = "com_github_gflags_gflags",
    strip_prefix = "gflags-2.2.2",
    urls = [
        "https://mirror.bazel.build/github.com/gflags/gflags/archive/v2.2.2.tar.gz",
        "https://github.com/gflags/gflags/archive/v2.2.2.tar.gz",
    ],
)

# googletest archive.
http_archive(
    name = "com_github_google_googletest",
    strip_prefix = "googletest-90a443f9c2437ca8a682a1ac625eba64e1d74a8a",
    urls = ["https://github.com/google/googletest/archive/90a443f9c2437ca8a682a1ac625eba64e1d74a8a.zip"],
    sha256 = "6fb9a49ad77656c860cfdafbb3148a91f076a3a8bda9c6d8809075c832549dd4",
)

# Bazel toolchains
http_archive(
    name = "bazel_toolchains",
    urls = [
        "https://mirror.bazel.build/github.com/bazelbuild/bazel-toolchains/archive/bc09b995c137df042bb80a395b73d7ce6f26afbe.tar.gz",
        "https://github.com/bazelbuild/bazel-toolchains/archive/bc09b995c137df042bb80a395b73d7ce6f26afbe.tar.gz",
    ],
    strip_prefix = "bazel-toolchains-bc09b995c137df042bb80a395b73d7ce6f26afbe",
    sha256 = "4329663fe6c523425ad4d3c989a8ac026b04e1acedeceb56aa4b190fa7f3973c",
)

# GoogleTest/GoogleMock framework. Used by most unit-tests.
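# Note: googletest ends up fetched twice under two different repository
# names. The //examples BUILD file depends on this @com_google_googletest
# archive; the @com_github_google_googletest archive above is not referenced
# by any BUILD file in this repo.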
http_archive(
    name = "com_google_googletest",
    urls = ["https://github.com/google/googletest/archive/b6cd405286ed8635ece71c72f118e659f4ade3fb.zip"],  # 2019-01-07
    strip_prefix = "googletest-b6cd405286ed8635ece71c72f118e659f4ade3fb",
    sha256 = "ff7a82736e158c077e76188232eac77913a15dac0b22508c390ab3f88e6d6d86",
)

# Google benchmark.
http_archive(
    name = "com_github_google_benchmark",
    urls = ["https://github.com/google/benchmark/archive/16703ff83c1ae6d53e5155df3bb3ab0bc96083be.zip"],
    strip_prefix = "benchmark-16703ff83c1ae6d53e5155df3bb3ab0bc96083be",
    sha256 = "59f918c8ccd4d74b6ac43484467b500f1d64b40cc1010daa055375b322a43ba3",
)

--------------------------------------------------------------------------------
/examples/BUILD:
--------------------------------------------------------------------------------
cc_binary(
    name = "hello-world",
    srcs = ["hello-world.cc"],
    deps = [
        "@com_google_absl//absl/strings",
        "@com_github_google_glog//:glog",
    ],
)

cc_test(
    name = "cache-lines",
    srcs = ["cache-lines.cc"],
    tags = ["benchmark"],
    deps = [
        "@com_github_google_benchmark//:benchmark_main",
        "@com_google_absl//absl/synchronization",
    ],
)

cc_test(
    name = "lock-striping",
    srcs = ["lock-striping.cc"],
    tags = ["benchmark"],
    deps = [
        "@com_github_google_benchmark//:benchmark_main",
        "@com_google_absl//absl/container:flat_hash_set",
        "@com_google_absl//absl/synchronization",
    ],
)

cc_library(
    name = "circular-buffer",
    # The header goes in hdrs (not srcs) so that dependents like
    # :circular-buffer-test can include it.
    hdrs = ["circular-buffer.h"],
)

cc_test(
    name = "circular-buffer-test",
    srcs = ["circular-buffer-test.cc"],
    tags = ["benchmark"],
    deps = [
        ":circular-buffer",
        "@com_github_google_benchmark//:benchmark_main",
    ],
)

cc_test(
    name = "power-of-two",
    srcs = ["power-of-two.cc"],
    tags = ["benchmark"],
    deps = [
        "@com_github_google_benchmark//:benchmark_main",
        "@com_google_absl//absl/random",
    ],
)

cc_test(
    name = "pointer-tagging",
    srcs = ["pointer-tagging.cc"],
    tags = ["benchmark"],
    deps = [
        "@com_github_google_benchmark//:benchmark_main",
        "@com_google_absl//absl/memory",
        "@com_google_googletest//:gtest_main",
    ],
)

--------------------------------------------------------------------------------
/examples/cache-lines.cc:
--------------------------------------------------------------------------------
#include <atomic>
#include <cstdint>
#include <iostream>

#include "absl/base/optimization.h"
#include "absl/synchronization/notification.h"
#include "benchmark/benchmark.h"

using int64 = int64_t;

// NormalCounters is a straightforward, naive implementation of a struct of
// counters.
// Note: We also use ABSL_CACHELINE_ALIGNED on the NormalCounters struct, but
// not its members, so that the entire struct will be aligned to a cache line.
// Otherwise the struct might be placed towards the end of a cache line,
// accidentally straddling two cache lines, thereby reducing the false sharing
// and improving its performance, which would muddy the comparison.
struct ABSL_CACHELINE_ALIGNED NormalCounters {
  std::atomic<int64> success{0};
  std::atomic<int64> failure{0};
  std::atomic<int64> okay{0};
  std::atomic<int64> meh{0};
};

// CacheLineAwareCounters forces each counter onto a separate cache line to
// avoid any false sharing between the counters.
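// With a 64-byte cache line, padding each of the four counters out to its own
// line grows the struct from 64 bytes to 256 bytes, which matches the sizes
// the binary prints at startup.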
// Note: We must use ABSL_CACHELINE_ALIGNED for each member, since we want to
// pad every single counter so it will be forced onto its own separate cache
// line.
struct ABSL_CACHELINE_ALIGNED CacheLineAwareCounters {
  ABSL_CACHELINE_ALIGNED std::atomic<int64> success{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> failure{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> okay{0};
  ABSL_CACHELINE_ALIGNED std::atomic<int64> meh{0};
};

namespace sysprog {
namespace {

template <typename T>
std::atomic<int64>* getCounter(T& counters, int i) {
  switch (i) {
    case 0:
      return &counters.success;
    case 1:
      return &counters.failure;
    case 2:
      return &counters.okay;
    case 3:
      return &counters.meh;
    default:
      return nullptr;
  }
}

static_assert(
    sizeof(NormalCounters) != sizeof(CacheLineAwareCounters),
    "NormalCounters and CacheLineAwareCounters should have different sizes due "
    "to aligning members to different cache lines -- otherwise benchmarks will "
    "not show the difference in performance.");

constexpr int64 kNumIncrements = int64{1} << 16;

void BM_NormalCounters(benchmark::State& state) {
  // Make the counters static so that each thread will use the same counters.
  static NormalCounters counters;
  std::atomic<int64>* counter = getCounter(counters, state.thread_index);
  *counter = 0;
  for (auto _ : state) {
    for (int64 i = 0; i < kNumIncrements; i++) {
      counter->fetch_add(1, std::memory_order_relaxed);
    }
  }
  benchmark::DoNotOptimize(*counter);
}

void BM_CacheLineAwareCounters(benchmark::State& state) {
  // Make the counters static so that each thread will use the same counters.
  static CacheLineAwareCounters counters;
  std::atomic<int64>* counter = getCounter(counters, state.thread_index);
  *counter = 0;
  for (auto _ : state) {
    for (int64 i = 0; i < kNumIncrements; i++) {
      counter->fetch_add(1, std::memory_order_relaxed);
    }
  }
  benchmark::DoNotOptimize(*counter);
}

// Try running with 1, 2, 3, and then 4 threads, all bumping separate counters
// in the given counters struct.
BENCHMARK(BM_NormalCounters)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_CacheLineAwareCounters)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMicrosecond);

}  // namespace
}  // namespace sysprog

int main(int argc, char** argv) {
  std::cerr << "Cache Line Size: " << ABSL_CACHELINE_SIZE << std::endl;
  std::cerr << "sizeof(NormalCounters) = " << sizeof(NormalCounters)
            << std::endl;
  std::cerr << "sizeof(CacheLineAwareCounters) = "
            << sizeof(CacheLineAwareCounters) << std::endl;
  ::benchmark::Initialize(&argc, argv);
  if (::benchmark::ReportUnrecognizedArguments(argc, argv)) return 1;
  ::benchmark::RunSpecifiedBenchmarks();
  return 0;
}

--------------------------------------------------------------------------------
/examples/circular-buffer-test.cc:
--------------------------------------------------------------------------------
#include "examples/circular-buffer.h"

#include <queue>

#include "benchmark/benchmark.h"

namespace sysprog {
namespace {

constexpr int kBufferSize = 2048;

void BM_CircularBuffer_InsertAndPop(benchmark::State& state) {
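  // Fill the buffer to capacity (kBufferSize - 1 items), then drain it, once
  // per benchmark iteration.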
  CircularBuffer<int> buffer(kBufferSize);
  for (auto _ : state) {
    for (int i = 0; i < kBufferSize - 1; i++) {
      buffer.push_back(i);
    }
    while (!buffer.empty()) {
      buffer.pop_front();
    }
    benchmark::DoNotOptimize(buffer);
  }
}

void BM_CircularBuffer_Insert(benchmark::State& state) {
  CircularBuffer<int> buffer(kBufferSize);
  for (auto _ : state) {
    for (int i = 0; i < kBufferSize - 1; i++) {
      buffer.push_back(i);
    }
  }
  benchmark::DoNotOptimize(buffer);
}

void BM_StdQueue_InsertAndPop(benchmark::State& state) {
  std::queue<int> buffer;
  for (auto _ : state) {
    for (int i = 0; i < kBufferSize - 1; i++) {
      buffer.push(i);
    }
    while (!buffer.empty()) {
      buffer.pop();
    }
    benchmark::DoNotOptimize(buffer);
  }
}

void BM_StdQueue_Insert(benchmark::State& state) {
  for (auto _ : state) {
    std::queue<int> buffer;
    for (int i = 0; i < kBufferSize - 1; i++) {
      buffer.push(i);
    }
    benchmark::DoNotOptimize(buffer);
  }
}

BENCHMARK(BM_CircularBuffer_InsertAndPop);
BENCHMARK(BM_CircularBuffer_Insert);

BENCHMARK(BM_StdQueue_InsertAndPop);
BENCHMARK(BM_StdQueue_Insert);

}  // namespace
}  // namespace sysprog

--------------------------------------------------------------------------------
/examples/circular-buffer.h:
--------------------------------------------------------------------------------
#pragma once

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

static_assert(sizeof(size_t) == sizeof(uint64_t), "64-bit only");

// Rounds x up to the nearest power of 2. Guards x <= 1 so the builtin is
// never called with an argument of 0.
constexpr size_t RoundUpToPow2(size_t x) {
  if (x <= 1) return 1;
  return size_t{1} << ((8 * sizeof(size_t)) - __builtin_clzl(x - 1));
}

// Simple fixed-size circular or ring buffer. The buffer can hold size - 1
// items, and size is rounded up to the nearest power of 2.
template <typename T>
class CircularBuffer {
 public:
  CircularBuffer(size_t size) : size_(RoundUpToPow2(size)), mask_(size_ - 1) {
    // resize, not reserve: the slots are accessed by index, so they must be
    // constructed.
    buf_.resize(size_);
  }

  void push_back(T item) {
    buf_[write_] = std::move(item);
    write_ = (write_ + 1) & mask_;
  }

  T pop_front() {
    // Move the value out and advance the read index; the vector keeps owning
    // the (moved-from) slot until a later push_back overwrites it.
    T ret = std::move(buf_[read_]);
    read_ = (read_ + 1) & mask_;
    return ret;
  }

  T peek() { return buf_[read_]; }

  bool empty() { return read_ == write_; }

  bool full() { return read_ == ((write_ + 1) & mask_); }

 private:
  size_t size_ = 0;
  size_t read_ = 0;
  size_t write_ = 0;
  size_t mask_ = 0;
  std::vector<T> buf_;
};

--------------------------------------------------------------------------------
/examples/hello-world.cc:
--------------------------------------------------------------------------------
#include <iostream>
#include <string>
#include <vector>

#include "absl/strings/str_join.h"
#include "glog/logging.h"

int main(int argc, char* argv[]) {
  // Initialize Google's logging library.
  google::InitGoogleLogging(argv[0]);

  // Join some strings.
  std::vector<std::string> v = {"foo", "bar", "baz"};
  std::string s = absl::StrJoin(v, ", ");

  LOG(ERROR) << "Joined string: " << s << "\n";

  return 0;
}

--------------------------------------------------------------------------------
/examples/lock-striping.cc:
--------------------------------------------------------------------------------
#include <array>
#include <cstdint>

#include "absl/container/flat_hash_set.h"
#include "absl/synchronization/mutex.h"
#include "benchmark/benchmark.h"

namespace sysprog {
namespace {

using uint64 = uint64_t;

class ThreadSafeHashSet {
 public:
  void Insert(uint64 i) {
    absl::MutexLock lock(&mu_);
    hash_set_.insert(i);
  }

  bool Contains(uint64 i) {
    absl::MutexLock lock(&mu_);
    return hash_set_.contains(i);
  }

 private:
  absl::Mutex mu_;
  absl::flat_hash_set<uint64> hash_set_;
};

template <size_t kNumChunks>
class LockStripedHashSet {
 public:
  void Insert(uint64 i) {
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    hash_set_[idx].insert(i);
  }

  bool Contains(uint64 i) {
    const size_t idx = i % kNumChunks;
    absl::MutexLock lock(&mu_[idx]);
    return hash_set_[idx].contains(i);
  }

 private:
  std::array<absl::Mutex, kNumChunks> mu_;
  std::array<absl::flat_hash_set<uint64>, kNumChunks> hash_set_;
};

constexpr uint64 kNumIters = 1 << 20;

void BM_SingleLock(benchmark::State& state) {
  // Make static so that each thread will use the same object.
  static ThreadSafeHashSet hash_set;

  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      hash_set.Insert(i);
    }
  }
  benchmark::DoNotOptimize(hash_set);
}

void BM_LockStriping_2_Chunks(benchmark::State& state) {
  // Make static so that each thread will use the same object.
  static LockStripedHashSet<2> hash_set;

  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      hash_set.Insert(i);
    }
  }
  benchmark::DoNotOptimize(hash_set);
}

void BM_LockStriping_4_Chunks(benchmark::State& state) {
  // Make static so that each thread will use the same object.
  static LockStripedHashSet<4> hash_set;

  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      hash_set.Insert(i);
    }
  }
  benchmark::DoNotOptimize(hash_set);
}

void BM_LockStriping_8_Chunks(benchmark::State& state) {
  // Make static so that each thread will use the same object.
  static LockStripedHashSet<8> hash_set;

  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      hash_set.Insert(i);
    }
  }
  benchmark::DoNotOptimize(hash_set);
}

// Try running with a varying number of threads to show how lock contention
// affects throughput.
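// Note: every thread inserts the same keys [0, kNumIters), and the chunk is
// chosen by i % kNumChunks, so threads block each other only when they happen
// to hit the same chunk at the same moment.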
BENCHMARK(BM_SingleLock)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMillisecond);
BENCHMARK(BM_LockStriping_2_Chunks)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMillisecond);
BENCHMARK(BM_LockStriping_4_Chunks)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMillisecond);
BENCHMARK(BM_LockStriping_8_Chunks)
    ->Threads(1)
    ->Threads(2)
    ->Threads(3)
    ->Threads(4)
    ->Unit(benchmark::kMillisecond);

}  // namespace
}  // namespace sysprog

--------------------------------------------------------------------------------
/examples/pointer-tagging.cc:
--------------------------------------------------------------------------------
#include <cstdint>
#include <string>

#include "absl/memory/memory.h"
#include "benchmark/benchmark.h"
#include "gtest/gtest.h"

namespace sysprog {
namespace {

using uint64 = uint64_t;

struct SomeData {
  std::string foo;
  uint64 bar;
};

// On x86-64, user-space addresses leave the top bit clear, so it can carry a
// "dirty" tag that is masked off before the pointer is dereferenced.
constexpr uintptr_t kDirtyBit = 0x8000000000000000;
constexpr uintptr_t kPtrMask = 0x7fffffffffffffff;
static_assert(sizeof(uintptr_t) == 8, "Only works on 64-bit systems");

template <typename T>
uintptr_t MarkDirty(T* t) {
  return reinterpret_cast<uintptr_t>(t) | kDirtyBit;
}

bool IsDirty(uintptr_t t) { return t & kDirtyBit; }

template <typename T>
T* GetPointer(uintptr_t t) {
  return reinterpret_cast<T*>(t & kPtrMask);
}

TEST(PointerTagging, RoundTrip) {
  auto data = absl::make_unique<SomeData>();
  SomeData* raw = data.get();
  raw->foo = "abc";
  raw->bar = 123;

  uintptr_t d = MarkDirty(raw);
  EXPECT_TRUE(IsDirty(d));

  raw = GetPointer<SomeData>(d);
  EXPECT_EQ(raw->foo, "abc");
  EXPECT_EQ(raw->bar, 123);
}

void BM_RoundTrip(benchmark::State& state) {
  auto data = absl::make_unique<SomeData>();
  data->foo = "abc";
  data->bar = 123;
  for (auto _ : state) {
    uintptr_t d = MarkDirty(data.get());
    if (IsDirty(d)) {
      SomeData* p = GetPointer<SomeData>(d);
      p->foo = "def";
      p->bar = 456;
    }
  }
}

BENCHMARK(BM_RoundTrip);

}  // namespace
}  // namespace sysprog

--------------------------------------------------------------------------------
/examples/power-of-two.cc:
--------------------------------------------------------------------------------
#include <cstdint>

#include "absl/random/random.h"
#include "benchmark/benchmark.h"

namespace sysprog {
namespace {

using uint64 = uint64_t;
constexpr uint64 kNumIters = uint64{1} << 20;

bool isPowerOfTwo(uint64 n) { return (n & (n - 1)) == 0; }

uint64 randomNonPowerOfTwo() {
  static auto gen = absl::BitGen();
  uint64 n = gen();
  while (isPowerOfTwo(n)) {
    n = gen();
  }
  return n;
}

uint64 randomPowerOfTwo() {
  static auto gen = absl::BitGen();
  const uint64 n = gen();
  const uint64 shift = n % 63;
  return uint64{2} << shift;
}

uint64 getNum() {
  static auto gen = absl::BitGen();
  return gen();
}

void BM_Mod(benchmark::State& state) {
  uint64 num = getNum();
  uint64 divisor = randomNonPowerOfTwo();
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      uint64 res = num % divisor;
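      // Force res to be materialized so the compiler cannot optimize the
      // modulo away.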
      benchmark::DoNotOptimize(res);
    }
  }
  benchmark::DoNotOptimize(num);
}

void BM_BitMask(benchmark::State& state) {
  uint64 num = getNum();
  uint64 divisor = randomPowerOfTwo();
  uint64 mask = divisor - 1;
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      uint64 res = num & mask;
      benchmark::DoNotOptimize(res);
    }
  }
}

void BM_RightShift(benchmark::State& state) {
  uint64 num = getNum();
  uint64 divisor = randomPowerOfTwo();
  // For a power-of-two divisor, dividing is shifting right by log2(divisor),
  // which is the number of trailing zero bits.
  uint64 shift = __builtin_ctzll(divisor);
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      uint64 res = num >> shift;
      benchmark::DoNotOptimize(res);
    }
  }
}

void BM_RightShiftBy1(benchmark::State& state) {
  uint64 num = getNum();
  benchmark::DoNotOptimize(num);
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      benchmark::DoNotOptimize(num >> 1);
    }
  }
}

void BM_DivideBy2(benchmark::State& state) {
  uint64 num = absl::BitGen()();
  benchmark::DoNotOptimize(num);
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      benchmark::DoNotOptimize(num / 2);
    }
  }
}

void BM_DivideBy3(benchmark::State& state) {
  uint64 num = absl::BitGen()();
  benchmark::DoNotOptimize(num);
  for (auto _ : state) {
    for (uint64 i = 0; i < kNumIters; i++) {
      benchmark::DoNotOptimize(num / 3);
    }
  }
}

BENCHMARK(BM_Mod)->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_BitMask)->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_RightShift)->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_RightShiftBy1)->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_DivideBy2)->Unit(benchmark::kMicrosecond);
BENCHMARK(BM_DivideBy3)->Unit(benchmark::kMicrosecond);

}  // namespace
}  // namespace sysprog

--------------------------------------------------------------------------------
/graphs/sysprog-false-sharing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/paulcavallaro/systems-programming/c20a48c00d7fb257a7ecbab08cdcca8efc8438d6/graphs/sysprog-false-sharing.png

--------------------------------------------------------------------------------
/graphs/sysprog-lock-striping.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/paulcavallaro/systems-programming/c20a48c00d7fb257a7ecbab08cdcca8efc8438d6/graphs/sysprog-lock-striping.png
--------------------------------------------------------------------------------