├── LICENSE
├── README.md
└── consistent_hashing.h

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2016 Phaistos Networks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
This is a C++14 implementation of [Consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing), abstracted as a `Ring` of tokens, with a `ring_segment` data structure that represents a segment of the ring. [We](http://phaistosnetworks.gr/) have been using this implementation for many years in building multiple distributed systems, including our massive-scale, high-performance distributed store (CloudDS).

Please check the comments in the single header file for how to use the data structures and their APIs, and how they work.
It is pretty trivial to use, and various useful methods are implemented for building robust distributed services. You should also check the wiki, starting with the [transition plan page](https://github.com/phaistos-networks/ConsistentHashing/wiki/Transition-Plan).

### Using it in your Project
Just include [consistent_hashing.h](https://github.com/phaistos-networks/ConsistentHashing/blob/master/consistent_hashing.h), and make sure you pass `-std=c++14` (or a newer standard) as a compiler option.

If you are going to need more than 2^64 ring tokens, and you probably should, you will need a struct or class to represent a token (because a `uint64_t` won't suffice). In this case, you need to implement a few things:

- You need a `TrivialCmp()` implementation for your token type. This should return < 0, 0, or > 0 depending on the comparison result of two tokens. C++20 introduced the [spaceship operator](https://en.wikipedia.org/wiki/Three-way_comparison), which solves this more elegantly, but this library targets C++14, so for now this will have to do.
```cpp
template<>
inline int8_t TrivialCmp(const hugetoken_t &a, const hugetoken_t &b)
{
    // return comparison result
}
```

- You will need to implement an appropriate `std::numeric_limits<hugetoken_t>::min()`, like so:
```cpp
namespace std
{
    template<>
    struct numeric_limits<hugetoken_t>
    {
        static inline const hugetoken_t min()
        {
            // return minimum possible token (e.g. 0)
        }
    };
}
```

- Implement appropriate `std::min()` and `std::max()` functions, like so:
```cpp
namespace std
{
    template<>
    inline const hugetoken_t &min(const hugetoken_t &a, const hugetoken_t &b)
    {
        // return whichever is lower
    }

    template<>
    inline const hugetoken_t &max(const hugetoken_t &a, const hugetoken_t &b)
    {
        // return whichever is higher
    }
}
```

- Finally, you should implement appropriate `operator==`, `operator!=`, `operator<` and `operator>` methods for your `hugetoken_t`.

This is really not that much work, and chances are you are already doing that anyway to support other needs of your codebase.

### Example
Please read about [transition plans](https://github.com/phaistos-networks/ConsistentHashing/wiki/Transition-Plan) first.

```cpp
#include "consistent_hashing.h"

int main()
{
    using token_t = uint32_t;
    using segment_t = ConsistentHashing::ring_segment<token_t>;
    using ring_t = ConsistentHashing::Ring<token_t>;
    using node_t = uint32_t;
    // Suppose we have a simple ring, and of those tokens, only
    // one is owned by the node we wish to update (node 1)
    // Each entry is {node, token}
    std::pair<node_t, token_t> ringStructure[] =
        {
            {100, 10},
            {200, 20},
            {300, 30},
            {400, 40},
            {500, 50},
            {600, 60},
            {1, 70}, /* this is the only token owned by node 1 */
            {800, 80},
            {900, 90},
            {1000, 100},
            {150, 110},
            {1200, 120}};
    std::vector<uint32_t> ringTokensNodes, ringTokens;

    // We need all ring tokens, and all associated nodes for those tokens.
    // We also need to collect the tokens owned by the node
    for (auto it = std::begin(ringStructure), end = std::end(ringStructure); it != end; ++it)
    {
        const auto &v = *it;

        ringTokens.push_back(v.second);
        ringTokensNodes.push_back(v.first);
    }

    // This is a simple method that returns the replicas for a given token
    // In practice, you wouldn't allocate any memory here (i.e. no std::vector<> use), you'd
    // pay a lot of attention to performance, and you'd possibly consider physical placement of nodes
    // (i.e. at least one from the local DC and the rest from other DCs)
    const auto replicas_of = [](const auto &ring, const auto ringTokensNodes, const token_t token, node_t *const out) {
        static constexpr uint8_t replicationFactor{2}; // how many copies of each ring segment we need at any given time
        const auto base = ring.index_owner_of(token);  // index in the ring tokens for the token-owner of token (input)
        uint32_t i{base}, n{0};

        // walk the ring clockwise until we have enough nodes to return
        do
        {
            const auto node = ringTokensNodes[i];

            if (std::find(out, out + n, node) == out + n)
            {
                // haven't collected that node yet
                // we only care for distinct nodes
                out[n++] = node;
                if (n == replicationFactor)
                    break;
            }

            i = (i + 1) % ring.size();
        } while (i != base);

        return n;
    };

    // This is our ring
    const ring_t ring(ringTokens.data(), ringTokens.size());
    // those are the changes
    const std::unordered_map<node_t, std::vector<token_t>> topologyUpdates{
        {1, {35, 95}},
        {110, {}}, // will remove node from the ring
        {64, {7}}};
    // Create the transition plan
    auto plan = ring.transition(ringTokensNodes.data(), topologyUpdates, replicas_of);


    for (auto &it : plan)
    {
        const auto segment = it.first;
        const auto &toFrom = it.second;
        const auto target = toFrom.first;
        auto sources = toFrom.second;

        // Let's filter the sources, so that we only pull from the nodes closest to us in terms of node hops
        sources.resize(ring_t::filter_by_distance(sources.data(), sources.data() + sources.size(), [target](const auto node) {
                           return 1; // TODO: return an appropriate distance from target to node
                       }) -
                       sources.data());
    }


    // Please read https://github.com/phaistos-networks/ConsistentHashing/wiki/Transition-Plan
    schedule_transfer(std::move(plan),
                      [ ringTokensNodes, ring, topologyUpdates ]() {
                          const auto newTopology = ring.new_topology(ringTokensNodes.data(), topologyUpdates);

                          switch_to_ring(newTopology);
                      });
    return 0;
}
```
--

With the included data structures and implemented algorithms, it should be trivial to build robust consistent-hashing based replication for your distributed systems.

Have Fun!
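
As a quick illustration of the ownership rule the ring relies on (each token owns the range `(previous token, token]`, with the first token also owning everything past the last one), here is a minimal standalone sketch. It does not use `consistent_hashing.h`: `owner_index()` and `replicas()` are hypothetical helpers written only for this illustration, and the sketch assumes `uint32_t` tokens, a non-empty token array sorted in ascending order, and a replication factor of at least 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of consistent_hashing.h): index of the token
// that owns `key` under the (prev_token, token] rule.
// Assumes `tokens` is non-empty and sorted in ascending order.
static uint32_t owner_index(const std::vector<uint32_t> &tokens, const uint32_t key)
{
    // First token >= key; if there is none, the key wraps around to tokens[0]
    const auto it = std::lower_bound(tokens.begin(), tokens.end(), key);

    return it == tokens.end() ? 0 : static_cast<uint32_t>(it - tokens.begin());
}

// Hypothetical helper: walk the ring clockwise from the owner of `key` and
// collect up to `rf` distinct nodes. nodes[i] is the node that owns tokens[i].
// Assumes rf >= 1 and nodes.size() == tokens.size().
static std::vector<uint32_t> replicas(const std::vector<uint32_t> &tokens,
                                      const std::vector<uint32_t> &nodes,
                                      const uint32_t key, const std::size_t rf)
{
    std::vector<uint32_t> out;
    const uint32_t base = owner_index(tokens, key);
    uint32_t i = base;

    do
    {
        // Only distinct nodes count towards the replication factor
        if (std::find(out.begin(), out.end(), nodes[i]) == out.end())
            out.push_back(nodes[i]);

        i = static_cast<uint32_t>((i + 1) % tokens.size());
    } while (i != base && out.size() < rf);

    return out;
}
```

With tokens `{10, 20, 30, 40}`, key `10` maps to index 0 (the right bound is inclusive), key `11` maps to index 1, and keys `5` and `45` both map to index 0 because the segment `(40, 10]` wraps around. The `Ring::index_owner_of()` method in the header plays the same role as `owner_index()` here.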

--------------------------------------------------------------------------------
/consistent_hashing.h:
--------------------------------------------------------------------------------
// In Consistent Hashing, a Ring is represented as an array of tokens sorted in ascending order, and each of those
// tokens identifies a segment in the ring.
//
// The key property is that every segment owns the ring-space defined by the range:
// (prev_segment.token, segment.token]
// that is, starting at but excluding the token of the previous segment in the ring, up to and including the token of the segment.
//
// The two exceptions are for tokens that are <= the first token in the ring or > the last token in the ring (ring semantics)
// -- For the last segment in the array, its next segment is the first segment in the array
// -- For the first segment in the array, its previous segment is the last segment in the array
//
//
// You should operate on ring segments, and for a typical distributed service, each segment will be owned by a primary replica, and based on
// replication strategies and factors, more (usually, the successor) segments will also get to hold the same segment's data.
#pragma once
#ifdef HAVE_SWITCH
#include "switch.h"
#include "switch_vector.h"
#include
#else
#include
#include
#include
#include
#include
#include
#endif
#include

namespace ConsistentHashing
{
    // Returns lowest index where token <= tokens[index]
    // if it returns cnt, use 0 (because tokens[0] owns (tokens[cnt - 1], tokens[0]])
    template <typename T>
    static uint32_t search(const T *const tokens, const uint32_t cnt, const T token)
    {
        int32_t h = cnt - 1, l{0};

        while (l <= h)
        {
            // This protects from overflows: see http://locklessinc.com/articles/binary_search/
            // The addition can be split up bitwise.
            // The carry between bits can be obtained by
            // the logical-and of the two summands. The resultant bit can be obtained by a XOR
            //
            // https://en.wikipedia.org/wiki/Binary_search_algorithm#Implementation_issues
            // The problem with overflow is that (l + h) may add up to a value greater than INT32_MAX
            // (i.e. it exceeds the range of the integer type used to store the midpoint, even if
            // l and h are within the range). If l and h are non-negative, this can be avoided
            // by calculating the midpoint as: (l + (h - l) / 2)
            //
            // We are not using unsigned integers though -- though we should look for a way
            // to do that so that we could safely use (l + (h - l) / 2)
            // so we can't use (l + h) >> 1 here because (l + h) may result in a negative number
            // and shifting that by >> 1 won't divide that number by two.
            const auto m = (l & h) + ((l ^ h) >> 1);
            const auto v = tokens[m];
            const auto r = TrivialCmp(token, v);

            if (!r)
                return m;
            else if (r < 0)
                h = m - 1;
            else
                l = m + 1;
        }

        return l;
    }

    // A 128-bit token representation
    // You should probably use 128 or more bits for the token space
    struct token128
    {
        uint64_t ms;
        uint64_t ls;

        constexpr token128()
            : ms{0}, ls{0}
        {
        }

        constexpr token128(const uint64_t m, const uint64_t l)
            : ms{m}, ls{l}
        {
        }

        constexpr bool is_minimum() const noexcept
        {
            return ms == 0 && ls == 0;
        }

        constexpr operator bool() const noexcept
        {
            return is_valid();
        }

        constexpr bool is_valid() const noexcept
        {
            return ms || ls;
        }

        constexpr bool operator==(const token128 &o) const noexcept
        {
            return ms == o.ms && ls == o.ls;
        }

        constexpr bool operator!=(const token128 &o) const noexcept
        {
            return ms != o.ms || ls != o.ls;
        }

constexpr bool operator>(const token128 &o) const noexcept 113 | { 114 | return ms > o.ms || (ms == o.ms && ls > o.ls); 115 | } 116 | 117 | constexpr bool operator<(const token128 &o) const noexcept 118 | { 119 | return ms < o.ms || (ms == o.ms && ls < o.ls); 120 | } 121 | 122 | constexpr bool operator>=(const token128 &o) const noexcept 123 | { 124 | return ms > o.ms || (ms == o.ms && ls >= o.ls); 125 | } 126 | 127 | constexpr bool operator<=(const token128 &o) const noexcept 128 | { 129 | return ms < o.ms || (ms == o.ms && ls <= o.ls); 130 | } 131 | 132 | constexpr auto &operator=(const token128 &o) noexcept 133 | { 134 | ms = o.ms; 135 | ls = o.ls; 136 | 137 | return *this; 138 | } 139 | 140 | constexpr void reset() noexcept 141 | { 142 | ms = 0; 143 | ls = 0; 144 | } 145 | }; 146 | 147 | // A segment in a ring. The segment is responsible(owns) the tokens range 148 | // (left, right] i.e left exlusive, right inclusive 149 | // whereas left is the token of the predecessor segment and right is the token of this segment 150 | // See also: https://en.wikipedia.org/wiki/Circular_segment 151 | template 152 | struct ring_segment 153 | { 154 | token_t left; 155 | token_t right; 156 | 157 | constexpr uint64_t span() const noexcept 158 | { 159 | if (wraps()) 160 | { 161 | require(left >= right); 162 | return uint64_t(std::numeric_limits::max()) - left + right; 163 | } 164 | else 165 | { 166 | require(right >= left); 167 | return right - left; 168 | } 169 | } 170 | 171 | constexpr ring_segment() 172 | { 173 | } 174 | 175 | constexpr ring_segment(const token_t l, const token_t r) 176 | : left{l}, right{r} 177 | { 178 | } 179 | 180 | constexpr void set(const token_t l, const token_t r) 181 | { 182 | left = l; 183 | right = r; 184 | } 185 | 186 | // this segment's token 187 | constexpr auto token() const noexcept 188 | { 189 | return right; 190 | } 191 | 192 | constexpr bool operator==(const ring_segment &o) const noexcept 193 | { 194 | return left == o.left && right == 
o.right; 195 | } 196 | 197 | constexpr bool operator!=(const ring_segment &o) const noexcept 198 | { 199 | return left != o.left || right != o.right; 200 | } 201 | 202 | constexpr bool operator<(const ring_segment &o) const noexcept 203 | { 204 | return left < o.left || (left == o.left && right < o.right); 205 | } 206 | 207 | constexpr bool operator>(const ring_segment &o) const noexcept 208 | { 209 | return left > o.left || (left == o.left && right > o.right); 210 | } 211 | 212 | constexpr int8_t cmp(const ring_segment &rhs) const noexcept 213 | { 214 | if (tokens_wrap_around(left, right)) 215 | { 216 | // there is only one segment that wraps around in the ring 217 | return -1; 218 | } 219 | else if (tokens_wrap_around(rhs.left, rhs.right)) 220 | { 221 | // there is only one segment that wraps around in the ring 222 | return 1; 223 | } 224 | else 225 | { 226 | if (right == rhs.right) 227 | return 0; 228 | else if (right > rhs.right) 229 | return 1; 230 | else 231 | return -1; 232 | } 233 | } 234 | 235 | static constexpr bool tokens_wrap_around(const token_t &l, const token_t &r) noexcept 236 | { 237 | // true iff extends from last to the first ring segment 238 | return l >= r; 239 | } 240 | 241 | bool contains(const ring_segment &that) const noexcept 242 | { 243 | if (left == right) 244 | { 245 | // Full ring always contains all other ranges 246 | return true; 247 | } 248 | 249 | const bool thisWraps = tokens_wrap_around(left, right); 250 | const bool thatWraps = tokens_wrap_around(that.left, that.right); 251 | 252 | if (thisWraps == thatWraps) 253 | return left <= that.left && that.right <= right; 254 | else if (thisWraps) 255 | { 256 | // wrapping might contain non-wrapping that is contained if both its tokens are in one of our wrap segments 257 | return left <= that.left || that.right <= right; 258 | } 259 | else 260 | { 261 | // non-wrapping cannot contain wrapping 262 | return false; 263 | } 264 | } 265 | 266 | // masks a segment `mask` from a segment `s`, if 
they intersect, and return 0+ segments 267 | // 268 | // It is very important that we get this right, otherwise other methods that depend on it will produce crap 269 | // returns a pair, where the first is true if the segment was intersected by the mask, false otherwise, and the second 270 | // is the number of segments it was partitioned to (can be 0) 271 | std::pair mask(const ring_segment mask, ring_segment *const out) const noexcept 272 | { 273 | if (false == intersects(mask)) 274 | return {false, 0}; 275 | else if (mask.contains(*this)) 276 | { 277 | // completely masked 278 | return {true, 0}; 279 | } 280 | else 281 | { 282 | // partially masked 283 | uint8_t n{0}; 284 | 285 | if (mask.wraps() || wraps()) 286 | n = mask.difference(*this, out); 287 | else if (mask.right > left) 288 | { 289 | if (mask.left < right && mask.left > left) 290 | out[n++] = {left, mask.left}; 291 | 292 | if (mask.right < right) 293 | out[n++] = {mask.right, right}; 294 | } 295 | 296 | return {true, n}; 297 | } 298 | } 299 | 300 | static void mask_segments_impl(const ring_segment *it, const ring_segment *const end, const std::vector &toExclude, std::vector *const out) 301 | { 302 | ring_segment list[2]; 303 | 304 | for (auto i{it}; i != end; ++i) 305 | { 306 | const auto in = *i; 307 | 308 | for (const auto mask : toExclude) 309 | { 310 | if (const auto res = in.mask(mask, list); res.first) 311 | { 312 | // OK, either completely or partially masked 313 | 314 | if (res.second) 315 | mask_segments_impl(list, list + res.second, toExclude, out); 316 | 317 | goto next; 318 | } 319 | } 320 | 321 | out->push_back(in); 322 | 323 | next:; 324 | } 325 | } 326 | 327 | static void mask_segments(const ring_segment *it, const ring_segment *const end, const std::vector &toExclude, std::vector *const out) 328 | { 329 | if (toExclude.size()) 330 | { 331 | mask_segments_impl(it, end, toExclude, out); 332 | // Just in case (this is cheap) 333 | sort_and_deoverlap(out); 334 | } 335 | else 336 | 
out->insert(out->end(), it, end); 337 | } 338 | 339 | static void mask_segments(const std::vector &in, const std::vector &toExclude, std::vector *const out) 340 | { 341 | mask_segments(in.data(), in.data() + in.size(), toExclude, out); 342 | } 343 | 344 | static auto mask_segments(const std::vector &in, const std::vector &toExclude) 345 | { 346 | std::vector out; 347 | 348 | mask_segments(in.begin(), in.end(), &out); 349 | return out; 350 | } 351 | 352 | // For list of wrapped segments sorted by left token ascending, process the list to produce 353 | // an equivalent set of ranges, sans the overlapping ranges 354 | // it will also merge together ranges 355 | // i.e [(8, 10],(8, 15],(14, 18],(17, 18]] => [ (8, 18] ] 356 | // 357 | // this will only work if the segments are properly sorted. see sort_and_deoverlap() 358 | // This utility method deals with invalid segments as well (e.g you can't really have more than one segments that wrap) 359 | static void deoverlap(std::vector *const segments) 360 | { 361 | auto out = segments->data(); 362 | 363 | for (auto *it = segments->data(), *const end = it + segments->size(); it != end;) 364 | { 365 | auto s = *it; 366 | 367 | if (it->right <= it->left) 368 | { 369 | // This segment wraps 370 | // deal with e.g [30, 4], [35, 8], [40, 2] 371 | // that'd be an invalid list of segments(there can only be one wrapping segment), but we 'll deal with it anyway 372 | const auto wrappedSegmentIt = it; 373 | 374 | for (++it; it != end; ++it) 375 | { 376 | if (it->right > s.right) 377 | s.right = it->right; 378 | } 379 | 380 | // we need to potentially drop some of them segments if the wrapping segment overlaps them 381 | if (wrappedSegmentIt != (it = segments->data()) && s.right >= it->right) 382 | { 383 | s.right = it->right; 384 | memmove(it, it + 1, (out - it) * sizeof(ring_segment)); 385 | --out; 386 | } 387 | 388 | *out++ = s; 389 | break; 390 | } 391 | else 392 | { 393 | for (++it; it != end && ((*it == s) || (it->left >= s.left 
&& s.right > it->left)); ++it) 394 | s.right = it->right; 395 | 396 | if (out == segments->data() || false == out[-1].contains(s)) 397 | { 398 | // deal with (8, 30],(9, 18] 399 | *out++ = s; 400 | } 401 | } 402 | } 403 | 404 | segments->resize(out - segments->data()); 405 | 406 | if (segments->size() == 1 && segments->back().left == segments->back().right) 407 | { 408 | // spans the whole ring 409 | const auto MinTokenValue = std::numeric_limits::min(); 410 | 411 | segments->pop_back(); 412 | segments->push_back({MinTokenValue, MinTokenValue}); 413 | } 414 | } 415 | 416 | // utility method; sorts segments so that deoverlap() can process them 417 | static void sort_and_deoverlap(std::vector *const segments) 418 | { 419 | std::sort(segments->begin(), segments->end(), [](const auto &a, const auto &b) { return a.left < b.left || (a.left == b.left && a.right < b.right); }); 420 | deoverlap(segments); 421 | } 422 | 423 | // Copy of input list, with all segments unwrapped, sorted by left bound, and with overlapping bounds merged 424 | static void normalize(const ring_segment *const segments, const uint32_t segmentsCnt, std::vector *const out) 425 | { 426 | ring_segment res[2]; 427 | 428 | for (uint32_t i{0}; i != segmentsCnt; ++i) 429 | { 430 | if (const uint8_t n = segments[i].unwrap(res)) 431 | out->insert(out->end(), res, res + n); 432 | } 433 | 434 | sort_and_deoverlap(out); 435 | } 436 | 437 | static auto normalize(const ring_segment *const segments, const uint32_t segmentsCnt) 438 | { 439 | std::vector res; 440 | 441 | normalize(segments, segmentsCnt, &res); 442 | return res; 443 | } 444 | 445 | // true iff segment contains the token 446 | bool contains(const token_t &token) const noexcept 447 | { 448 | if (wraps()) 449 | { 450 | // We are wrapping around. Thee interval is (a, b] where a>= b 451 | // then we have 3 cases which hold for any given token k, and we should return true 452 | // 1. a < k 453 | // 2. k <= b 454 | // 3. 
b < k <= a 455 | return token > left || right >= token; 456 | } 457 | else 458 | { 459 | // Range [a,b], a < b 460 | return token > left && right >= token; 461 | } 462 | } 463 | 464 | constexpr bool wraps() const noexcept 465 | { 466 | return tokens_wrap_around(left, right); 467 | } 468 | 469 | inline bool intersects(const ring_segment that) const noexcept 470 | { 471 | ring_segment out[2]; 472 | 473 | return intersection(that, out); 474 | } 475 | 476 | static uint8_t _intersection_of_two_wrapping_segments(const ring_segment &first, const ring_segment &that, ring_segment *intersection) noexcept 477 | { 478 | if (that.right > first.left) 479 | { 480 | intersection[0] = ring_segment(first.left, that.right); 481 | intersection[1] = ring_segment(that.left, first.right); 482 | return 2; 483 | } 484 | else 485 | { 486 | intersection[0] = ring_segment(that.left, first.right); 487 | return 1; 488 | } 489 | } 490 | 491 | static uint8_t _intersection_of_single_wrapping_segment(const ring_segment &wrapping, const ring_segment &other, ring_segment *intersection) noexcept 492 | { 493 | uint8_t size{0}; 494 | 495 | if (other.contains(wrapping.right)) 496 | intersection[size++] = ring_segment(other.left, wrapping.right); 497 | if (other.contains(wrapping.left) && wrapping.left < other.right) 498 | intersection[size++] = ring_segment(wrapping.left, other.right); 499 | 500 | return size; 501 | } 502 | 503 | // Returns the intersection of two segments. That can be two disjoint ranges if one is wrapping and the other is not. 
504 | // e.g for two nodes G and M, and a query range (D, T]; the intersection is (M-T] and (D-G] 505 | // If there is no interesection, an empty list is returned 506 | // 507 | // (12,7)^(5,20) => [(5,7), (12, 20)] 508 | // ring_segment(10, 100).intersection(50, 120) => [ ring_segment(50, 100) ] 509 | // see also mask() 510 | // 511 | // this is the result of the logical operation: ((*this) & that) 512 | uint8_t intersection(const ring_segment &that, ring_segment *out) const noexcept 513 | { 514 | if (that.contains(*this)) 515 | { 516 | *out = *this; 517 | return 1; 518 | } 519 | else if (contains(that)) 520 | { 521 | *out = that; 522 | return 1; 523 | } 524 | else 525 | { 526 | const bool thisWraps = tokens_wrap_around(left, right); 527 | const bool thatWraps = tokens_wrap_around(that.left, that.right); 528 | 529 | if (!thisWraps && !thatWraps) 530 | { 531 | // Neither wraps; fast path 532 | if (!(left < that.right && that.left < right)) 533 | return 0; 534 | 535 | *out = ring_segment(std::max(left, that.left), std::min(right, that.right)); 536 | return 1; 537 | } 538 | else if (thisWraps && thatWraps) 539 | { 540 | // Two wrapping ranges always intersect. 541 | // We have determined that neither this or that contains the other, we are left 542 | // with two possibilities and mirror images of each such case: 543 | // 1. both of s (1,2] endpoints lie in this's (A, B] right segment 544 | // 2. 
only that's start endpoint lies in this's right segment: 545 | if (left < that.left) 546 | return _intersection_of_two_wrapping_segments(*this, that, out); 547 | else 548 | return _intersection_of_two_wrapping_segments(that, *this, out); 549 | } 550 | else if (thisWraps && !thatWraps) 551 | return _intersection_of_single_wrapping_segment(*this, that, out); 552 | else 553 | return _intersection_of_single_wrapping_segment(that, *this, out); 554 | } 555 | } 556 | 557 | // Subtracts a portion of this range 558 | // @contained : The range to subtract from `this`: must be totally contained by this range 559 | // @out: List of ranges left after subtracting contained from `this` (@return value is size of @out) 560 | // 561 | // i.e ring_segment(10, 100).subdvide(ring_segment(50, 55)) => [ ring_segment(10, 50), ring_segment(55, 110) ] 562 | // 563 | // You may want to use mask() instead, which is more powerful and covers wrapping cases, etc 564 | uint8_t subdivide(const ring_segment &contained, ring_segment *const out) const noexcept 565 | { 566 | if (contained.contains(*this)) 567 | { 568 | // contained actually contains this segment 569 | return 0; 570 | } 571 | 572 | uint8_t size{0}; 573 | 574 | if (left != contained.left) 575 | out[size++] = ring_segment(left, contained.left); 576 | if (right != contained.right) 577 | out[size++] = ring_segment(contained.right, right); 578 | return size; 579 | } 580 | 581 | // if this segment wraps, it will return two segments 582 | // 1. (left, std::numeric_limits::min()) 583 | // 2. 
(std::numeric_limits::min(), right) 584 | // otherwise, it will return itself 585 | uint8_t unwrap(ring_segment *const out) const noexcept 586 | { 587 | const auto MinTokenValue = std::numeric_limits::min(); 588 | 589 | if (false == wraps() || right == MinTokenValue) 590 | { 591 | *out = *this; 592 | return 1; 593 | } 594 | else 595 | { 596 | out[0] = ring_segment(left, MinTokenValue); 597 | out[1] = ring_segment(MinTokenValue, right); 598 | return 2; 599 | } 600 | } 601 | 602 | // Compute difference betweet two ring segments 603 | // This is very handy for computing, e.g the segments a node will need to fetch, when moving to a given token 604 | // e.g segment(5, 20).difference(segment(2, 25)) => [ (2, 5), (20, 25) ] 605 | // e.g segment(18, 25).difference(segment(5,20)) => [ (5, 18) ] 606 | // 607 | // In other words, compute the missing segments(ranges) that (*this) is missing from rhs 608 | // There is an opposite operation, mask() 609 | // 610 | // This is the result of the logical operation: (rhs & (~(rhs & (*this))) ) 611 | uint8_t difference(const ring_segment &rhs, ring_segment *const result) const 612 | { 613 | ring_segment intersectionSet[2]; 614 | 615 | switch (intersection(rhs, intersectionSet)) 616 | { 617 | case 0: 618 | // not intersected 619 | *result = rhs; 620 | return 1; 621 | 622 | case 1: 623 | // compute missing sub-segments 624 | return rhs.subdivide(intersectionSet[0], result); 625 | 626 | default: 627 | { 628 | const auto first = intersectionSet[0], second = intersectionSet[1]; 629 | ring_segment tmp[2]; 630 | 631 | rhs.subdivide(first, tmp); 632 | // two intersections; subtracting only one of them will yield a single segment 633 | return tmp[0].subdivide(second, result); 634 | } 635 | } 636 | } 637 | 638 | // split the segment in two, halved at segmentToken value (if segmentToken is contained in segment) 639 | // 640 | // i.e ring_segment(10, 20).split(18) => ( ring_segment(10, 18), ring_segment(18, 20) ) 641 | std::experimental::optional> 
split(const token_t segmentToken) const noexcept 642 | { 643 | if (left == segmentToken || right == segmentToken || !contains(segmentToken)) 644 | return {}; 645 | 646 | return {{ring_segment(left, segmentToken), ring_segment(segmentToken, right)}}; 647 | } 648 | 649 | #ifdef HAVE_SWITCH 650 | void serialize(IOBuffer *const b) const 651 | { 652 | b->Serialize(left); 653 | b->Serialize(right); 654 | } 655 | 656 | void deserialize(ISerializer *const b) const 657 | { 658 | b->Unserialize(&left); 659 | b->Unserialize(&right); 660 | } 661 | #endif 662 | 663 | // Make sure segments is properly ordered and deoverlapped 664 | // see sort_and_deoverlap() 665 | static bool segments_contain(const token_t token, const ring_segment *const segments, const uint32_t cnt) 666 | { 667 | if (!cnt) 668 | return false; 669 | 670 | int32_t h = cnt - 1; 671 | 672 | if (segments[h].wraps()) 673 | { 674 | if (segments[h--].contains(token)) 675 | { 676 | // there can only be one segment that wraps, and that should be the last one (see sort_and_deoverlap() impl.) 
677 | return true; 678 | } 679 | } 680 | 681 | for (int32_t l{0}; l <= h;) 682 | { 683 | const auto m = (l & h) + ((l ^ h) >> 1); 684 | const auto segment = segments[m]; 685 | 686 | if (segment.contains(token)) 687 | return true; 688 | else if (token <= segment.left) 689 | h = m - 1; 690 | else 691 | l = m + 1; 692 | } 693 | 694 | return false; 695 | } 696 | }; 697 | 698 | // A Ring of tokens 699 | template 700 | struct Ring 701 | { 702 | using token_t = T; 703 | using segment_t = ring_segment; 704 | 705 | const T *const tokens; 706 | const uint32_t cnt; 707 | 708 | constexpr Ring(const T *const v, const uint32_t n) 709 | : tokens{v}, cnt{n} 710 | { 711 | } 712 | 713 | constexpr Ring(const std::vector &v) 714 | : Ring{v.data(), v.size()} 715 | { 716 | } 717 | 718 | constexpr auto size() const noexcept 719 | { 720 | return cnt; 721 | } 722 | 723 | uint32_t index_of(const T token) const noexcept 724 | { 725 | for (int32_t h = cnt - 1, l{0}; l <= h;) 726 | { 727 | const auto m = (l & h) + ((l ^ h) >> 1); 728 | const auto v = tokens[m]; 729 | const auto r = TrivialCmp(token, v); 730 | 731 | if (!r) 732 | return m; 733 | else if (r < 0) 734 | h = m - 1; 735 | else 736 | l = m + 1; 737 | } 738 | 739 | return UINT32_MAX; 740 | } 741 | 742 | inline bool is_set(const T token) const noexcept 743 | { 744 | return index_of(token) != UINT32_MAX; 745 | } 746 | 747 | inline uint32_t search(const T token) const noexcept 748 | { 749 | return ConsistentHashing::search(tokens, cnt, token); 750 | } 751 | 752 | // In a distributed systems, you should map the token to a node (or the segment index returned by this method) 753 | inline uint32_t index_owner_of(const T token) const noexcept 754 | { 755 | // modulo is not cheap, and comparisons are much cheaper, but branchless is nice 756 | return search(token) % cnt; 757 | } 758 | 759 | inline auto token_owner_of(const T token) const noexcept 760 | { 761 | return tokens[index_owner_of(token)]; 762 | } 763 | 764 | constexpr const T 
&token_predecessor_by_index(const uint32_t idx) const noexcept 765 | { 766 | return tokens[(idx + (cnt - 1)) % cnt]; 767 | } 768 | 769 | constexpr const T &token_predecessor(const T token) const noexcept 770 | { 771 | return token_predecessor_by_index(index_of(token)); 772 | } 773 | 774 | constexpr const T &token_successor_by_index(const uint32_t idx) const noexcept 775 | { 776 | return tokens[(idx + 1) % cnt]; 777 | } 778 | 779 | constexpr const T &token_successor(const T token) const noexcept 780 | { 781 | return token_successor_by_index(index_of(token)); 782 | } 783 | 784 | constexpr auto index_segment(const uint32_t idx) const noexcept 785 | { 786 | return ring_segment(tokens[(idx + (cnt - 1)) % cnt], tokens[idx]); 787 | } 788 | 789 | // segment in the ring that owns this token 790 | // based on the (prev segment.token, this segment.token] ownership rule 791 | constexpr auto token_segment(const T token) const noexcept 792 | { 793 | return index_segment(index_of(token)); 794 | } 795 | 796 | // see also sort_and_deoverlap() 797 | void segments(std::vector> *const res) const 798 | { 799 | if (cnt) 800 | { 801 | res->reserve(cnt + 2); 802 | for (uint32_t i{1}; i != cnt; ++i) 803 | res->push_back({tokens[i - 1], tokens[i]}); 804 | res->push_back({tokens[cnt - 1], tokens[0]}); 805 | } 806 | } 807 | 808 | auto segments() const 809 | { 810 | std::vector> res; 811 | 812 | segments(&res); 813 | return res; 814 | } 815 | 816 | auto tokens_segments(const std::vector &t) const 817 | { 818 | std::vector res; 819 | 820 | res.reserve(t.size()); 821 | for (const auto token : t) 822 | { 823 | const auto idx = index_owner_of(token); 824 | 825 | res.push_back({token_predecessor_by_index(idx), token}); 826 | } 827 | 828 | std::sort(res.begin(), res.end(), [](const auto &a, const auto &b) { return a.left < b.left; }); 829 | return res; 830 | } 831 | 832 | // Assuming a node is a replica for tokens in segments `current`, and then it assumes ownership of a different 833 | // set of 
segments, `updated` 834 | // 835 | // This handy utility method will generate a pair of segment lists: 836 | // 1. The first is the segments the node will need to *fetch* from other nodes in the ring, because it will now also be responsible 837 | // for those segments, but it does not have the data, based on the currently owned segments. 838 | // 2. The second is the segments the node will need to *stream* to other nodes in the ring, because it will no longer hold data for them. 839 | // 840 | // Obviously, if a node is just introduced to a ring (i.e. it has only updated and no current segments), it should 841 | // just fetch data for all of its updated segments. Conversely, if the node is exiting the ring, it should 842 | // consider streaming all the data it has to other nodes if needed, and not fetch any data to itself. 843 | // 844 | // Make sure that current and updated are in order, e.g. std::sort(start, end, [](const auto &a, const auto &b) { return a.left < b.left; }); 845 | // 846 | // Because the output will be an array of segments (_not_ tokens), you will need to determine the segments of the ring that intersect it 847 | // in order to figure out which nodes have which parts of the segments. 
848 | // 849 | // This is a fairly expensive method (although it should be easy to optimize it if necessary), but given how rarely it should be used, that's not a real concern 850 | // 851 | // Example: current segment [10, 20), updated segment [10, 25) 852 | // Example: current segment [10, 20), updated segment [8, 30) 853 | static auto compute_segments_ownership_updates(const std::vector<segment_t> &currentSegmentsInput, const std::vector<segment_t> &updatedSegmentsInput) 854 | { 855 | std::vector<segment_t> toFetch, toStream, current, updated, toFetchFinal, toStreamFinal; 856 | segment_t segmentsList[2]; 857 | 858 | // We need to work on normalized lists of segments 859 | current = currentSegmentsInput; 860 | ring_segment<T>::sort_and_deoverlap(&current); 861 | 862 | updated = updatedSegmentsInput; 863 | ring_segment<T>::sort_and_deoverlap(&updated); 864 | 865 | for (const auto curSegment : current) 866 | { 867 | const auto n = toStream.size(); 868 | 869 | for (const auto updatedSegment : updated) 870 | { 871 | if (curSegment.intersects(updatedSegment)) 872 | toStream.insert(toStream.end(), segmentsList, segmentsList + updatedSegment.difference(curSegment, segmentsList)); 873 | } 874 | 875 | if (toStream.size() == n) 876 | { 877 | // no intersection; accept whole segment 878 | toStream.push_back(curSegment); 879 | } 880 | } 881 | 882 | for (const auto updatedSegment : updated) 883 | { 884 | const auto n = toFetch.size(); 885 | 886 | for (const auto curSegment : current) 887 | { 888 | if (updatedSegment.intersects(curSegment)) 889 | toFetch.insert(toFetch.end(), segmentsList, segmentsList + curSegment.difference(updatedSegment, segmentsList)); 890 | } 891 | 892 | if (toFetch.size() == n) 893 | { 894 | // no intersection; accept whole segment 895 | toFetch.push_back(updatedSegment); 896 | } 897 | } 898 | 899 | // normalize output 900 | ring_segment<T>::sort_and_deoverlap(&toFetch); 901 | ring_segment<T>::sort_and_deoverlap(&toStream); 902 | 903 | // mask segments: 904 | // from segments to fetch, mask currently owned 
segments 905 | // from segments to stream, mask segments we will own (updated segments) 906 | ring_segment<T>::mask_segments(toFetch, current, &toFetchFinal); 907 | ring_segment<T>::mask_segments(toStream, updated, &toStreamFinal); 908 | 909 | return std::make_pair(toFetchFinal, toStreamFinal); 910 | } 911 | 912 | // When a node acquires ring tokens (joins a cluster), it only disrupts the segments its token(s) fall into 913 | // Assuming a ring of tokens: (10, 100, 150, 180, 200) 914 | // and a node joins a cluster, and acquires token 120 915 | // then it will only affect requests for (100, 120] 916 | // so it will need to fetch content for (100, 120] from somewhere. Where? Well, from whichever node owned (100, 150] 917 | // which is just the successor node, which we can find using index_owner_of() 918 | // This is a simple replication strategy implementation; we'll just walk the ring clockwise and collect nodes that own 919 | // the tokens, skipping already collected nodes 920 | // 921 | // EXAMPLE: This is an illustrative example; you shouldn't really use this in production as is 922 | template <typename L> 923 | auto token_replicas_basic(const token_t token, const uint8_t replicationFactor, L &&endpoint_token) const 924 | { 925 | using node_t = typename std::result_of<L(uint32_t)>::type; 926 | std::vector<node_t> nodes; 927 | const auto base = index_owner_of(token); 928 | auto idx = base; 929 | 930 | do 931 | { 932 | const auto node = endpoint_token(idx); 933 | 934 | if (std::find(nodes.begin(), nodes.end(), node) == nodes.end()) 935 | { 936 | nodes.push_back(node); 937 | if (nodes.size() == replicationFactor) 938 | break; 939 | } 940 | 941 | idx = (idx + 1) % size(); 942 | } while (idx != base); 943 | 944 | return nodes; 945 | } 946 | 947 | // This generates the lists of tokens and matching nodes that own them based on a new ownership state that results 948 | // from applying the changes in futureNodesTokens 949 | // Specifically, in the resulting topology, current nodes' tokens are replaced with their updated set 
in futureNodesTokens 950 | template <typename node_t> 951 | std::pair<std::vector<T>, std::vector<node_t>> new_topology(const node_t *const ringTokensNodes, 952 | const std::unordered_map<node_t, std::vector<T>> &futureNodesTokens) const 953 | { 954 | std::vector<T> transientRingTokens; 955 | std::vector<node_t> transientRingTokensNodes; 956 | std::unordered_map<T, node_t> map; 957 | 958 | for (uint32_t i{0}; i != cnt; ++i) 959 | { 960 | const auto token = tokens[i]; 961 | 962 | if (futureNodesTokens.find(ringTokensNodes[i]) == futureNodesTokens.end()) 963 | { 964 | transientRingTokens.push_back(tokens[i]); 965 | map.insert({tokens[i], ringTokensNodes[i]}); 966 | } 967 | } 968 | 969 | for (const auto &it : futureNodesTokens) 970 | { 971 | const auto node = it.first; 972 | 973 | transientRingTokens.insert(transientRingTokens.end(), it.second.data(), it.second.data() + it.second.size()); 974 | for (const auto token : it.second) 975 | map.insert({token, node}); 976 | } 977 | 978 | std::sort(transientRingTokens.begin(), transientRingTokens.end()); 979 | 980 | // The associated nodes for each token in the transient ring 981 | transientRingTokensNodes.reserve(transientRingTokens.size()); 982 | for (const auto token : transientRingTokens) 983 | transientRingTokensNodes.push_back(map[token]); 984 | 985 | return {std::move(transientRingTokens), std::move(transientRingTokensNodes)}; 986 | } 987 | 988 | template <typename node_t, typename L> 989 | static node_t *filter_by_distance(node_t *const nodes, const node_t *const end, L &&l) 990 | { 991 | using dist_t = typename std::result_of<L(node_t)>::type; 992 | dist_t min; 993 | uint32_t out{0}; 994 | 995 | for (auto it = nodes; it != end; ++it) 996 | { 997 | if (!out) 998 | { 999 | min = l(*it); 1000 | nodes[out++] = *it; 1001 | } 1002 | else if (const auto d = l(*it); d == min) 1003 | nodes[out++] = *it; 1004 | else if (d < min) 1005 | { 1006 | min = d; 1007 | nodes[0] = *it; 1008 | out = 1; 1009 | } 1010 | } 1011 | 1012 | return nodes + out; 1013 | } 1014 | 1015 | 1016 | // Builds a ring transition plan that is to be committed in order to transition to 
a new ring state. 1017 | // 1018 | // Whenever one or more nodes alter the ring topology (when joining a cluster, leaving a cluster, or acquiring a different set of tokens), we need to 1019 | // account for that change, by copying data to nodes that will serve segments they were not already serving (thus, they don't have the data for that ring space) 1020 | // and by copying data to nodes that will now serve segments as a result of one or more nodes dropping segments they used to serve. This is necessary in order to 1021 | // support replication semantics. 1022 | // 1023 | // You should initiate a transition PLAN, and when it is COMMITTED, create the final ring topology using 1024 | // new_topology() like it's used here for the transient ring, and switch to it, by advertising the new tokens for all tokens in the ring. 1025 | // 1026 | // GUIDELINES 1027 | // - There can be only one active transition in progress. If you allow for concurrent transitions, you will almost certainly end up with 1028 | // invalid rings that likely contain missing data. You can queue new transition plans to be executed after the current plan is complete. See Riak 1029 | // - For existing nodes participating in the transition: They should not be stopped or otherwise be treated in any special way. 1030 | // - For nodes that are to join the cluster (i.e. are not already participating in the ring), you should wait until the transition has completed successfully, 1031 | // and then initialize them with the tokens you used for them in the transition. 1032 | // - For nodes that are to leave the cluster (i.e. nodes in the current cluster, but not in the cluster topology after the transition), you should 1033 | // wait until the transition has completed successfully, and then stop them, and make sure you won't start them again with the same tokens. 
1034 | // 1035 | // With this method, the only tricky operation becomes the coordination required for switching to the new topology after the 1036 | // streaming required for the transition is complete. You will need to (re)start nodes using specific tokens, etc. 1037 | // 1038 | // OPTIMIZATION OPPORTUNITIES 1039 | // - You should use filter_by_distance() to filter sources, or a similar function so that you will always select among the closest (in terms of network hops) nodes to 1040 | // the ring target node for streaming, in order to minimize streaming time. Use a best-effort strategy in order to minimize data motion. 1041 | // - You should try to schedule the streaming operations fairly among the involved nodes. If you over-load a node and under-load the rest, or vice versa, the time 1042 | // and effort (cost) will be much higher. 1043 | // 1044 | // EXAMPLES 1045 | // - If you'd like to add 5 new nodes to your cluster, you can pick appropriate tokens (functionality for selecting tokens from the ring based on current distribution will be 1046 | // implemented later) for those new nodes, initiate a transition that involves them and their new tokens, and when you are done streaming, you should start those 5 new nodes, and each 1047 | // should be configured to use the tokens you selected for transition(). 1048 | // - If you'd like to decommission 1 node, you just need to initiate a new transition that involves that node, where the list of tokens it will own is empty. Once the streaming is 1049 | // complete, you should stop the node. 1050 | // 1051 | // 1052 | // ## Riak and Cassandra 1053 | // Effectively, this creates a transition plan. 1054 | // Once you have successfully executed the transition plan, you should *commit* the changes. 
1055 | // According to @justinsheehy and @tsantero, Riak supports on-demand resizing by arbitrary factors, and you can 1056 | // issue multiple resize operations (they are queued and are executed whenever the last one commits; 1057 | // it's possible to cancel them). 1058 | // It makes a best effort to minimize movement. If a node failed during movement, HH (hinted handoff) would 1059 | // kick in. 1060 | // Like Cassandra, it would keep track of 'pending segments' and the coordinator would push writes 1061 | // to them if needed. See Cassandra's method for calculating "pending ranges" and how 1062 | // that is being used in the proxy where for each write, the pending ranges list is consulted 1063 | // and if any pending segments match the token, the targets of those segments also receive the update 1064 | // See also: https://billo.gitbooks.io/lfe-little-riak-book/content/ch4/5.html 1065 | template <typename node_t, typename L> 1066 | auto transition( 1067 | const node_t *const ringTokensNodes, 1068 | const std::unordered_map<node_t, std::vector<T>> &futureNodesTokens, 1069 | L &&replicas_for) const 1070 | { 1071 | static constexpr size_t maxReplicasCnt{16}; 1072 | const auto segments_of = [&replicas_for](const Ring &ring, const node_t *const ringTokensNodes, const node_t node, std::vector<segment_t> *const res) { 1073 | node_t replicas[maxReplicasCnt]; 1074 | 1075 | for (uint32_t i{0}; i != ring.cnt; ++i) 1076 | { 1077 | const auto token = ring.tokens[i]; 1078 | const auto replicasCnt = replicas_for(ring, ringTokensNodes, token, replicas); 1079 | 1080 | if (std::find(replicas, replicas + replicasCnt, node) != replicas + replicasCnt) 1081 | { 1082 | // We need the distinct replicas 1083 | res->push_back({ring.token_predecessor_by_index(i), token}); 1084 | } 1085 | } 1086 | 1087 | std::sort(res->begin(), res->end(), [](const auto &a, const auto &b) { return a.left < b.left; }); 1088 | }; 1089 | 1090 | const auto transientRingTopology = new_topology(ringTokensNodes, futureNodesTokens); 1091 | const auto &transientRingTokens = 
transientRingTopology.first; 1092 | const auto &transientRingTokensNodes = transientRingTopology.second; 1093 | const Ring transientRing(transientRingTokens.data(), transientRingTokens.size()); 1094 | const auto transientRingSegments = transientRing.segments(); 1095 | const auto currentRingSegments = segments(); 1096 | std::vector<segment_t> outSegments; 1097 | segment_t segmentsList[2]; 1098 | // The plan to execute to transition to the new ring 1099 | // A list of: 1100 | // (ring segment, target node, replicas) 1101 | std::vector<std::pair<segment_t, std::pair<node_t, std::vector<node_t>>>> plan; 1102 | std::unordered_map<node_t, std::vector<segment_t>> curRingServeMap; 1103 | std::vector<node_t> replicas; 1104 | node_t tokenReplicas[maxReplicasCnt], futureReplicas[maxReplicasCnt]; 1105 | std::vector<segment_t> replicaForSegmentsFuture, replicaForSegmentsNow; 1106 | 1107 | // Build (node => [segments]) map for the current ring 1108 | { 1109 | std::vector<std::pair<node_t, segment_t>> v; 1110 | 1111 | for (const auto segment : currentRingSegments) 1112 | { 1113 | const auto n = replicas_for(*this, ringTokensNodes, segment.right, tokenReplicas); 1114 | 1115 | for (uint8_t i{0}; i != n; ++i) 1116 | v.push_back({tokenReplicas[i], segment}); 1117 | } 1118 | 1119 | std::sort(v.begin(), v.end(), [](const auto &a, const auto &b) { return a.first < b.first; }); 1120 | 1121 | const auto n = v.size(); 1122 | const auto all = v.data(); 1123 | 1124 | for (uint32_t i{0}; i != n;) 1125 | { 1126 | const auto node = v[i].first; 1127 | std::vector<segment_t> list; 1128 | 1129 | do 1130 | { 1131 | list.push_back(v[i].second); 1132 | } while (++i != n && v[i].first == node); 1133 | 1134 | curRingServeMap.insert({node, std::move(list)}); 1135 | } 1136 | } 1137 | 1138 | for (const auto &it : futureNodesTokens) 1139 | { 1140 | const auto node = it.first; 1141 | 1142 | replicaForSegmentsFuture.clear(); 1143 | replicaForSegmentsNow.clear(); 1144 | 1145 | segments_of(transientRing, transientRingTokensNodes.data(), node, &replicaForSegmentsFuture); 1146 | segments_of(*this, ringTokensNodes, node, &replicaForSegmentsNow); 1147 | 1148 | 1149 | // 
Compute what needs to be delivered to _this_ node 1150 | for (const auto futureSegment : replicaForSegmentsFuture) 1151 | { 1152 | // Mask segments this node serves already, no need to acquire any content we already have 1153 | const auto futureSegmentWraps = futureSegment.wraps(); 1154 | 1155 | outSegments.clear(); 1156 | segment_t::mask_segments(&futureSegment, (&futureSegment) + 1, replicaForSegmentsNow, &outSegments); 1157 | 1158 | if (outSegments.empty()) 1159 | { 1160 | // No need to acquire extra data 1161 | continue; 1162 | } 1163 | 1164 | // TODO: use binary search to locate the next segment in currentRingSegments 1165 | // No need for the linear scan: https://github.com/phaistos-networks/ConsistentHashing/issues/1 1166 | 1167 | 1168 | // OK, so who's going to provide content for those segments, based on the current ring? 1169 | for (const auto it : currentRingSegments) 1170 | { 1171 | if (it.right <= futureSegment.left) 1172 | { 1173 | // can safely skip it 1174 | continue; 1175 | } 1176 | else if (it.left > futureSegment.right && !futureSegmentWraps && !it.wraps()) 1177 | { 1178 | // can safely stop here 1179 | break; 1180 | } 1181 | 1182 | const auto cnt = it.intersection(futureSegment, segmentsList); 1183 | 1184 | if (!cnt) 1185 | continue; 1186 | 1187 | const std::vector<node_t> replicas(tokenReplicas, tokenReplicas + replicas_for(*this, ringTokensNodes, it.right, tokenReplicas)); 1188 | 1189 | for (uint8_t i{0}; i != cnt; ++i) 1190 | plan.push_back({segmentsList[i], {node, replicas}}); // from replicas, to node, that segment 1191 | } 1192 | } 1193 | 1194 | // Whenever a node gives up (part of) a ring segment, we need to shift data around in order 1195 | // to account for the fact that the replication factor for the data that spans that segment will drop by 1. 
1196 | for (const auto currentSegment : replicaForSegmentsNow) 1197 | { 1198 | const auto token = currentSegment.right; 1199 | const auto currentSegmentWraps = currentSegment.wraps(); 1200 | bool haveSources{false}; 1201 | 1202 | // TODO: use binary search to locate the next segment in transientRingSegments 1203 | // No need for linear scan: https://github.com/phaistos-networks/ConsistentHashing/issues/1 1204 | 1205 | for (const auto futureSegment : transientRingSegments) 1206 | { 1207 | if (futureSegment.right <= currentSegment.left) 1208 | { 1209 | // can safely skip it 1210 | continue; 1211 | } 1212 | else if (futureSegment.left > currentSegment.right && !currentSegmentWraps && !futureSegment.wraps()) 1213 | { 1214 | // can safely stop here 1215 | break; 1216 | } 1217 | 1218 | const auto cnt = futureSegment.intersection(currentSegment, segmentsList); 1219 | 1220 | if (!cnt) 1221 | continue; 1222 | 1223 | const auto futureReplicasCnt = std::remove_if(futureReplicas, 1224 | futureReplicas + replicas_for(transientRing, transientRingTokensNodes.data(), futureSegment.right, futureReplicas), 1225 | [node, &futureNodesTokens](const node_t target) { 1226 | if (target == node) 1227 | { 1228 | // exclude self 1229 | return true; 1230 | } 1231 | else if (futureNodesTokens.find(target) != futureNodesTokens.end()) 1232 | { 1233 | // exclude nodes that are also involved in this process, otherwise we may output the same value twice in plan 1234 | return true; 1235 | } 1236 | else 1237 | { 1238 | return false; 1239 | } 1240 | 1241 | }) - 1242 | futureReplicas; 1243 | 1244 | for (uint8_t i{0}; i != cnt; ++i) 1245 | { 1246 | const auto subSegment = segmentsList[i]; // intersection 1247 | 1248 | for (uint32_t ri{0}; ri != futureReplicasCnt; ++ri) 1249 | { 1250 | const auto target = futureReplicas[ri]; 1251 | 1252 | if (!haveSources) 1253 | { 1254 | // lazy generation of the sources(replicas) for this segment 1255 | // replicas should include this node 1256 | replicas.clear(); 1257 
| replicas.insert(replicas.end(), tokenReplicas, tokenReplicas + replicas_for(*this, ringTokensNodes, token, tokenReplicas)); 1258 | haveSources = true; 1259 | } 1260 | 1261 | if (auto s = curRingServeMap.find(target); s != curRingServeMap.end()) 1262 | { 1263 | // this node serves 1+ segments already in the current segment 1264 | // mask subSegment with them; we don't want to send data to nodes if they already have any of it 1265 | outSegments.clear(); 1266 | segment_t::mask_segments(&subSegment, (&subSegment) + 1, s->second, &outSegments); 1267 | 1268 | for (const auto s : outSegments) 1269 | plan.push_back({s, {target, replicas}}); 1270 | } 1271 | else 1272 | { 1273 | // this target does not currently serve any segments in the current segment 1274 | plan.push_back({subSegment, {target, replicas}}); 1275 | } 1276 | } 1277 | } 1278 | } 1279 | } 1280 | } 1281 | 1282 | return plan; 1283 | } 1284 | }; 1285 | } 1286 | 1287 | #ifdef HAVE_SWITCH 1288 | template <typename T> 1289 | static inline void PrintImpl(Buffer &b, const ConsistentHashing::ring_segment<T> &segment) 1290 | { 1291 | b.append("(", segment.left, ", ", segment.right, "]"); 1292 | } 1293 | 1294 | template <typename T> 1295 | static inline void PrintImpl(Buffer &b, const ConsistentHashing::Ring<T> &ring) 1296 | { 1297 | b.append(_S32("(( ")); 1298 | if (const auto cnt = ring.cnt) 1299 | { 1300 | for (uint32_t i{1}; i != cnt; ++i) 1301 | b.append(ConsistentHashing::ring_segment<T>(ring.tokens[i - 1], ring.tokens[i]), ","); 1302 | 1303 | b.append(ConsistentHashing::ring_segment<T>(ring.tokens[cnt - 1], ring.tokens[0])); 1304 | } 1305 | b.append(_S32(" ))")); 1306 | } 1307 | #endif 1308 | --------------------------------------------------------------------------------
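The ownership rule and the basic replica walk described in the header's comments can be illustrated with a small standalone sketch. It does not use `consistent_hashing.h`: `owner_index()` and `replicas_basic()` are hypothetical helpers written only for this example, tokens are assumed to be plain `uint64_t` values in a sorted vector, nodes are assumed to be `char` labels, and `std::lower_bound` stands in for the header's own `search()`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of consistent_hashing.h): index of the ring token that
// owns `token` under the (predecessor, token] rule. `tokens` must be sorted and non-empty.
inline uint32_t owner_index(const std::vector<uint64_t> &tokens, const uint64_t token)
{
        // First ring token >= token; if there is none, wrap around to the first token.
        const auto it = std::lower_bound(tokens.begin(), tokens.end(), token);

        return it == tokens.end() ? 0 : static_cast<uint32_t>(it - tokens.begin());
}

// Hypothetical helper: walk the ring clockwise from the owner and collect up to
// `replicationFactor` distinct nodes. `nodes[i]` is the node that owns `tokens[i]`.
inline std::vector<char> replicas_basic(const std::vector<uint64_t> &tokens, const std::vector<char> &nodes,
                                        const uint64_t token, const std::size_t replicationFactor)
{
        std::vector<char> res;

        if (replicationFactor == 0)
                return res;

        const auto base = owner_index(tokens, token);
        auto idx = base;

        do
        {
                const auto node = nodes[idx];

                // Skip nodes we have already collected (a node may own several tokens).
                if (std::find(res.begin(), res.end(), node) == res.end())
                {
                        res.push_back(node);
                        if (res.size() == replicationFactor)
                                break;
                }

                idx = static_cast<uint32_t>((idx + 1) % tokens.size());
        } while (idx != base);

        return res;
}
```

With the ring `(10, 100, 150, 180, 200)` from the comments above, token 120 maps to index 2 (token 150), token 100 maps to index 1, and token 250 wraps around to index 0. If the tokens are owned by nodes `A, B, A, C, B` in that order, a replication factor of 2 for token 120 yields `A, C`.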