├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
├── blipsort_speed.png
└── sort.h

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

pdqsort.h

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Ellie Moore

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# *Branchless Lomuto in Pattern-Defeating Quicksort (Blipsort)*

A highly optimized, memory-conscious Introsort variant that draws from PDQsort, Java's Dual-Pivot Quicksort, BlockQuicksort, and Orson Peters' and Lukas Bergdoll's branchless Lomuto partitioning.

## Iterative Version

[Here](https://github.com/RedBedHed/blipsort_iterative)

## Speed

![Speed](https://github.com/RedBedHed/BLPDQsort/blob/main/blipsort_speed.png)

##### *clang 16, -O3*

## Complexity

| Best | Average | Worst   | Memory |
|------|---------|---------|--------|
| n    | n log n | n log n | log n  |

## Visualization

https://github.com/RedBedHed/blipsort/assets/58797872/00986779-05a3-430a-bc67-11eb45a54756

###### *Blipsort, ported into the Sound of Sorting program by Timo Bingmann*

## Techniques

### Branchless Lomuto
The decades-old partitioning algorithm recently made a resurgence when researchers discovered ways to remove the inner branch. Orson Peters' and Lukas Bergdoll's [method](https://orlp.net/blog/branchless-lomuto-partitioning/) is the fastest yet. It maintains a gap in the data so that each element can be placed with two moves per iteration, rather than with a three-move swap.

For arithmetic and pointer types, Blipsort employs branchless Lomuto partitioning. For other, larger types, Blipsort uses branchful or block Hoare partitioning. Branchful Hoare partitioning is slower than fulcrum or block partitioning, but it uses no extra offset memory, making it the better fit for memory-constrained embedded systems. Block Hoare partitioning is significantly faster than its branchy counterpart, but it does use extra offset memory, making it the better fit for PCs and larger embedded systems.
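
To illustrate the idea, here is a minimal, self-contained sketch of gap-based partitioning (a hypothetical `gap_lomuto` helper written for this README, not the loop Blipsort actually ships in sort.h, which fuses partitioning with pivot selection and fallback logic). It assumes the pivot value sits at `a[0]` and that elements are cheap to copy:

```c++
#include <cstddef>

// Sketch of gap-based (branchless) Lomuto partitioning. Assumes n >= 1
// and that a[0] holds the pivot p. On return, a[0..l) < p, a[l] == p,
// and a[l+1..n) >= p; the pivot's final index l is returned.
template<typename E>
std::size_t gap_lomuto(E* a, std::size_t n)
{
    const E p = a[0];         // Lift the pivot out; index 0 is now a gap.
    std::size_t l = 0, g = 0; // a[0..l) < p; the gap lives at index g.
    while (g + 1 < n)
    {
        a[g] = a[l];          // Move 1: refill the gap.
        a[l] = a[++g];        // Move 2: park the next element at l.
        l += (a[l] < p);      // Branchless: the comparison feeds an
    }                         // index increment, not a jump.
    a[g] = a[l];              // Close the gap.
    a[l] = p;                 // Drop the pivot into place.
    return l;
}
```

Because the hot loop contains no unpredictable branch, it runs at a nearly constant rate on random data, which is exactly where branchy Lomuto and Hoare partitioning lose the most time to mispredictions.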

###### *Note: Branchy Hoare partitioning is also slower than two- and three-pivot partitioning on random data, although only marginally so.*

While it works wonders for random arrays, branchless Lomuto does struggle with descending data. Blipsort attempts to remedy this by sampling five elements from the array and rotating the interval when all five are strictly descending. However, this approach does not break every descending pattern. When an array contains strictly descending runs at intervals, Lomuto partitioning can slow down quite significantly in comparison to Hoare (in all fairness, Hoare is particularly well-suited for descending patterns).

### Pivot Selectivity
Blipsort carefully selects the pivot from the middle of five sorted candidates. These candidates allow the sort to determine whether the data in the current interval is approximately descending and inform its "partition left" strategy.

### Introspection
Blipsort is introspective, switching to a guaranteed n log(n) sort if its runtime trends toward quadratic. Like PDQsort, Blipsort switches to Heapsort after log(n) "bad" partitions (partitions that are significantly unbalanced).

### Insertion Sort
Blipsort uses Insertion sort on small intervals where asymptotic complexity matters less and instruction overhead matters more. Blipsort employs Java's Pair Insertion sort on every interval except the leftmost. Pair insertion sort inserts two elements at a time and doesn't need to perform a lower bound check, making it slightly faster than normal insertion sort in the context of quicksort.

### Pivot Retention
Similar to PDQsort, if any of the three middlemost candidate pivots is equal to the rightmost element of the partition at left, Blipsort moves equal elements to the left with branchless Lomuto and continues to the right, solving the Dutch national flag problem and yielding linear time on data comprised of equal elements.

### Optimism
Similar to PDQsort, if the partition is "good" (not highly unbalanced) and we have done little work in partitioning, Blipsort switches to insertion sort. If the insertion sort makes more than a constant number of moves, Blipsort bails and resumes quicksort. This allows Blipsort to achieve linear time on already-sorted data.

Work is calculated as:
```c++
bool work = ((skipped_left) + (skipped_right)) < (interval_width / 2);
```

### Breaking Patterns
Like PDQsort, if the partition is bad, Blipsort scrambles some elements to break up patterns. Unlike PDQsort, Blipsort does not introduce completely fresh pivot candidates.

### Rotation
When all of the candidate pivots are strictly descending, it is very likely that the interval is descending as well. Lomuto partitioning slows significantly on descending data. Therefore, Blipsort neglects to sort descending candidates and instead swap-rotates the entire interval before partitioning.

### Custom Comparators
Blipsort allows its user to supply a custom boolean comparator. A comparator is best implemented as a branchless lambda: a lambda can be inlined by an optimizing compiler, while a comparator passed as a function pointer typically cannot.
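
For example, to sort an array of `int` in descending order, pass a branchless lambda to the three-argument overload of `Arrays::blipsort` (any callable taking two elements and returning `bool` works):

```c++
int a[] = { 5, 1, 4, 2, 3 };
Arrays::blipsort(a, 5, [](int x, int y) { return x > y; });
// a is now { 5, 4, 3, 2, 1 }
```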

## Usage

To sort with branchless Lomuto on small types and block Hoare on large types, call blipsort like so:

```c++
Arrays::blipsort(array, size);
```

To sort with branchless Lomuto on small types and branchy Hoare on large types (to conserve memory), call blipsort like so:

```c++
Arrays::blipsort_embed(array, size);
```

## Sources

[Here](https://github.com/orlp/pdqsort)
is the PDQsort algorithm by Orson Peters

[Here](https://orlp.net/blog/branchless-lomuto-partitioning/)
is the branchless Lomuto blog post by Orson Peters and Lukas Bergdoll

[Here](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/DualPivotQuicksort.java)
is Java's Dual Pivot Quicksort

--------------------------------------------------------------------------------
/blipsort_speed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedBedHed/blipsort/03c0adca7c1bd33a751643356a22881a12b9a1cb/blipsort_speed.png
--------------------------------------------------------------------------------
/sort.h:
--------------------------------------------------------------------------------
#pragma once
#ifndef SORT_H
#define SORT_H
#include <cstdint>
#include <cassert>
#include <bit>
#include <functional>
#include <type_traits>

namespace Algo
{ enum : uint32_t
    {
        InsertionThreshold = 88,
        AscendingThreshold = 8,
        LargeDataThreshold = 128,
        BlockSize = 64,
#if __cpp_lib_bitops >= 201907L
        DoubleWordBitCount = 31,
#else
        DeBruijnShiftAmount = 58
#endif
    };

    /**
     * The DeBruijn constant.
     */
#if __cpp_lib_bitops < 201907L
    constexpr uint64_t DeBruijn64 =
        0x03F79D71B4CB0A89L;
#endif

    /**
     * The DeBruijn map from key to integer
     * square index.
     */
#if __cpp_lib_bitops < 201907L
    constexpr uint8_t DeBruijnTableF[] =
    {
        0, 47, 1, 56, 48, 27, 2, 60,
        57, 49, 41, 37, 28, 16, 3, 61,
        54, 58, 35, 52, 50, 42, 21, 44,
        38, 32, 29, 23, 17, 11, 4, 62,
        46, 55, 26, 59, 40, 36, 15, 53,
        34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30, 9, 24,
        13, 18, 8, 12, 7, 6, 5, 63
    };
#endif

    /**
     * Fill trailing bits using prefix fill.
     *
     * @code
     * Example:
     * 10000000 >> 1
     * = 01000000 | 10000000
     * = 11000000 >> 2
     * = 00110000 | 11000000
     * = 11110000 >> 4
     * = 00001111 | 11110000
     * = 11111111
     * @endcode
     * @param x the integer to fill
     */
#if __cpp_lib_bitops < 201907L
    constexpr void parallelPrefixFill
    (
        uint32_t & x
    )
    {
        x |= x >> 1U;
        x |= x >> 2U;
        x |= x >> 4U;
        x |= x >> 8U;
        x |= x >> 16U;
    }
#endif

    /**
     * Calculates the floor of log2.
     *
     * @authors Kim Walisch - source
     * @authors Mark Dickinson - source
     * @authors Ellie Moore
     * @param l the integer to scan
     * @precondition l != 0
     * @return index (0..31) of the most significant one bit
     */
    constexpr int log2
    (
        uint32_t l
    )
    {
        assert(l != 0);
#if __cpp_lib_bitops >= 201907L
        return std::countl_zero(l) ^ DoubleWordBitCount;
#else
        parallelPrefixFill(l);
        return DeBruijnTableF[(int)
            ((l * DeBruijn64) >> DeBruijnShiftAmount)
        ];
#endif
    }

    /**
     * A simple swap method.
     *
     * @tparam E the element type
     * @param i the first element pointer
     * @param j the second element pointer
     */
    template<typename E>
    constexpr void swap
    (
        E *const i,
        E *const j
    )
    {
        E const el = *i; *i = *j; *j = el;
    }

    /**
     * A generic "sift down" method (AKA max-heapify).
     *
     * @tparam E the element type
     * @param a the pointer to the base of the current
     * sub-array
     * @param i the starting index
     * @param size the size of the current sub-array
     * @param cmp the comparator
     */
    template<typename E, class Cmp>
    inline void siftDown
    (
        E* const a,
        const int i,
        const int size,
        const Cmp cmp
    )
    {
        // Store size in
        // a local variable.
        const size_t n = size;

        // Establish non-leaf
        // boundary.
        const size_t o = n >> 1U;

        // Extract the element
        // to sift.
        E z = a[i];

        // Initialize temporary
        // variables.
        size_t x = i, l, r;

        // Consider only non-leaf
        // nodes.
        while(x < o)
        {
            // y is currently the
            // left child element.
            // Note: "l" here stands
            // for "left".
            r = (l = (x << 1U) + 1) + 1;
            E y = a[l];

            // If the right child is
            // within the heap...
            // AND
            // if the right child element
            // is greater than the left
            // child element,
            // THEN
            // assign the right child to
            // y and the right index to l.
            // Note: "l" now stands
            // for "larger".
            if(r < n && cmp(y, a[r]))
                y = a[l = r];

            // If y is less than or
            // equal to the element
            // we are sifting, then
            // we are done.
            if(!cmp(z, y)) break;

            // Move y up to the
            // parent index.
            a[x] = y;

            // Set the parent index to
            // be the index of
            // the largest child.
            x = l;
        }

        // Place the sifted element.
        a[x] = z;
    }

    /**
     * Heap Sort
     *
     * Classical heap sort that sorts the given range
     * in ascending order, building a max heap and
     * continuously sifting/swapping the max element
     * to the previous rightmost index.
     *
     * @author Ellie Moore
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param cmp the comparator
     */
    template<typename E, class Cmp>
    inline void hSort
    (
        E* const low,
        E* const high,
        const Cmp cmp
    )
    {
        E* r = high + 1;
        E* const l = low;

        // Build the heap.
        int x = r - l;
        for(int i =
            (x >> 1U); i >= 0; --i)
            siftDown(l, i, x, cmp);

        // Sort.
        while(l < --r)
        {
            const E z = *l; *l = *r;
            siftDown(l, 0, --x, cmp);
            *r = z;
        }
    }

    /**
     * Insertion Sort
     *
     * Classical ascending insertion sort packaged with a
     * "pairing" optimization to be used in the context of
     * Quicksort.
     *
     * This optimization is used whenever the portion of
     * the array to be sorted is padded on the left by
     * a portion with lesser elements. The fact that all of
     * the elements on the left are automatically less than
     * the elements in the current portion allows us to skip
     * the costly lower boundary check in the nested loops
     * and insert two elements in one go.
     *
     * @authors Josh Bloch - source
     * @authors Jon Bentley - source
     * @authors Orson Peters - source
     * @authors Ellie Moore
     * @tparam NoGuard whether to jump straight into the
     * unguarded pair insertion sort loop
     * @tparam Guard whether to jump straight into the
     * guarded insertion sort loop
     * @tparam Bail are we sorting optimistically?
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param cmp the comparator
     * @param leftmost whether this is the leftmost partition
     */
    template
    <bool NoGuard, bool Guard, bool Bail = true, typename E, class Cmp>
    inline bool iSort
    (
        E *const low,
        E *const high,
        const Cmp cmp,
        const bool leftmost = true
    )
    {
        E* l = low;
        E* r = high;
        int moves = 0;

        // If a template flag forces a
        // variant, jump straight into
        // the guarded (g1) or unguarded
        // pair insertion (g2) loop.
        if constexpr (Guard)
            goto g1;
        if constexpr (NoGuard)
            goto g2;

        if (leftmost)
        {
            g1:

            // Traditional
            // insertion
            // sort.
            for (E *i = l + 1; i <= r; ++i)
            {
                E t = *i, *j = i - 1;
                for (; j >= l && cmp(t, *j); --j)
                    j[1] = *j;
                j[1] = t;

                if constexpr (Bail)
                {
                    // If we have moved too
                    // many elements, abort.
                    moves += (i - 1) - j;
                    if(moves > AscendingThreshold)
                        return false;
                }
            }
        }
        else
        {
            g2:

            // Pair insertion sort.
            // Skip elements that are
            // in ascending order.
            do if (l++ >= r) return true;
            while (!cmp(*l, *(l - 1)));

            // This sort uses the sub
            // array at left to avoid
            // the lower bound check.
            // Assumes that this is not
            // the leftmost partition.
            for (E *i = l; ++l <= r; i = ++l)
            {
                E ex = *i, ey = *l;

                // Make sure that
                // we insert the
                // larger element
                // first.
                if (cmp(ey, ex))
                {
                    ex = ey;
                    ey = *i;
                    if constexpr (Bail)
                        ++moves;
                }

                // Insert the two
                // in one downward
                // motion.
                while (cmp(ey, *--i))
                    i[2] = *i;
                (++i)[1] = ey;
                while (cmp(ex, *--i))
                    i[1] = *i;
                i[1] = ex;

                if constexpr (Bail)
                {
                    // If we have moved too
                    // many elements, abort.
                    moves += (l - 2) - i;
                    if(moves > AscendingThreshold)
                        return false;
                }
            }

            // For odd length arrays,
            // insert the last element.
            E ez = *r;
            while (cmp(ez, *--r))
                r[1] = *r;
            r[1] = ez;
        }
        return true;
    }

    /**
     * Explicit constexpr ternary.
     *
     * @tparam EXP the compile-time condition
     * @tparam E the return type
     * @param a the true value
     * @param b the false value
     */
    template<bool EXP, typename E>
    constexpr E ternary
    (
        E a,
        E b
    )
    {
        if constexpr (EXP) return a;
        else return b;
    }

    /**
     * Scramble a few elements to help
     * break patterns.
     *
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param len the length of the interval
     */
    template<typename E>
    constexpr void scramble
    (
        E* const low,
        E* const high,
        const size_t len
    )
    {
        if(len >= InsertionThreshold)
        {
            const int _4th = len >> 2U;
            swap(low, low + _4th);
            swap(high, high - _4th);
            if(len > LargeDataThreshold)
            {
                swap(low + 1, low + (_4th + 1));
                swap(low + 2, low + (_4th + 2));
                swap(high - 2, high - (_4th + 2));
                swap(high - 1, high - (_4th + 1));
            }
        }
    }

    /**
     * Aligns the given pointer on a 64-byte
     * cacheline.
     *
     * @tparam E the element type
     * @param p pointer to memory to align
     */
    template<typename E>
    constexpr E* align
    (
        E* p
    )
    {
        return reinterpret_cast<E*>((
            reinterpret_cast<uintptr_t>(p) + (BlockSize - 1)
        ) & -uintptr_t(BlockSize));
    }

    /**
     * Blipsort
     *
     * Branchless Lomuto
     *
     * The decades-old partitioning algorithm recently
     * made a resurgence when researchers discovered
     * ways to remove the inner branch. Lukas Bergdoll
     * and Orson Peters' method is the fastest yet. It
     * maintains a gap in the data so that elements can
     * be moved twice per iteration rather than swapped
     * (three moves). For arithmetic and pointer types,
     * Blipsort employs branchless Lomuto partitioning.
     * For other, larger types, Blipsort uses branchless
     * (block) or branchful Hoare partitioning.
     *
     * Pivot Selectivity
     *
     * Blipsort carefully selects the pivot from the
     * middle of five sorted candidates. These
     * candidates allow the sort to determine whether
     * the data in the current interval is approximately
     * descending and inform its "partition left" strategy.
     *
     * Insertion Sort
     *
     * Blipsort uses Insertion sort on small intervals
     * where asymptotic complexity matters less and
     * instruction overhead matters more. Blipsort
     * employs Java's Pair Insertion sort on every
     * interval except the leftmost. Pair insertion
     * sort inserts two elements at a time and doesn't
     * need to perform a lower bound check, making it
     * slightly faster than normal insertion sort in
     * the context of quicksort.
     *
     * Pivot Retention
     *
     * Similar to PDQsort, if any of the three middlemost
     * candidate pivots is equal to the rightmost element
     * of the partition at left, Blipsort moves equal
     * elements to the left with branchless Lomuto and
     * continues to the right, solving the Dutch national
     * flag problem and yielding linear time on data
     * comprised of equal elements.
     *
     * Optimism
     *
     * Similar to PDQsort, if the partition is "good"
     * (not highly unbalanced), Blipsort switches to
     * insertion sort. If the insertion sort makes more
     * than a constant number of moves, Blipsort bails
     * and resumes quicksort. This allows Blipsort to
     * achieve linear time on already-sorted data.
     *
     * Breaking Patterns
     *
     * Like PDQsort, if the partition is bad, Blipsort
     * scrambles some elements to break up patterns.
     *
     * Rotation
     *
     * When all of the candidate pivots are strictly
     * descending, it is very likely that the interval
     * is descending as well. Lomuto partitioning slows
     * significantly on descending data. Therefore,
     * Blipsort neglects to sort descending candidates
     * and instead swap-rotates the entire interval
     * before partitioning.
     *
     * Custom Comparators
     *
     * Blipsort allows its user to implement a custom
     * boolean comparator. A comparator is best
     * implemented as a branchless lambda. A lambda
     * can be inlined by an optimizing compiler, while
     * a comparator passed as a function pointer
     * typically cannot.
     *
     * @authors Josh Bloch - source
     * @authors Jon Bentley - source
     * @authors Orson Peters - source
     * @authors Lukas Bergdoll - source
     * @authors Stefan Edelkamp - source
     * @authors Armin Weiß - source
     * @authors Ellie Moore
     * @tparam E the element type
     * @tparam Root whether this is the sort root
     * @tparam Expense whether the element type is expensive to move
     * @tparam Block whether to use block partitioning on large types
     * @param leftmost whether this is the leftmost partition
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param height the distance of the current sort
     * tree from the initial height of log2(n)
     * @param cmp the comparator
     */
    template
    <typename E, bool Root, bool Expense, bool Block, class Cmp>
    inline void qSort
    (
        E * low,
        E * high,
        int height,
        const Cmp cmp,
        bool leftmost = true
    )
    {
        // Tail call loop.
        for(size_t x = high - low;;)
        {
            // If this is not the
            // root node, sort the
            // interval by insertion
            // sort if small enough.
            if constexpr (!Root)
                if (x < InsertionThreshold)
                {
                    // If we are in the Root,
                    // we won't be insertion
                    // sorting until we
                    // iterate on the rightmost
                    // part. However, we are
                    // not in the root here, so
                    // we need to be careful
                    // to use guarded insertion
                    // sort if this is the
                    // leftmost partition.
                    iSort<0,0,0>
                        (low, high, cmp, leftmost);
                    return;
                }

            // If this is not the root node,
            // heap sort when the runtime
            // trends towards quadratic.
            if constexpr (!Root)
                if(height < 0)
                    return hSort(low, high, cmp);

            // Find an inexpensive
            // approximation of a third of
            // the interval.
            const size_t y = x >> 2U,
                _3rd = y + (y >> 1U),
                _6th = _3rd >> 1U;

            // Find an approximate
            // midpoint of the interval.
            E *const mid = low + (x >> 1U);

            // Assign tercile indices
            // to candidate pivots.
            E *const sl = low + _3rd;
            E *const sr = high - _3rd;

            // Assign outer indices
            // to candidate pivots.
            E * cl = low + _6th;
            E * cr = high - _6th;

            // If the candidates aren't
            // descending...
            // Insertion sort all five
            // candidate pivots in-place.
            if((!cmp(*cl, *low)) ||
               (!cmp(*sl, *cl)) ||
               (!cmp(*mid, *sl)) ||
               (!cmp(*sr, *mid)) ||
               (!cmp(*cr, *sr)) ||
               (!cmp(*high, *cr)))
            {

                if(cmp(*low, *cl))
                    cl = low;
                if(cmp(*cr, *high))
                    cr = high;

                if (cmp(*sl, *cl))
                {
                    E e = *sl;
                    *sl = *cl;
                    *cl = e;
                }

                if (cmp(*mid, *sl))
                {
                    E e = *mid;
                    *mid = *sl;
                    *sl = e;
                    if (cmp(e, *cl))
                    {
                        *sl = *cl;
                        *cl = e;
                    }
                }

                if (cmp(*sr, *mid))
                {
                    E e = *sr;
                    *sr = *mid;
                    *mid = e;
                    if (cmp(e, *sl))
                    {
                        *mid = *sl;
                        *sl = e;
                        if (cmp(e, *cl))
                        {
                            *sl = *cl;
                            *cl = e;
                        }
                    }
                }

                if (cmp(*cr, *sr))
                {
                    E e = *cr;
                    *cr = *sr;
                    *sr = e;
                    if (cmp(e, *mid))
                    {
                        *sr = *mid;
                        *mid = e;
                        if (cmp(e, *sl))
                        {
                            *mid = *sl;
                            *sl = e;
                            if (cmp(e, *cl))
                            {
                                *sl = *cl;
                                *cl = e;
                            }
                        }
                    }
                }
            }

            // If the candidates are
            // descending, then the
            // interval is likely to
            // be descending somewhat.
            // Rotate the entire interval
            // around the midpoint.
            // Don't worry about the
            // even size case. One
            // out-of-order element
            // is no big deal.
            else
            {
                E* u = low;
                E* q = high;
                while(u < mid)
                {
                    E e = *u;
                    *u++ = *q;
                    *q-- = e;
                }
            }

            // If any middle candidate
            // pivot is equal to the
            // rightmost element of the
            // partition to the left,
            // swap pivot duplicates to
            // the side and sort the
            // remainder. This is an
            // alternative to Dutch flag
            // partitioning.
            if(!leftmost)
            {
                // Check the pivot to
                // the left.
                E h = *(low - 1);
                if(!cmp(h, *sl) ||
                   !cmp(h, *mid) ||
                   !cmp(h, *sr))
                {
                    E* l = low - 1,
                     * g = high + 1;

                    // Skip over data
                    // in place.
                    while(cmp(h, *--g));

                    if(g == high)
                        while(!cmp(h, *++l) && l < g);
                    else
                        while(!cmp(h, *++l));

                    // If we are sorting
                    // non-arithmetic types,
                    // use Hoare for fewer
                    // moves.
                    if constexpr (Expense)
                    {
                        /**
                         * Partition left by branchful Hoare scheme
                         *
                         * During partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |  ... == h  |          ... ? ...           |     ... > h     |
                         * +-------------------------------------------------------------+
                         * ^            ^                              ^                 ^
                         * low          l                              k              high
                         *
                         * After partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |       ... == h       |               > h ...                |
                         * +-------------------------------------------------------------+
                         * ^                      ^                                      ^
                         * low                    l                                   high
                         */
                        while(l < g)
                        {
                            swap(l, g);
                            while(cmp(h, *--g));
                            while(!cmp(h, *++l));
                        }
                    }

                    // If we are sorting
                    // arithmetic types,
                    // use branchless Lomuto
                    // for fewer branches.
                    else
                    {
                        /**
                         * Partition left by branchless Lomuto scheme
                         *
                         * During partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * | ... == h |  ... > h  | * |     ... ? ...      |   ... > h   |
                         * +-------------------------------------------------------------+
                         * ^          ^               ^                    ^             ^
                         * low        l               k                    g          high
                         *
                         * After partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |       ... == h       |               > h ...                |
                         * +-------------------------------------------------------------+
                         * ^                      ^                                      ^
                         * low                    l                                   high
                         */
                        E * k = l, p = *l;
                        while(k < g)
                        {
                            *k = *l;
                            *l = *++k;
                            l += !cmp(h, *l);
                        }
                        *k = *l; *l = p;
                    }

                    // Advance low to the
                    // start of the right
                    // partition.
                    low = l;

                    // If we have nothing
                    // left to sort, return.
                    if(low >= high)
                        return;

                    // Calculate the interval
                    // width and loop.
                    x = high - low;
                    continue;
                }
            }

            // Initialize l and k.
            E *l = ternary<Root>(low, low - 1),
             *k = high + 1, * g;

            // Assign midpoint to pivot
            // variable.
            const E p = *mid;

            // If we are sorting
            // non-arithmetic types, bring
            // left end inside. Left end
            // will be replaced and pivot
            // will be swapped back later.
            if constexpr(Expense)
                *mid = *low;

            // Skip over data
            // in place.
            while(cmp(*++l, p));

            // If we are sorting
            // arithmetic types, bring
            // left end inside. Left end
            // will be replaced and pivot
            // will be swapped back later.
            if constexpr(!Expense)
                *mid = *l;

            // Skip over data
            // in place.
            if(ternary<Root>
                (l == low + 1, l == low))
                while(!cmp(*--k, p) && k > l);
            else
                while(!cmp(*--k, p));

            // Will we do a significant
            // amount of work during
            // partitioning?
            bool work =
                ((l - low) + (high - k))
                    < (x >> 1U);

            // If we are sorting
            // non-arithmetic types and
            // conserving memory, use
            // Hoare for fewer moves.
            if constexpr (Expense && !Block)
            {
                /**
                 * Partition by branchful Hoare scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |  ... < p   |          ... ? ...           |    ... >= p     |
                 * +-------------------------------------------------------------+
                 * ^            ^                              ^                 ^
                 * low          l                              k              high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                while(l < k)
                {
                    swap(l, k);
                    while(cmp(*++l, p));
                    while(!cmp(*--k, p));
                }
                *low = *--l; *l = p;
            }

            // If we are sorting
            // non-arithmetic types and
            // not conserving memory, use
            // Block Hoare for fewer moves
            // and fewer branches.
            else if constexpr (Expense)
            {
                /**
                 * Partition by branchless (Block) Hoare scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * | ... < p | cmp |       ... ? ...        | cmp |   ... >= p   |
                 * +-------------------------------------------------------------+
                 * ^         ^     ^                        ^     ^              ^
                 * low       _low  l                        k     _high       high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                if(l < k)
                {
                    swap(l++, k);

                    // Set up blocks and
                    // align base pointers to
                    // the cacheline.
                    uint8_t
                        ols[BlockSize << 1U],
                        oks[BlockSize << 1U];
                    uint8_t
                        * olp = align(ols),
                        * okp = align(oks);

                    // Initialize frame pointers.
                    E * _low = l, * _high = k;

                    // Initialize offset counts and
                    // start indices for the swap routine.
                    size_t nl = 0, nk = 0, ls = 0, ks = 0;

                    while(l < k)
                    {
                        // If both blocks are empty, split
                        // the interval in two. Otherwise
                        // give the whole interval to one
                        // block.
                        size_t xx = k - l,
                            lspl = -(nl == 0) & (xx >> (nk == 0)),
                            kspl = -(nk == 0) & (xx - lspl);

                        // Fill the offset blocks. If the split
                        // for either block is larger than 64,
                        // crop it and unroll the loop. Otherwise,
                        // keep the loop fully rolled. This should
                        // only happen near the end of partitioning.
                        if(lspl >= BlockSize)
                        {
                            size_t i = -1;
                            do
                            {
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                            } while(i < BlockSize - 1);
                        }
                        else
                            for(size_t i = 0; i < lspl; ++i)
                                olp[nl] = i, nl += !cmp(*l++, p);

                        if(kspl >= BlockSize)
                        {
                            size_t i = 0;
                            do
                            {
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                            } while(i < BlockSize);

                        }
                        else
                            for(size_t i = 0; i < kspl;)
                                okp[nk] = ++i, nk += cmp(*--k, p);

                        // n = min(nl, nk), branchless.
                        size_t n =
                            (nl & -(nl < nk)) + (nk & -(nl >= nk));

                        // Swap the elements using the offsets.
                        // Set up working block pointers and lower
                        // block end pointer.
                        uint8_t*
                            ll = olp + ls, * kk = okp + ks, * e = ll + n;

                        // If the offset counts are equal, we are likely
                        // to be ascending or descending. If ascending,
                        // we don't need to do anything. If descending,
                        // use swaps to stay O(n). Both blocks must
                        // contain n offsets. If either block is empty,
                        // fill it and come back.
                        if(nl == nk)
                            for(; ll < e; ++ll, ++kk)
                                swap(_low + *ll, _high - *kk);

                        // Otherwise, swap using a cyclic permutation.
                        // Both blocks must contain n offsets. If either
                        // block is empty, fill it and come back.
                        else if(n > 0)
                        {
                            E* _l = _low + *ll, * _k = _high - *kk;
                            E t = *_l; *_l = *_k;
                            for(++ll, ++kk; ll < e; ++ll, ++kk)
                            {
                                _l = _low + *ll; *_k = *_l;
                                _k = _high - *kk; *_l = *_k;
                            }
                            *_k = t;
                        }

                        // Adjust offset counts and starts. If a block
                        // is empty, adjust its frame pointer.
                        nl -= n; nk -= n;
                        if(nl == 0) { ls = 0; _low = l; } else ls += n;
                        if(nk == 0) { ks = 0; _high = k; } else ks += n;
                    }

                    // Swap the remaining elements into place.
                    if(nl)
                    {
                        olp += ls;
                        for(uint8_t* ll = olp + nl;;)
                        {
                            swap(_low + *--ll, --l);
                            if(ll <= olp) break;
                        }
                    }

                    if(nk)
                    {
                        okp += ks;
                        for(uint8_t* kk = okp + nk;;)
                        {
                            swap(_high - *--kk, l++);
                            if(kk <= okp) break;
                        }
                    }
                }

                // Move the pivot into place.
                *low = *--l; *l = p;
            }

            // If we are sorting
            // arithmetic types, use
            // branchless Lomuto for
            // fewer branches.
            else
            {
                /**
                 * Partition by branchless Lomuto scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * | ... < p |  ... >= p  | * |     ... ? ...      |  ... >= p   |
                 * +-------------------------------------------------------------+
                 * ^         ^                ^                    ^             ^
                 * low       l                g                    k          high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                g = l;

                // If we are not conserving
                // memory, unroll the
                // loop for a tiny boost.
                if constexpr (Block)
                {
                    E* u = k - (BlockSize >> 2U);
                    while(g < u)
                    {
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                    }
                }

                while(g < k)
                {
                    *g = *l;
                    *l = *++g;
                    l += cmp(*l, p);
                }
                *g = *l; *l = p;
            }

            // Skip the pivot.
            g = l + (l < high);
            l -= (l > low);

            // Cheaply calculate an
            // eighth of the interval.
            const size_t
                _8th = x >> 3U;

            // Calculate interval widths.
            const size_t
                ls = l - low,
                gs = high - g;

            // If the partition is fairly
            // balanced, try insertion sort.
            // If insertion sort runtime
            // trends higher than O(n), fall
            // back to quicksort.
            if(ls >= _8th &&
               gs >= _8th)
            {
                if(work) goto l1;
                if(!iSort<0,0>(low, l, cmp, leftmost))
                    goto l1;
                if(!iSort<1,0>(g, high, cmp))
                    goto l2;
                return;
            }

            // The partition is not balanced.
            // Scramble some elements and
            // try to break the pattern.
            scramble(low, l, ls);
            scramble(g, high, gs);

            // This was a bad partition,
            // so decrement the height.
            // When the height drops below
            // zero, we will use heapsort.
            --height;

            // Sort the left portion.
            l1: qSort<E, 0, Expense, Block>
                (low, l, height, cmp, leftmost);

            // Sort the right portion
            // iteratively.
            l2: low = g;

            // Find the width of the
            // interval.
            x = high - low;

            // If this is the root,
            // sort the interval
            // by insertion sort
            // if small enough.
            if constexpr (Root)
                if (x < InsertionThreshold)
                {
                    // If we are in the Root,
                    // insertion sort will
                    // be unguarded.
                    iSort<1,0,0>(low, high, cmp);
                    return;
                }

            // If this is the root node,
            // heap sort when the runtime
            // trends towards quadratic.
            if constexpr (Root)
                if(height < 0)
                    return hSort(low, high, cmp);

            leftmost = false;
        }
    }

    /**
     * sort
     *
     * @tparam Block whether to use block partitioning on large types
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<bool Block = true, typename E, class Cmp>
    inline void blipsort
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp
    )
    {
        if(cnt < InsertionThreshold)
        {
            // Nothing to sort below a count
            // of two (this also guards the
            // cnt - 1 below against wraparound).
            if(cnt > 1)
                iSort<0,1,0>(a, a + (cnt - 1), cmp);
            return;
        }
        return qSort
            <E, 1, !std::is_arithmetic<E>::value &&
                !std::is_pointer<E>::value, Block>
            (a, a + (cnt - 1), log2(cnt), cmp);
    }
}

namespace Arrays
{
    /**
     * blipsort
     *
     * Sorts the given array with the provided comparator,
     * or in ascending order if none is given.
     *
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<typename E, class Cmp = std::less<>>
    inline void blipsort
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp = std::less<>()
    )
    {
        Algo::blipsort(a, cnt, cmp);
    }

    /**
     * blipsort_embed
     *
     * Sorts the given array with the provided comparator,
     * or in ascending order if none is given. Conserves
     * memory by using branchy Hoare instead of
     * Block Hoare on large types.
     *
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<typename E, class Cmp = std::less<>>
    inline void blipsort_embed
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp = std::less<>()
    )
    {
        Algo::blipsort<0>(a, cnt, cmp);
    }
}

#endif //SORT_H
--------------------------------------------------------------------------------