├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
├── blipsort_speed.png
└── sort.h

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------

pdqsort.h

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2024 Ellie Moore

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# *Branchless Lomuto in Pattern-Defeating Quicksort (Blipsort)*

A highly optimized, memory-conscious Introsort variant that draws from PDQsort, Java's Dual-Pivot Quicksort, BlockQuicksort, and Orson Peters' and Lukas Bergdoll's branchless Lomuto partitioning.

## Iterative Version

[Here](https://github.com/RedBedHed/blipsort_iterative)

## Speed

![Speed](https://github.com/RedBedHed/BLPDQsort/blob/main/blipsort_speed.png)

##### *clang 16, -O3*

## Complexity

| Best | Average | Worst   | Memory |
|------|---------|---------|--------|
| n    | n log n | n log n | log n  |

## Visualization

https://github.com/RedBedHed/blipsort/assets/58797872/00986779-05a3-430a-bc67-11eb45a54756

###### *Blipsort, ported into the Sound of Sorting program by Timo Bingmann*

## Techniques

### Branchless Lomuto
The decades-old partitioning algorithm recently made a resurgence when researchers discovered ways to remove the inner branch. Orson Peters' and Lukas Bergdoll's [method](https://orlp.net/blog/branchless-lomuto-partitioning/) is the fastest yet. It maintains a gap in the data so that each element can be placed with two moves per iteration, rather than with a three-move swap.

For arithmetic and pointer types, Blipsort employs branchless Lomuto partitioning. For other, larger types, Blipsort uses branchful or block Hoare partitioning. Branchful Hoare partitioning is slower than fulcrum or block partitioning, but it uses no extra offset memory, making it the better fit for memory-constrained embedded systems. Block Hoare partitioning is significantly faster than its branchy counterpart, but it does use extra offset memory, making it the better fit for PCs and larger embedded systems.
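
To illustrate the idea, here is a minimal, self-contained sketch of gap-based partitioning (a hypothetical `gap_lomuto` helper written for this README, not the loop Blipsort actually ships in sort.h, which fuses partitioning with pivot selection and fallback logic). It assumes the pivot value sits at `a[0]` and that elements are cheap to copy:

```c++
#include <cstddef>

// Sketch of gap-based (branchless) Lomuto partitioning. Assumes n >= 1
// and that a[0] holds the pivot p. On return, a[0..l) < p, a[l] == p,
// and a[l+1..n) >= p; the pivot's final index l is returned.
template<typename E>
std::size_t gap_lomuto(E* a, std::size_t n)
{
    const E p = a[0];         // Lift the pivot out; index 0 is now a gap.
    std::size_t l = 0, g = 0; // a[0..l) < p; the gap lives at index g.
    while (g + 1 < n)
    {
        a[g] = a[l];          // Move 1: refill the gap.
        a[l] = a[++g];        // Move 2: park the next element at l.
        l += (a[l] < p);      // Branchless: the comparison feeds an
    }                         // index increment, not a jump.
    a[g] = a[l];              // Close the gap.
    a[l] = p;                 // Drop the pivot into place.
    return l;
}
```

Because the hot loop contains no unpredictable branch, it runs at a nearly constant rate on random data, which is exactly where branchy Lomuto and Hoare partitioning lose the most time to mispredictions.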

###### *Note: Branchy Hoare partitioning is also slower than two- and three-pivot partitioning on random data, although only marginally so.*

While it works wonders for random arrays, branchless Lomuto does struggle with descending data. Blipsort attempts to remedy this by sampling five elements from the array and rotating the interval when all five are strictly descending. However, this approach does not break every descending pattern. When an array contains strictly descending runs at intervals, Lomuto partitioning can slow down quite significantly in comparison to Hoare (in all fairness, Hoare is particularly well-suited for descending patterns).

### Pivot Selectivity
Blipsort carefully selects the pivot from the middle of five sorted candidates. These candidates allow the sort to determine whether the data in the current interval is approximately descending and inform its "partition left" strategy.

### Introspection
Blipsort is introspective, switching to a guaranteed n log(n) sort if its runtime trends toward quadratic. Like PDQsort, Blipsort switches to Heapsort after log(n) "bad" partitions (partitions that are significantly unbalanced).

### Insertion Sort
Blipsort uses Insertion sort on small intervals where asymptotic complexity matters less and instruction overhead matters more. Blipsort employs Java's Pair Insertion sort on every interval except the leftmost. Pair insertion sort inserts two elements at a time and doesn't need to perform a lower bound check, making it slightly faster than normal insertion sort in the context of quicksort.

### Pivot Retention
Similar to PDQsort, if any of the three middlemost candidate pivots is equal to the rightmost element of the partition at left, Blipsort moves equal elements to the left with branchless Lomuto and continues to the right, solving the Dutch national flag problem and yielding linear time on data comprised of equal elements.

### Optimism
Similar to PDQsort, if the partition is "good" (not highly unbalanced) and we have done little work in partitioning, Blipsort switches to insertion sort. If the insertion sort makes more than a constant number of moves, Blipsort bails and resumes quicksort. This allows Blipsort to achieve linear time on already-sorted data.

Work is calculated as:
```c++
bool work = ((skipped_left) + (skipped_right)) < (interval_width / 2);
```

### Breaking Patterns
Like PDQsort, if the partition is bad, Blipsort scrambles some elements to break up patterns. Unlike PDQsort, Blipsort does not introduce completely fresh pivot candidates.

### Rotation
When all of the candidate pivots are strictly descending, it is very likely that the interval is descending as well. Lomuto partitioning slows significantly on descending data. Therefore, Blipsort neglects to sort descending candidates and instead swap-rotates the entire interval before partitioning.

### Custom Comparators
Blipsort allows its user to supply a custom boolean comparator. A comparator is best implemented as a branchless lambda: a lambda can be inlined by an optimizing compiler, while a comparator passed as a function pointer typically cannot.
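
For example, to sort an array of `int` in descending order, pass a branchless lambda to the three-argument overload of `Arrays::blipsort` (any callable taking two elements and returning `bool` works):

```c++
int a[] = { 5, 1, 4, 2, 3 };
Arrays::blipsort(a, 5, [](int x, int y) { return x > y; });
// a is now { 5, 4, 3, 2, 1 }
```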

## Usage

To sort with branchless Lomuto on small types and block Hoare on large types, call blipsort like so:

```c++
Arrays::blipsort(array, size);
```

To sort with branchless Lomuto on small types and branchy Hoare on large types (to conserve memory), call blipsort like so:

```c++
Arrays::blipsort_embed(array, size);
```

## Sources

[Here](https://github.com/orlp/pdqsort)
is the PDQsort algorithm by Orson Peters

[Here](https://orlp.net/blog/branchless-lomuto-partitioning/)
is the branchless Lomuto blog post by Orson Peters and Lukas Bergdoll

[Here](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/DualPivotQuicksort.java)
is Java's Dual Pivot Quicksort

--------------------------------------------------------------------------------
/blipsort_speed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/RedBedHed/blipsort/03c0adca7c1bd33a751643356a22881a12b9a1cb/blipsort_speed.png
--------------------------------------------------------------------------------
/sort.h:
--------------------------------------------------------------------------------
#pragma once
#ifndef SORT_H
#define SORT_H
#include <cstdint>
#include <cassert>
#include <bit>
#include <functional>
#include <type_traits>

namespace Algo
{ enum : uint32_t
    {
        InsertionThreshold = 88,
        AscendingThreshold = 8,
        LargeDataThreshold = 128,
        BlockSize = 64,
#if __cpp_lib_bitops >= 201907L
        DoubleWordBitCount = 31,
#else
        DeBruijnShiftAmount = 58
#endif
    };

    /**
     * The DeBruijn constant.
     */
#if __cpp_lib_bitops < 201907L
    constexpr uint64_t DeBruijn64 =
        0x03F79D71B4CB0A89L;
#endif

    /**
     * The DeBruijn map from key to integer
     * square index.
     */
#if __cpp_lib_bitops < 201907L
    constexpr uint8_t DeBruijnTableF[] =
    {
        0, 47, 1, 56, 48, 27, 2, 60,
        57, 49, 41, 37, 28, 16, 3, 61,
        54, 58, 35, 52, 50, 42, 21, 44,
        38, 32, 29, 23, 17, 11, 4, 62,
        46, 55, 26, 59, 40, 36, 15, 53,
        34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30, 9, 24,
        13, 18, 8, 12, 7, 6, 5, 63
    };
#endif

    /**
     * Fill trailing bits using prefix fill.
     *
     * @code
     * Example:
     * 10000000 >> 1
     * = 01000000 | 10000000
     * = 11000000 >> 2
     * = 00110000 | 11000000
     * = 11110000 >> 4
     * = 00001111 | 11110000
     * = 11111111
     * @endcode
     * @param x the integer to fill
     */
#if __cpp_lib_bitops < 201907L
    constexpr void parallelPrefixFill
    (
        uint32_t & x
    )
    {
        x |= x >> 1U;
        x |= x >> 2U;
        x |= x >> 4U;
        x |= x >> 8U;
        x |= x >> 16U;
    }
#endif

    /**
     * Calculates the floor of log2.
     *
     * @authors Kim Walisch - source
     * @authors Mark Dickinson - source
     * @authors Ellie Moore
     * @param l the integer to scan
     * @precondition l != 0
     * @return index (0..31) of the most significant one bit
     */
    constexpr int log2
    (
        uint32_t l
    )
    {
        assert(l != 0);
#if __cpp_lib_bitops >= 201907L
        return std::countl_zero(l) ^ DoubleWordBitCount;
#else
        parallelPrefixFill(l);
        return DeBruijnTableF[(int)
            ((l * DeBruijn64) >> DeBruijnShiftAmount)
        ];
#endif
    }

    /**
     * A simple swap method.
     *
     * @tparam E the element type
     * @param i the first element pointer
     * @param j the second element pointer
     */
    template<typename E>
    constexpr void swap
    (
        E *const i,
        E *const j
    )
    {
        E const el = *i; *i = *j; *j = el;
    }

    /**
     * A generic "sift down" method (AKA max-heapify).
     *
     * @tparam E the element type
     * @param a the pointer to the base of the current
     * sub-array
     * @param i the starting index
     * @param size the size of the current sub-array
     * @param cmp the comparator
     */
    template<typename E, class Cmp>
    inline void siftDown
    (
        E* const a,
        const int i,
        const int size,
        const Cmp cmp
    )
    {
        // Store size in
        // a local variable.
        const size_t n = size;

        // Establish non-leaf
        // boundary.
        const size_t o = n >> 1U;

        // Extract the element
        // to sift.
        E z = a[i];

        // Initialize temporary
        // variables.
        size_t x = i, l, r;

        // Consider only non-leaf
        // nodes.
        while(x < o)
        {
            // y is currently the
            // left child element.
            // Note: "l" here stands
            // for "left".
            r = (l = (x << 1U) + 1) + 1;
            E y = a[l];

            // If the right child is
            // within the heap...
            // AND
            // if the right child element
            // is greater than the left
            // child element,
            // THEN
            // assign the right child to
            // y and the right index to l.
            // Note: "l" now stands
            // for "larger".
            if(r < n && cmp(y, a[r]))
                y = a[l = r];

            // If y is less than or
            // equal to the element
            // we are sifting, then
            // we are done.
            if(!cmp(z, y)) break;

            // Move y up to the
            // parent index.
            a[x] = y;

            // Set the parent index to
            // be the index of
            // the largest child.
            x = l;
        }

        // Place the sifted element.
        a[x] = z;
    }

    /**
     * Heap Sort
     *
     * Classical heap sort that sorts the given range
     * in ascending order, building a max heap and
     * continuously sifting/swapping the max element
     * to the previous rightmost index.
     *
     * @author Ellie Moore
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param cmp the comparator
     */
    template<typename E, class Cmp>
    inline void hSort
    (
        E* const low,
        E* const high,
        const Cmp cmp
    )
    {
        E* r = high + 1;
        E* const l = low;

        // Build the heap.
        int x = r - l;
        for(int i =
            (x >> 1U); i >= 0; --i)
            siftDown(l, i, x, cmp);

        // Sort.
        while(l < --r)
        {
            const E z = *l; *l = *r;
            siftDown(l, 0, --x, cmp);
            *r = z;
        }
    }

    /**
     * Insertion Sort
     *
     * Classical ascending insertion sort packaged with a
     * "pairing" optimization to be used in the context of
     * Quicksort.
     *
     * This optimization is used whenever the portion of
     * the array to be sorted is padded on the left by
     * a portion with lesser elements. The fact that all of
     * the elements on the left are automatically less than
     * the elements in the current portion allows us to skip
     * the costly lower boundary check in the nested loops
     * and insert two elements in one go.
     *
     * @authors Josh Bloch - source
     * @authors Jon Bentley - source
     * @authors Orson Peters - source
     * @authors Ellie Moore
     * @tparam NoGuard whether to jump straight into the
     * unguarded pair insertion sort loop
     * @tparam Guard whether to jump straight into the
     * guarded insertion sort loop
     * @tparam Bail are we sorting optimistically?
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param cmp the comparator
     * @param leftmost whether this is the leftmost partition
     */
    template
    <bool NoGuard, bool Guard, bool Bail = true, typename E, class Cmp>
    inline bool iSort
    (
        E *const low,
        E *const high,
        const Cmp cmp,
        const bool leftmost = true
    )
    {
        E* l = low;
        E* r = high;
        int moves = 0;

        // If a template flag forces a
        // variant, jump straight into
        // the guarded (g1) or unguarded
        // pair insertion (g2) loop.
        if constexpr (Guard)
            goto g1;
        if constexpr (NoGuard)
            goto g2;

        if (leftmost)
        {
            g1:

            // Traditional
            // insertion
            // sort.
            for (E *i = l + 1; i <= r; ++i)
            {
                E t = *i, *j = i - 1;
                for (; j >= l && cmp(t, *j); --j)
                    j[1] = *j;
                j[1] = t;

                if constexpr (Bail)
                {
                    // If we have moved too
                    // many elements, abort.
                    moves += (i - 1) - j;
                    if(moves > AscendingThreshold)
                        return false;
                }
            }
        }
        else
        {
            g2:

            // Pair insertion sort.
            // Skip elements that are
            // in ascending order.
            do if (l++ >= r) return true;
            while (!cmp(*l, *(l - 1)));

            // This sort uses the sub
            // array at left to avoid
            // the lower bound check.
            // Assumes that this is not
            // the leftmost partition.
            for (E *i = l; ++l <= r; i = ++l)
            {
                E ex = *i, ey = *l;

                // Make sure that
                // we insert the
                // larger element
                // first.
                if (cmp(ey, ex))
                {
                    ex = ey;
                    ey = *i;
                    if constexpr (Bail)
                        ++moves;
                }

                // Insert the two
                // in one downward
                // motion.
                while (cmp(ey, *--i))
                    i[2] = *i;
                (++i)[1] = ey;
                while (cmp(ex, *--i))
                    i[1] = *i;
                i[1] = ex;

                if constexpr (Bail)
                {
                    // If we have moved too
                    // many elements, abort.
                    moves += (l - 2) - i;
                    if(moves > AscendingThreshold)
                        return false;
                }
            }

            // For odd length arrays,
            // insert the last element.
            E ez = *r;
            while (cmp(ez, *--r))
                r[1] = *r;
            r[1] = ez;
        }
        return true;
    }

    /**
     * Explicit constexpr ternary.
     *
     * @tparam EXP the compile-time condition
     * @tparam E the return type
     * @param a the true value
     * @param b the false value
     */
    template<bool EXP, typename E>
    constexpr E ternary
    (
        E a,
        E b
    )
    {
        if constexpr (EXP) return a;
        else return b;
    }

    /**
     * Scramble a few elements to help
     * break patterns.
     *
     * @tparam E the element type
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param len the length of the interval
     */
    template<typename E>
    constexpr void scramble
    (
        E* const low,
        E* const high,
        const size_t len
    )
    {
        if(len >= InsertionThreshold)
        {
            const int _4th = len >> 2U;
            swap(low, low + _4th);
            swap(high, high - _4th);
            if(len > LargeDataThreshold)
            {
                swap(low + 1, low + (_4th + 1));
                swap(low + 2, low + (_4th + 2));
                swap(high - 2, high - (_4th + 2));
                swap(high - 1, high - (_4th + 1));
            }
        }
    }

    /**
     * Aligns the given pointer on a 64-byte
     * cacheline.
     *
     * @tparam E the element type
     * @param p pointer to memory to align
     */
    template<typename E>
    constexpr E* align
    (
        E* p
    )
    {
        return reinterpret_cast<E*>((
            reinterpret_cast<uintptr_t>(p) + (BlockSize - 1)
        ) & -uintptr_t(BlockSize));
    }

    /**
     * Blipsort
     *
     * Branchless Lomuto
     *
     * The decades-old partitioning algorithm recently
     * made a resurgence when researchers discovered
     * ways to remove the inner branch. Lukas Bergdoll
     * and Orson Peters' method is the fastest yet. It
     * maintains a gap in the data so that elements can
     * be moved twice per iteration rather than swapped
     * (three moves). For arithmetic and pointer types,
     * Blipsort employs branchless Lomuto partitioning.
     * For other, larger types, Blipsort uses branchless
     * (block) or branchful Hoare partitioning.
     *
     * Pivot Selectivity
     *
     * Blipsort carefully selects the pivot from the
     * middle of five sorted candidates. These
     * candidates allow the sort to determine whether
     * the data in the current interval is approximately
     * descending and inform its "partition left" strategy.
     *
     * Insertion Sort
     *
     * Blipsort uses Insertion sort on small intervals
     * where asymptotic complexity matters less and
     * instruction overhead matters more. Blipsort
     * employs Java's Pair Insertion sort on every
     * interval except the leftmost. Pair insertion
     * sort inserts two elements at a time and doesn't
     * need to perform a lower bound check, making it
     * slightly faster than normal insertion sort in
     * the context of quicksort.
     *
     * Pivot Retention
     *
     * Similar to PDQsort, if any of the three middlemost
     * candidate pivots is equal to the rightmost element
     * of the partition at left, Blipsort moves equal
     * elements to the left with branchless Lomuto and
     * continues to the right, solving the Dutch national
     * flag problem and yielding linear time on data
     * comprised of equal elements.
     *
     * Optimism
     *
     * Similar to PDQsort, if the partition is "good"
     * (not highly unbalanced), Blipsort switches to
     * insertion sort. If the insertion sort makes more
     * than a constant number of moves, Blipsort bails
     * and resumes quicksort. This allows Blipsort to
     * achieve linear time on already-sorted data.
     *
     * Breaking Patterns
     *
     * Like PDQsort, if the partition is bad, Blipsort
     * scrambles some elements to break up patterns.
     *
     * Rotation
     *
     * When all of the candidate pivots are strictly
     * descending, it is very likely that the interval
     * is descending as well. Lomuto partitioning slows
     * significantly on descending data. Therefore,
     * Blipsort neglects to sort descending candidates
     * and instead swap-rotates the entire interval
     * before partitioning.
     *
     * Custom Comparators
     *
     * Blipsort allows its user to implement a custom
     * boolean comparator. A comparator is best
     * implemented as a branchless lambda. A lambda
     * can be inlined by an optimizing compiler, while
     * a comparator passed as a function pointer
     * typically cannot.
     *
     * @authors Josh Bloch - source
     * @authors Jon Bentley - source
     * @authors Orson Peters - source
     * @authors Lukas Bergdoll - source
     * @authors Stefan Edelkamp - source
     * @authors Armin Weiß - source
     * @authors Ellie Moore
     * @tparam E the element type
     * @tparam Root whether this is the sort root
     * @tparam Expense whether the element type is expensive to move
     * @tparam Block whether to use block partitioning on large types
     * @param leftmost whether this is the leftmost partition
     * @param low a pointer to the leftmost index
     * @param high a pointer to the rightmost index
     * @param height the distance of the current sort
     * tree from the initial height of log2(n)
     * @param cmp the comparator
     */
    template
    <typename E, bool Root, bool Expense, bool Block, class Cmp>
    inline void qSort
    (
        E * low,
        E * high,
        int height,
        const Cmp cmp,
        bool leftmost = true
    )
    {
        // Tail call loop.
        for(size_t x = high - low;;)
        {
            // If this is not the
            // root node, sort the
            // interval by insertion
            // sort if small enough.
            if constexpr (!Root)
                if (x < InsertionThreshold)
                {
                    // If we are in the Root,
                    // we won't be insertion
                    // sorting until we
                    // iterate on the rightmost
                    // part. However, we are
                    // not in the root here, so
                    // we need to be careful
                    // to use guarded insertion
                    // sort if this is the
                    // leftmost partition.
                    iSort<0,0,0>
                        (low, high, cmp, leftmost);
                    return;
                }

            // If this is not the root node,
            // heap sort when the runtime
            // trends towards quadratic.
            if constexpr (!Root)
                if(height < 0)
                    return hSort(low, high, cmp);

            // Find an inexpensive
            // approximation of a third of
            // the interval.
            const size_t y = x >> 2U,
                _3rd = y + (y >> 1U),
                _6th = _3rd >> 1U;

            // Find an approximate
            // midpoint of the interval.
            E *const mid = low + (x >> 1U);

            // Assign tercile indices
            // to candidate pivots.
            E *const sl = low + _3rd;
            E *const sr = high - _3rd;

            // Assign outer indices
            // to candidate pivots.
            E * cl = low + _6th;
            E * cr = high - _6th;

            // If the candidates aren't
            // descending...
            // Insertion sort all five
            // candidate pivots in-place.
            if((!cmp(*cl, *low)) ||
               (!cmp(*sl, *cl)) ||
               (!cmp(*mid, *sl)) ||
               (!cmp(*sr, *mid)) ||
               (!cmp(*cr, *sr)) ||
               (!cmp(*high, *cr)))
            {

                if(cmp(*low, *cl))
                    cl = low;
                if(cmp(*cr, *high))
                    cr = high;

                if (cmp(*sl, *cl))
                {
                    E e = *sl;
                    *sl = *cl;
                    *cl = e;
                }

                if (cmp(*mid, *sl))
                {
                    E e = *mid;
                    *mid = *sl;
                    *sl = e;
                    if (cmp(e, *cl))
                    {
                        *sl = *cl;
                        *cl = e;
                    }
                }

                if (cmp(*sr, *mid))
                {
                    E e = *sr;
                    *sr = *mid;
                    *mid = e;
                    if (cmp(e, *sl))
                    {
                        *mid = *sl;
                        *sl = e;
                        if (cmp(e, *cl))
                        {
                            *sl = *cl;
                            *cl = e;
                        }
                    }
                }

                if (cmp(*cr, *sr))
                {
                    E e = *cr;
                    *cr = *sr;
                    *sr = e;
                    if (cmp(e, *mid))
                    {
                        *sr = *mid;
                        *mid = e;
                        if (cmp(e, *sl))
                        {
                            *mid = *sl;
                            *sl = e;
                            if (cmp(e, *cl))
                            {
                                *sl = *cl;
                                *cl = e;
                            }
                        }
                    }
                }
            }

            // If the candidates are
            // descending, then the
            // interval is likely to
            // be descending somewhat.
            // Rotate the entire interval
            // around the midpoint.
            // Don't worry about the
            // even size case. One
            // out-of-order element
            // is no big deal.
            else
            {
                E* u = low;
                E* q = high;
                while(u < mid)
                {
                    E e = *u;
                    *u++ = *q;
                    *q-- = e;
                }
            }

            // If any middle candidate
            // pivot is equal to the
            // rightmost element of the
            // partition to the left,
            // swap pivot duplicates to
            // the side and sort the
            // remainder. This is an
            // alternative to Dutch flag
            // partitioning.
            if(!leftmost)
            {
                // Check the pivot to
                // the left.
                E h = *(low - 1);
                if(!cmp(h, *sl) ||
                   !cmp(h, *mid) ||
                   !cmp(h, *sr))
                {
                    E* l = low - 1,
                     * g = high + 1;

                    // Skip over data
                    // in place.
                    while(cmp(h, *--g));

                    if(g == high)
                        while(!cmp(h, *++l) && l < g);
                    else
                        while(!cmp(h, *++l));

                    // If we are sorting
                    // non-arithmetic types,
                    // use Hoare for fewer
                    // moves.
                    if constexpr (Expense)
                    {
                        /**
                         * Partition left by branchful Hoare scheme
                         *
                         * During partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |  ... == h  |          ... ? ...           |     ... > h     |
                         * +-------------------------------------------------------------+
                         * ^            ^                              ^                 ^
                         * low          l                              k              high
                         *
                         * After partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |       ... == h       |               > h ...                |
                         * +-------------------------------------------------------------+
                         * ^                      ^                                      ^
                         * low                    l                                   high
                         */
                        while(l < g)
                        {
                            swap(l, g);
                            while(cmp(h, *--g));
                            while(!cmp(h, *++l));
                        }
                    }

                    // If we are sorting
                    // arithmetic types,
                    // use branchless Lomuto
                    // for fewer branches.
                    else
                    {
                        /**
                         * Partition left by branchless Lomuto scheme
                         *
                         * During partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * | ... == h |  ... > h  | * |     ... ? ...      |   ... > h   |
                         * +-------------------------------------------------------------+
                         * ^          ^               ^                    ^             ^
                         * low        l               k                    g          high
                         *
                         * After partitioning:
                         *
                         * +-------------------------------------------------------------+
                         * |       ... == h       |               > h ...                |
                         * +-------------------------------------------------------------+
                         * ^                      ^                                      ^
                         * low                    l                                   high
                         */
                        E * k = l, p = *l;
                        while(k < g)
                        {
                            *k = *l;
                            *l = *++k;
                            l += !cmp(h, *l);
                        }
                        *k = *l; *l = p;
                    }

                    // Advance low to the
                    // start of the right
                    // partition.
                    low = l;

                    // If we have nothing
                    // left to sort, return.
                    if(low >= high)
                        return;

                    // Calculate the interval
                    // width and loop.
                    x = high - low;
                    continue;
                }
            }

            // Initialize l and k.
            E *l = ternary<Root>(low, low - 1),
             *k = high + 1, * g;

            // Assign midpoint to pivot
            // variable.
            const E p = *mid;

            // If we are sorting
            // non-arithmetic types, bring
            // left end inside. Left end
            // will be replaced and pivot
            // will be swapped back later.
            if constexpr(Expense)
                *mid = *low;

            // Skip over data
            // in place.
            while(cmp(*++l, p));

            // If we are sorting
            // arithmetic types, bring
            // left end inside. Left end
            // will be replaced and pivot
            // will be swapped back later.
            if constexpr(!Expense)
                *mid = *l;

            // Skip over data
            // in place.
            if(ternary<Root>
                (l == low + 1, l == low))
                while(!cmp(*--k, p) && k > l);
            else
                while(!cmp(*--k, p));

            // Will we do a significant
            // amount of work during
            // partitioning?
            bool work =
                ((l - low) + (high - k))
                    < (x >> 1U);

            // If we are sorting
            // non-arithmetic types and
            // conserving memory, use
            // Hoare for fewer moves.
            if constexpr (Expense && !Block)
            {
                /**
                 * Partition by branchful Hoare scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |  ... < p   |          ... ? ...           |    ... >= p     |
                 * +-------------------------------------------------------------+
                 * ^            ^                              ^                 ^
                 * low          l                              k              high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                while(l < k)
                {
                    swap(l, k);
                    while(cmp(*++l, p));
                    while(!cmp(*--k, p));
                }
                *low = *--l; *l = p;
            }

            // If we are sorting
            // non-arithmetic types and
            // not conserving memory, use
            // Block Hoare for fewer moves
            // and fewer branches.
            else if constexpr (Expense)
            {
                /**
                 * Partition by branchless (Block) Hoare scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * | ... < p | cmp |       ... ? ...        | cmp |   ... >= p   |
                 * +-------------------------------------------------------------+
                 * ^         ^     ^                        ^     ^              ^
                 * low       _low  l                        k     _high       high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                if(l < k)
                {
                    swap(l++, k);

                    // Set up blocks and
                    // align base pointers to
                    // the cacheline.
                    uint8_t
                        ols[BlockSize << 1U],
                        oks[BlockSize << 1U];
                    uint8_t
                        * olp = align(ols),
                        * okp = align(oks);

                    // Initialize frame pointers.
                    E * _low = l, * _high = k;

                    // Initialize offset counts and
                    // start indices for the swap routine.
                    size_t nl = 0, nk = 0, ls = 0, ks = 0;

                    while(l < k)
                    {
                        // If both blocks are empty, split
                        // the interval in two. Otherwise
                        // give the whole interval to one
                        // block.
                        size_t xx = k - l,
                            lspl = -(nl == 0) & (xx >> (nk == 0)),
                            kspl = -(nk == 0) & (xx - lspl);

                        // Fill the offset blocks. If the split
                        // for either block is larger than 64,
                        // crop it and unroll the loop. Otherwise,
                        // keep the loop fully rolled. This should
                        // only happen near the end of partitioning.
                        if(lspl >= BlockSize)
                        {
                            size_t i = -1;
                            do
                            {
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                                olp[nl] = ++i; nl += !cmp(*l++, p);
                            } while(i < BlockSize - 1);
                        }
                        else
                            for(size_t i = 0; i < lspl; ++i)
                                olp[nl] = i, nl += !cmp(*l++, p);

                        if(kspl >= BlockSize)
                        {
                            size_t i = 0;
                            do
                            {
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                                okp[nk] = ++i; nk += cmp(*--k, p);
                            } while(i < BlockSize);

                        }
                        else
                            for(size_t i = 0; i < kspl;)
                                okp[nk] = ++i, nk += cmp(*--k, p);

                        // n = min(nl, nk), branchless.
                        size_t n =
                            (nl & -(nl < nk)) + (nk & -(nl >= nk));

                        // Swap the elements using the offsets.
                        // Set up working block pointers and lower
                        // block end pointer.
                        uint8_t*
                            ll = olp + ls, * kk = okp + ks, * e = ll + n;

                        // If the offset counts are equal, we are likely
                        // to be ascending or descending. If ascending,
                        // we don't need to do anything. If descending,
                        // use swaps to stay O(n). Both blocks must
                        // contain n offsets. If either block is empty,
                        // fill it and come back.
                        if(nl == nk)
                            for(; ll < e; ++ll, ++kk)
                                swap(_low + *ll, _high - *kk);

                        // Otherwise, swap using a cyclic permutation.
                        // Both blocks must contain n offsets. If either
                        // block is empty, fill it and come back.
                        else if(n > 0)
                        {
                            E* _l = _low + *ll, * _k = _high - *kk;
                            E t = *_l; *_l = *_k;
                            for(++ll, ++kk; ll < e; ++ll, ++kk)
                            {
                                _l = _low + *ll; *_k = *_l;
                                _k = _high - *kk; *_l = *_k;
                            }
                            *_k = t;
                        }

                        // Adjust offset counts and starts. If a block
                        // is empty, adjust its frame pointer.
                        nl -= n; nk -= n;
                        if(nl == 0) { ls = 0; _low = l; } else ls += n;
                        if(nk == 0) { ks = 0; _high = k; } else ks += n;
                    }

                    // Swap the remaining elements into place.
                    if(nl)
                    {
                        olp += ls;
                        for(uint8_t* ll = olp + nl;;)
                        {
                            swap(_low + *--ll, --l);
                            if(ll <= olp) break;
                        }
                    }

                    if(nk)
                    {
                        okp += ks;
                        for(uint8_t* kk = okp + nk;;)
                        {
                            swap(_high - *--kk, l++);
                            if(kk <= okp) break;
                        }
                    }
                }

                // Move the pivot into place.
                *low = *--l; *l = p;
            }

            // If we are sorting
            // arithmetic types, use
            // branchless Lomuto for
            // fewer branches.
            else
            {
                /**
                 * Partition by branchless Lomuto scheme
                 *
                 * During partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * | ... < p |  ... >= p  | * |     ... ? ...      |  ... >= p   |
                 * +-------------------------------------------------------------+
                 * ^         ^                ^                    ^             ^
                 * low       l                g                    k          high
                 *
                 * After partitioning:
                 *
                 * +-------------------------------------------------------------+
                 * |        ... < p        |              >= p ...               |
                 * +-------------------------------------------------------------+
                 * ^                       ^                                     ^
                 * low                     l                                  high
                 */
                g = l;

                // If we are not conserving
                // memory, unroll the
                // loop for a tiny boost.
                if constexpr (Block)
                {
                    E* u = k - (BlockSize >> 2U);
                    while(g < u)
                    {
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                        *g = *l; *l = *++g; l += cmp(*l, p);
                    }
                }

                while(g < k)
                {
                    *g = *l;
                    *l = *++g;
                    l += cmp(*l, p);
                }
                *g = *l; *l = p;
            }

            // Skip the pivot.
            g = l + (l < high);
            l -= (l > low);

            // Cheaply calculate an
            // eighth of the interval.
            const size_t
                _8th = x >> 3U;

            // Calculate interval widths.
            const size_t
                ls = l - low,
                gs = high - g;

            // If the partition is fairly
            // balanced, try insertion sort.
            // If insertion sort runtime
            // trends higher than O(n), fall
            // back to quicksort.
            if(ls >= _8th &&
               gs >= _8th)
            {
                if(work) goto l1;
                if(!iSort<0,0>(low, l, cmp, leftmost))
                    goto l1;
                if(!iSort<1,0>(g, high, cmp))
                    goto l2;
                return;
            }

            // The partition is not balanced.
            // Scramble some elements and
            // try to break the pattern.
            scramble(low, l, ls);
            scramble(g, high, gs);

            // This was a bad partition,
            // so decrement the height.
            // When the height drops below
            // zero, we will use heapsort.
            --height;

            // Sort the left portion.
            l1: qSort<E, 0, Expense, Block>
                (low, l, height, cmp, leftmost);

            // Sort the right portion
            // iteratively.
            l2: low = g;

            // Find the width of the
            // interval.
            x = high - low;

            // If this is the root,
            // sort the interval
            // by insertion sort
            // if small enough.
            if constexpr (Root)
                if (x < InsertionThreshold)
                {
                    // If we are in the Root,
                    // insertion sort will
                    // be unguarded.
                    iSort<1,0,0>(low, high, cmp);
                    return;
                }

            // If this is the root node,
            // heap sort when the runtime
            // trends towards quadratic.
            if constexpr (Root)
                if(height < 0)
                    return hSort(low, high, cmp);

            leftmost = false;
        }
    }

    /**
     * sort
     *
     * @tparam Block whether to use block partitioning on large types
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<bool Block = true, typename E, class Cmp>
    inline void blipsort
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp
    )
    {
        if(cnt < InsertionThreshold)
        {
            // Nothing to sort below a count
            // of two (this also guards the
            // cnt - 1 below against wraparound).
            if(cnt > 1)
                iSort<0,1,0>(a, a + (cnt - 1), cmp);
            return;
        }
        return qSort
            <E, 1, !std::is_arithmetic<E>::value &&
                !std::is_pointer<E>::value, Block>
            (a, a + (cnt - 1), log2(cnt), cmp);
    }
}

namespace Arrays
{
    /**
     * blipsort
     *
     * Sorts the given array with the provided comparator,
     * or in ascending order if none is given.
     *
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<typename E, class Cmp = std::less<>>
    inline void blipsort
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp = std::less<>()
    )
    {
        Algo::blipsort(a, cnt, cmp);
    }

    /**
     * blipsort_embed
     *
     * Sorts the given array with the provided comparator,
     * or in ascending order if none is given. Conserves
     * memory by using branchy Hoare instead of
     * Block Hoare on large types.
     *
     * @tparam E the element type
     * @tparam Cmp the comparator type
     * @param a the array to be sorted
     * @param cnt the size of the array
     * @param cmp the comparator
     */
    template<typename E, class Cmp = std::less<>>
    inline void blipsort_embed
    (
        E* const a,
        const uint32_t cnt,
        const Cmp cmp = std::less<>()
    )
    {
        Algo::blipsort<0>(a, cnt, cmp);
    }
}

#endif //SORT_H
--------------------------------------------------------------------------------