├── LICENSE
├── README.md
├── bench.c
├── compiled
    └── sort.c
└── src
    ├── arith.singeli
    ├── base.singeli
    ├── common.singeli
    ├── count.singeli
    ├── dropsort.singeli
    ├── glide.singeli
    ├── ins.singeli
    ├── median.singeli
    ├── merge.singeli
    ├── network.singeli
    ├── partition.singeli
    ├── prefix.singeli
    ├── quicksort.singeli
    ├── radix.singeli
    ├── rh.singeli
    ├── small.singeli
    ├── sort.singeli
    └── xorshift.singeli


/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright (c) 2022 Marshall Lochbaum <mwlochbaum@gmail.com>
 2 | 
 3 | Permission to use, copy, modify, and/or distribute this software for any
 4 | purpose with or without fee is hereby granted.
 5 | 
 6 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
 7 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
 8 | AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
 9 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
10 | LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
11 | OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
12 | PERFORMANCE OF THIS SOFTWARE.
13 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Singeli sort
  2 | 
  3 | Algorithms in [Singeli](https://github.com/mlochbaum/Singeli), aiming for high performance and broad adaptivity for sorting CPU-native types (integers and floats). A secondary goal is a well-commented and readable codebase that explains what various methods are good for and how they're implemented to use hardware to its full potential. Will probably end up using SIMD if available to speed up a few things, but this is primarily intended to be a portable rather than SIMD-first sort.
  4 | 
  5 | Compile with (add `-t cpp` for C++ compatibility):
  6 | 
  7 |     singeli src/sort.singeli -o sort.c
  8 | 
  9 | Or the following without a Singeli install. CBQN builds on Unix-like systems (including macOS and WSL) in under a minute; see [docker build](https://github.com/vylsaz/cbqn-win-docker-build) for Windows. There's also a pre-compiled copy of sort.c in `compiled/sort.c`. It may not always be up to date.
 10 | 
 11 |     git clone https://github.com/dzaima/CBQN.git
 12 |     cd CBQN && make && cd -
 13 |     git clone https://github.com/mlochbaum/Singeli.git
 14 |     CBQN/BQN Singeli/singeli src/sort.singeli -o sort.c
 15 | 
 16 | To benchmark:
 17 | 
 18 |     gcc -O3 bench.c
 19 |     ./a.out
 20 | 
 21 | Exported functions are defined in src/sort.singeli, and their C prototypes appear at the bottom of sort.c: the arguments for sorts are array, length, aux (or scratch buffer), and possibly aux length in bytes. These are likely to change over time.
 22 | 
 23 | ## Overview
 24 | 
 25 | Singeli sort currently hybridizes the following algorithms; all are used for `sort32` and other functions use subsets. The overall structure is that the glidesort layer may call quicksort, which calls the different base cases in various situations.
 26 | 
 27 | - Quicksort partitioning from [fluxsort](https://github.com/scandum/fluxsort) and [crumsort](https://github.com/scandum/crumsort)
 28 | - Outer merge layer: modified [glidesort](https://github.com/orlp/glidesort) ([powersort](https://github.com/sebawild/powersort) rules made lazy to defer to quicksort if runs aren't found)
 29 | - Merging: based on [piposort](https://github.com/scandum/piposort)
 30 | - Small arrays: sorting networks as in [ipn_unstable](https://github.com/Voultapher/sort-research-rs/blob/main/src/unstable/rust_ipn.rs), extra merging and insertion following [quadsort](https://github.com/scandum/quadsort)
 31 | - Radix sort: mostly like [ska_sort_copy](https://github.com/skarupke/ska_sort)
 32 | - Counting sort: see [section](https://github.com/mlochbaum/rhsort#counting-sort) in rhsort
 33 | - [Robin Hood](https://github.com/mlochbaum/rhsort) sort
 34 | 
 35 | In progress, still has various issues:
 36 | 
 37 | - Drapesort, similar to [drop-Merge sort](https://github.com/emilk/drop-merge-sort)
 38 | 
 39 | Other methods to consider later:
 40 | 
 41 | - In-place partitioning with [pdqsort](https://github.com/orlp/pdqsort). Slower than crumsort but it does adapt to mostly-sorted data well.
 42 | - Interleaved merges and bidirectional partitioning from glidesort. These have not yet been demonstrated to improve performance relative to fluxsort, and there are indications that they slow things down on older processors in addition to bumping up code size. I'll wait for the paper explaining choices made before looking into them further.
 43 | 
 44 | ## Guide to the source
 45 | 
 46 | The source code is supposed to be the place to go to get full descriptions and details. I am certain it fails in this role—particularly in places where I don't expect anyone's reading, so please complain if a part you've chosen to read is not well explained!
 47 | 
 48 | The general-use files:
 49 | 
 50 | - sort.singeli Main file: include statements, and the sorting function definitions.
 51 | - base.singeli Basic definitions to be used elsewhere. This includes operators, which are all user-defined in Singeli.
 52 | - common.singeli Other definitions that are more specific than base but may be used in multiple places.
 53 | - arith.singeli Some log and square root code to keep it out of the way.
 54 | 
 55 | And specific algorithms:
 56 | 
 57 | - quicksort.singeli
 58 |   - partition.singeli Partitioning
 59 |   - median.singeli Medians and pseudomedians for picking candidates
 60 |     - xorshift.singeli Pseudo-random number generator (PRNG) avoids patterns
 61 | - merge.singeli Merging utilities and pisort
 62 | - glide.singeli Glidesort strategy: use merges for natural runs
 63 |   - (merge.singeli)
 64 | - small.singeli Small array sorting
 65 |   - network.singeli Sorting networks for some fixed sizes
 66 |   - ins.singeli Insertion sorting
 67 |   - (merge.singeli)
 68 | - radix.singeli Radix sorts
 69 |   - prefix.singeli Prefix sums
 70 | - count.singeli Counting sort
 71 |   - (prefix.singeli)
 72 | - rh.singeli Robin Hood sort, for evenly distributed data
 73 | - dropsort.singeli (unused) Dropsorts for nearly-sorted arrays
 74 | 
 75 | Some quick notes on Singeli. Everything in brackets `{}` is run at compile time, so a call like `dist{dn}{U, minv, maxv}` is all inlined (`dist` is called a generator). Functions are called with parentheses and are used rarely, for things that are exported, used in many places, or recursive.
 76 | 
 77 | All operators are user-defined, with many picked up from standard includes `skin/c` and `skin/cext`. Some of the ones that are unfamiliar relative to C are listed below.
 78 | 
 79 | | Syntax      | Meaning
 80 | |-------------|---------
 81 | | `x <{dn} y` | Compare `x` and `y`, flipping ordering if `dn` is `1`
 82 | | `x -> i`    | Value at offset `i` from pointer `x`
 83 | | `x <- v`    | Store `v` at pointer `x`
 84 | | `x <-{i} v` | Store `v` at offset `i` from pointer `x`
 85 | | `x <-> y`   | Swap values at pointers `x` and `y`
 86 | | `a <~> b`   | Swap variables of variables `a` and `b`
 87 | | `T~~v`      | Cast `v` to same-width type `T`
 88 | | `T^~v`      | Promote `v` to supertype `T`
 89 | | `T<~v`      | Narrowing conversion of `v` to type `T`
 90 | 
 91 | Singeli sort uses a fair amount of compile-time trickery to support lots of sorting functions while keeping the code reasonably clean. Functions all support sorting in both directions (`dn` is `0` for up and `1` for down), and many of them support a sort-by operation that actually passes around a tuple of pointers: one to be sorted and others to be moved in the same pattern. A related operation is "grade", which reorders indices as the data should be ordered, and leaves the data intact (it may partially or completely sort it in aux space). A few funny operators are used to support sort-by: for example `*+T` to turn tuple type `T` into a tuple of pointer types, `*?` to avoid loading from extra pointers until the values are needed, and `>*` to compare pointer values.
 92 | 
 93 | | Syntax       | Meaning
 94 | |--------------|---------
 95 | | `>p`         | Get first pointer only from multi-pointer
 96 | | `p >*{dn} q` | Compare `p` and `q` by value at first pointer
 97 | | `*?p`        | Lazy load at `p`
 98 | | `p *? i`     | Lazy load at offset `i` from `p`
 99 | | `*+T`        | Tuple of pointer types
100 | 


--------------------------------------------------------------------------------
/bench.c:
--------------------------------------------------------------------------------
  1 | #include <stdlib.h>
  2 | #include <stdio.h>
  3 | #include <string.h>
  4 | #include <time.h>
  5 | 
  6 | #define sortname "singelisort"
  7 | 
  8 | // Options for test to perform:
  9 | #if RANGES  // Small range
 10 |   #define datadesc "10,000 small-range 4-byte integers"
 11 | #elif WORST // RH worst case
 12 |   #define datadesc "small-range plus outlier"
 13 | #else       // Random
 14 |   #define datadesc "random 4-byte integers"
 15 | #endif
 16 | 
 17 | #if WORST
 18 |   #define MODIFY(arr) arr[0] = 3<<28
 19 | #else
 20 |   #define MODIFY(arr) (void)0
 21 | #endif
 22 | 
 23 | typedef int T;
 24 | typedef size_t U;
 25 | static U monoclock(void) {
 26 |   struct timespec ts;
 27 |   clock_gettime(CLOCK_MONOTONIC, &ts);
 28 |   return 1000000000*ts.tv_sec + ts.tv_nsec;
 29 | }
 30 | 
 31 | #include "sort.c"
 32 | 
 33 | static void sort32_alloc(T *x, U n) {
 34 |   U a = (n + 4*(n<1<<16 ? n : 1<<16))*sizeof(T);
 35 |   T *aux = (T*)malloc(a);
 36 |   sort32(x, n, aux, a);
 37 |   free(aux);
 38 | }
 39 | 
 40 | // For qsort
 41 | int cmpi(const void * ap, const void * bp) {
 42 |   T a=*(T*)ap, b=*(T*)bp;
 43 |   return (a>b) - (a<b);
 44 | }
 45 | 
 46 | #ifndef RANGES
 47 | static U n_iter(U n) { return 1+3000000/(20+n); }
 48 | #else
 49 | static U n_iter(U n) { return 1000; }
 50 | #endif
 51 | 
 52 | int main(int argc, char **argv) {
 53 |   printf("Sorting %s: %s\n", datadesc, sortname);
 54 |   // Command-line arguments are max or min,max
 55 |   // Inclusive range, with sizes 10^n tested
 56 |   U min=3, max=6; int ls=0;
 57 |   if (argc>1) {
 58 |     ls = argv[1][0]=='l';
 59 |     if (ls) {
 60 |       // Log line chart 100 to 1e7 with 44 points, plus 4 before for warmup
 61 |       min=0;
 62 |       max = (argc>2) ? atoi(argv[2]) : 48;
 63 |     } else {
 64 |       max=atoi(argv[argc-1]);
 65 |       if (argc>2) min=atoi(argv[argc-2]);
 66 |     }
 67 |   }
 68 | 
 69 |   U sizes[max+1];
 70 |   if (!ls) { for (U k=0,n=1 ; k<=max; k++,n*=10) sizes[k]=n; }
 71 |   else     { for (U k=0,n=34; k<=max; k++,n=n*1.3+(n<70)) sizes[k]=n; if(max==48)sizes[max]=10000000; }
 72 | 
 73 | #ifndef RANGES
 74 |   U s=sizes[max]; s+=n_iter(s)-1;
 75 |   U q=sizes[min]; q+=n_iter(q)-1; if (s<q) s=q;
 76 | #else
 77 |   U s=10000; s+=n_iter(s)-1;
 78 | #endif
 79 |   s*=sizeof(T);
 80 |   T *data = (T*)malloc(s), // Saved random data
 81 |     *sort = (T*)malloc(s), // Array to be sorted
 82 |     *chk  = (T*)malloc(s); // For checking with qsort
 83 |   srand(time(NULL));
 84 |   for (U k=min, m=0; k<=max; k++) {
 85 |     U n = sizes[k];
 86 |     U iter = n_iter(n), off = max-k;
 87 | #if RANGES
 88 |     n=10000; m=0;
 89 | #endif
 90 |     for (U e=n+off+iter-1; m<e; m++) {
 91 | #if WORST
 92 |       data[m]=rand()%1024;
 93 | #elif RANGES
 94 |       data[m]=rand()%sizes[k];
 95 | #else
 96 |       data[m]=rand();
 97 | #endif
 98 |     }
 99 |     s = n*sizeof(T);
100 | #if RANGES
101 |     printf("Testing range %8ld: ", sizes[k]);
102 | #else
103 |     printf("Testing size %8ld: ", n);
104 | #endif
105 |     // Test
106 | #ifndef NOTEST
107 |     memcpy(sort, data, s); MODIFY(sort); sort32_alloc(sort, n);
108 |     memcpy(chk , data, s); MODIFY(chk ); qsort(chk, n, sizeof(T), cmpi);
109 |     for (U i=0; i<n; i++) if (sort[i]!=chk[i]) {
110 |       printf("Fails at [%ld]: %d but should be %d! ", i, sort[i], chk[i]);
111 |       break;
112 |     }
113 | #endif
114 |     // Time
115 |     U sum=0, best=0;
116 |     for (U r=0; r<iter; r++) {
117 |       memcpy(sort, data+off+r, s); MODIFY(sort);
118 |       U t = monoclock();
119 |       sort32_alloc(sort, n);
120 |       t = monoclock()-t;
121 |       sum += t;
122 |       if (r==0||t<best) best=t;
123 |     }
124 |     printf("best:%7.3f avg:%7.3f ns/v", (double)best/n, (double)sum/(iter*n));
125 |     printf("\n");
126 |   }
127 |   return 0;
128 | }
129 | 


--------------------------------------------------------------------------------
/src/arith.singeli:
--------------------------------------------------------------------------------
 1 | # Special arithmetic functions, implemented with loops
 2 | 
 3 | def floor_log2{n if knum{n}} = {
 4 |   if (n < 2) 0 else 1 + floor_log2{n / 2}
 5 | }
 6 | def floor_log2{n:T, min} = {
 7 |   l:T = min
 8 |   nt := n>>l; while (nt!=0) { ++l; nt >>= 1 }
 9 |   l
10 | }
11 | def floor_log2{n:T} = floor_log2{n, 0}
12 | 
13 | def sqrt_approx{n:T, init} = {
14 |   s:T = init
15 |   @for_const (i to 5) s = (s + n/s) / 2
16 |   s
17 | }
18 | 


--------------------------------------------------------------------------------
/src/base.singeli:
--------------------------------------------------------------------------------
  1 | # Base definitions that determine how "our version" of Singeli works
  2 | 
  3 | # Mostly it looks like C
  4 | include 'arch/c'    # Backend
  5 | include 'skin/c'    # Operators
  6 | include 'skin/cext'
  7 | 
  8 | include 'util/tup'  # List utilities: scan and so on
  9 | include 'util/kind' # Kind-test functions like ktup
 10 | 
 11 | oper <~> swap        infix none  20
 12 | oper <-> swap_ptr    infix none  20
 13 | oper >*  ptr_gt      infix none  20
 14 | oper *?  lazy_load   prefix      60
 15 | oper *?  lazy_load   infix right 50
 16 | oper *+  pnt_each    prefix      60
 17 | oper >   single      prefix      80
 18 | 
 19 | def tptr{T} = 'pointer'===typekind{T}
 20 | 
 21 | def single = match { {{x,..._}}=>x; {x}=>x }
 22 | 
 23 | def can_use_unstable{x} = (not ktup{x}) or 1 == length{x}
 24 | 
 25 | def swap{a, b} = { t:=a; a=b; b=t }
 26 | def swap_ptr{a:*T, b:*T} = { t := *a; a <- *b; b <- t }
 27 | 
 28 | def __pnt{{f,...r}} = __pnt{f}
 29 | def ptr_gt{a,b} = load{a,0} > load{b,0}
 30 | def ptr_lt{a,b} = load{a,0} < load{b,0}
 31 | 
 32 | def usize = primtype{'u',width{*void}}
 33 | def U = usize
 34 | def bytes{t if ktyp{t}} = width{t} / 8
 35 | 
 36 | # Change to unsigned: here we always subtract smaller from greater
 37 | def __sub{a:*P, b:*P}  = emit{usize, 'op -', a, b}
 38 | 
 39 | def addtype{T} = if (tptr{T}) usize else T
 40 | def __add{a:T, b:(u1)} = __add{a, addtype{T}^~b}
 41 | def __sub{a:T, b:(u1)} = __sub{a, addtype{T}^~b}
 42 | 
 43 | def clone{old} = { new:=old }
 44 | 
 45 | def leading_zeros{x:(u64)} = u8 <~ emit{i32, '__builtin_clzll', x}
 46 | 
 47 | def expect{e}{X} = emit{type{X}, '__builtin_expect', X,e}
 48 | def RARE   = expect{0}
 49 | def LIKELY = expect{1}
 50 | 
 51 | # Pervasion
 52 | local include 'util/perv'
 53 | extend perv1{clone}
 54 | extend perv2{load}
 55 | extend (perv{3}){store}
 56 | extend perv2{__add}
 57 | extend perv2{__sub}
 58 | extend perv2{__or }
 59 | extend perv2{__shl}
 60 | extend perv2{__shr}
 61 | extend perv1{__decr}
 62 | def eachrec{f,...a} = { (extend (perv{length{a}}){f}){...a} }
 63 | 
 64 | local def extend ecmp{f,g} = {
 65 |   def f{{a,..._},{b,..._}} = f{a,b}
 66 |   def f{dn==0} = f
 67 |   def f{dn==1} = g
 68 | }
 69 | local def dn_ind = tup{0,1, 3,2, 5,4, 7,6}
 70 | extend ({...f}=>each{ecmp,f,select{f,dn_ind}}){
 71 |   __eq, __ne, __lt, __gt, __le, __ge,
 72 |   ptr_gt, ptr_lt
 73 | }
 74 | 
 75 | def pnt_each{...T} = eachrec{__pnt, ...T}
 76 | 
 77 | # ++{1} is -- and +={1} is -=
 78 | def __incr{dn==0} = __incr
 79 | def __incr{dn==1} = __decr
 80 | 
 81 | # -{1} swaps arguments
 82 | def __sub{dn==0} = __sub
 83 | def __sub{dn==1}{a,b} = b-a
 84 | 
 85 | def type{{s,..._}} = type{s}
 86 | def scaltype = match {
 87 |   {*T} => scaltype{T}
 88 |   {T if ktyp{T}} => T
 89 |   {x} => scaltype{type{x}}
 90 | }
 91 | 
 92 | def isid{ptr} = not tptr{type{ptr}}
 93 | def load{ptr, i if isid{ptr}} = ptr + i
 94 | def store{ptr=='sink', i, val} = val
 95 | def __add{ptr=='sink', i} = 'sink'
 96 | 
 97 | def lazy_load{p} = lazy_load{p,0}
 98 | def lazy_load{p,i} = {
 99 |   def s = load{single{p},single{i}}
100 |   if (not ktup{p}) s; else {
101 |     def ll{p,i}{} = load{p,i}
102 |     def pr = slice{p,1}
103 |     merge{
104 |       tup{s},
105 |       if (ktup{i}) each{ll, pr, slice{i,1}} else each{ll{.,i},pr}
106 |     }
107 |   }
108 | }
109 | def store{p, i, val if kgen{val}} = store{p, i, val{}}
110 | def get{a} = a
111 | def get{a if kgen{a}} = a{}
112 | extend perv1{get}
113 | 
114 | def for{vars,begin,end,iter} = {
115 |   def e = usize^~end
116 |   i := usize^~begin
117 |   while (i < e) {
118 |     iter{i, vars}
119 |     i += 1
120 |   }
121 | }
122 | def for_backwards{vars,begin,end,iter} = {
123 |   i := usize^~end
124 |   def e = usize^~begin
125 |   while (i > e) {
126 |     i -= 1
127 |     iter{i, vars}
128 |   }
129 | }
130 | def for_const{vars,begin,end,iter} = {
131 |   if (begin < end) {
132 |     for_const{vars,begin, end-1, iter}
133 |     iter{end-1, vars}
134 |   }
135 | }
136 | def for_unroll{unr}{vars,begin,end,iter} = {
137 |   def e = usize^~end
138 |   i:usize = begin
139 |   eu := e & ~ usize~~(unr-1)
140 |   while (i < eu) {
141 |     @for_const (j to unr) iter{i+j, vars}
142 |     i += unr
143 |   }
144 |   while (i < e) {
145 |     iter{i, vars}
146 |     i += 1
147 |   }
148 | }
149 | 


--------------------------------------------------------------------------------
/src/common.singeli:
--------------------------------------------------------------------------------
 1 | # Utilities likely to be useful for multiple sorting algorithms
 2 | 
 3 | def map{op, a, b, n} = @for (a over n) op{a,b}
 4 | def map{op, a, b:*T, n} = @for (a, b over n) op{a,b}
 5 | 
 6 | def map{op,     a , {...b}, n} = each{map{op, ., b, n}, a}
 7 | def map{op, {...a}, {...b}, n} = each{map{op, ., ., n}, a, b}
 8 | 
 9 | # memset/memcpy
10 | def set = map{=, ...}
11 | 
12 | def reverse{x:*T, n} = {
13 |   xr := x + n-1
14 |   @for (i to n/2) x+i <-> xr-i
15 | }
16 | 
17 | def filter_neq{dst, src, len, v} = {
18 |   d:=dst; l:=len
19 |   s1:= >src-1; while (l > 0 and s1->l == v) --l
20 |   @for_unroll{8} (src over i to l) {
21 |     d <- src; d += v != >src  # Branchless update
22 |   }
23 |   d
24 | }
25 | 
26 | def findrange{dn, arr, len} = {
27 |   minv:=arr->0; maxv:=minv;
28 |   @for (arr over i from 1 to len) {
29 |     if (arr <{dn} minv) minv=arr
30 |     if (arr >{dn} maxv) maxv=arr;
31 |   }
32 |   tup{minv,maxv}
33 | }
34 | def readrange{arr, len} = tup{arr->(-1), arr->len}
35 | 
36 | def dist{dn} = {
37 |   def dsub{a:T, b:T} = primtype{'u',width{T}} ~~ (b -{dn} a)
38 |   def dsub{U,a,b} = U ^~ dsub{a,b}
39 | }
40 | def dist{...as if 1 < length{as}} = dist{0}{...as}
41 | 
42 | # Try the given types to lower constant overhead on small arrays
43 | def index_options{sort, n:U, test, itypes} = {
44 |   def done = makelabel{}
45 |   def try{C} = { if (C<U and LIKELY{test{n,C}}) { sort{C}; goto{done} } }
46 |   each{try, itypes}
47 |   sort{U}
48 |   setlabel{done}
49 | }
50 | 
51 | # Descending-ness and length of the first run in x
52 | def find_run{dn, x, n:U} = {
53 |   desc := x->0 >{dn} x->1  # Run is descending
54 |   l:U = 2                  # Run length
55 |   def follow_run{cmp} = {
56 |     while (l < n and cmp{x->(l-1), x->l}) ++l
57 |   }
58 |   if (not desc) follow_run{<={dn}} else follow_run{>{dn}}
59 |   tup{desc, l}
60 | }
61 | 


--------------------------------------------------------------------------------
/src/count.singeli:
--------------------------------------------------------------------------------
 1 | local include './prefix'
 2 | 
 3 | def incrp{p} = p <- p->0 + 1
 4 | def widen{v} = promote{width{*void}, v}
 5 | 
 6 | def do_count{val:*T, len:U, count:*U, min:T, zero} = {
 7 |   c := count - widen{min}
 8 |   @for (val over len) {
 9 |     incrp{c + widen{val}}
10 |     if (zero) val=0
11 |   }
12 | }
13 | 
14 | def count_fill{dn, x, n, count, min:T, range:U} = {
15 |   # Count the values
16 |   do_count{x, n, count, min, 0}
17 |   # Write based on the counts
18 |   dst := x; v := if (not dn) min else min + T<~(range-1)
19 |   def for_dn = if (not dn) for else for_backwards
20 |   @for_dn (count over range) {
21 |     set{dst, v, count}; dst += count; ++{dn}v
22 |   }
23 | }
24 | 
25 | def count_sum{dn, x, n, count, min:T, range:U} = {
26 |   # Count, and zero, the array
27 |   do_count{x, n, count, min, 1}
28 | 
29 |   # Write differences to x
30 |   j:U = if (not dn) 0 else range-1   # Index in count
31 |   r := undefined{U}                  # Running total
32 |   while ((r=count->j) == 0) ++{dn}j  # Skip leading 0 counts quickly
33 |   x0 := min + T<~j                   # First result
34 |   while (r < n) { incrp{x+r}; ++{dn}j; r += count->j }
35 | 
36 |   prefix_sum{dn}{x, n, x0}
37 | }
38 | 
39 | # Counting sort of the n values starting at x
40 | def count_sort{dn, x:*T, n:U, aux:*U, min:T, range:U} = {
41 |   set{aux, 0, range}
42 |   if (range < n/8) { # Short range: branching on count is cheap
43 |     count_fill{dn, x, n, aux, min, range}
44 |   } else {
45 |     count_sum{dn, x, n, aux, min, range}
46 |   }
47 | }
48 | 
49 | # Assume full range
50 | def count_sort{dn, x:*T, n:U, aux} = {
51 |   def range = 1<<width{T}
52 |   def min = if (issigned{T}) -range/2 else 0
53 |   count_sort{dn, x, n, *U~~aux, T~~min, U~~range}
54 | }
55 | 


--------------------------------------------------------------------------------
/src/dropsort.singeli:
--------------------------------------------------------------------------------
 1 | # Dropsort! Don't like a value? Kick it out!
 2 | 
 3 | # Drapesort is inspired by drop-merge sort, which cleans up after this
 4 | # crazy strategy by then sorting the dropped values, and merging them
 5 | # back in. Keeping two drop-lists instead of one makes it so the
 6 | # drop-lists maintain and even improve order in the original data.
 7 | 
 8 | # Each time it finds an inversion, it either drops incoming values to
 9 | # the too-low list or previously accepted values to the too-high list.
10 | 
11 | # It's usually stable, but the following sequence breaks stability:
12 | # - Some values initially kept
13 | # - Later higher values dropped
14 | # - Enough low values cause the first set of values to be dropped too,
15 | #   placing them after the higher values
16 | # I haven't found a fix for this other than to prevent dropping before
17 | # the last dropped value in each list, which would prevent the method
18 | # from working on a lot of mostly-ordered lists.
19 | 
20 | def drape_sort{dn, sort}{x:*T, n:U, aux:*T} = {
21 |   lows:U = 0          # Number of low values dropped
22 |   highs:U = n         # n - high values dropped (written backwards)
23 |   i:U = 1             # Index to read
24 |   keep:U = 0          # Index of last kept value (not one past!)
25 |   while (i < n) {
26 |     v := x->i         # Read next value
27 |     u := x->keep      # Latest kept value
28 |     if (u <={dn} v) { # In order?
29 |       ++i; ++keep; x <-{keep} v
30 |     } else {
31 |       # Drop on left or right?
32 |       # We'll choose the side that lets us drop the fewest
33 |       j:U = 1
34 |       vp := v
35 |       def done = makelabel{}
36 |       while (1) {
37 |         vj := undefined{T}
38 |         if (j == n-i or u <={dn} (vj = x->(i+j)) or vp >{dn} vj) { # Drop right
39 |           @for (j) { aux <-{lows} x->i; ++i; ++lows }
40 |           goto{done}
41 |         }
42 |         if (j > keep or x->(keep-j) <={dn} v) { # Drop left
43 |           d := x + (keep-j+1) # Drop from here and replace with x+i
44 |           @for (d over j) { --highs; aux <-{highs} d; d = x->i; ++i }
45 |           goto{done}
46 |         }
47 |         vp = vj
48 |         ++j
49 |       }
50 |       setlabel{done}
51 |     }
52 |   }
53 | 
54 |   # Sort values and merge back into x
55 |   def merge_back{old_end, new_end, drops, drop_len, rev, cmp} = {
56 |     # For stability, reverse and sort instead of sorting down
57 |     if (rev) reverse{drops, drop_len}
58 |     # Values past j are no longer needed, so use them as aux space
59 |     sort{drops, drop_len, x + old_end}
60 | 
61 |     # Now merge
62 |     j := old_end; i := new_end
63 |     # Guarding index j below takes time, so stop before it's needed
64 |     stop:U = 0
65 |     while (stop<drop_len and cmp{drops->stop, x->0}) ++stop
66 |     # Main loop
67 |     @for_backwards (d in drops over _ from stop to drop_len) {
68 |       v:=undefined{T}; jn:=undefined{U}
69 |       while (cmp{d, (v = x->(jn=j-1))}) { --i; x <-{i} v; j=jn }
70 |       --i; x <-{i} d
71 |     }
72 |     # Take care of the part that couldn't be guarded
73 |     if (stop > 0) {
74 |       while (j>0) { --j; --i; x <-{i} x->j }
75 |       set{x, drops, stop}
76 |     }
77 |   }
78 | 
79 |   j := keep + 1   # Number of kept values
80 |   jh := n - lows  # Plus high values once those are merged
81 |   if (j < jh)   merge_back{j, jh, aux+highs, n-highs, 1, <={dn}}
82 |   if (lows > 0) merge_back{jh, n, aux,       lows,    0, < {dn}}
83 | }
84 | 


--------------------------------------------------------------------------------
/src/glide.singeli:
--------------------------------------------------------------------------------
  1 | # Merging methods
  2 | # Merging 4 segments at a time is most efficient, but 3 is used for
  3 | # unbalanced natural runs, and 2 at the end
  4 | # Should probably implement versions where the left or right half is
  5 | # already in place, to avoid the set{} calls below
  6 | def merge_into{dn, dst:*T, src:*T, left:U, n:U} = {
  7 |   parity_merge_any{dn, T}(dst, src, left, n)
  8 | }
  9 | # Merge of 4
 10 | def quad_merge{dn, x:*T, a, b, c, d, aux:*T} = {
 11 |   l := a + b
 12 |   r := c + d
 13 |   merge_into{dn, aux,   x,   a, l}    # Left half
 14 |   merge_into{dn, aux+l, x+l, c, r}    # Right half
 15 |   merge_into{dn, x,     aux, l, l+r}  # Combine
 16 | }
 17 | # Merge of 3, as a function since it's used twice
 18 | fn triple_merge{dn, T}(x:*T, l:U, m:U, r:U, aux:*T) : void = {
 19 |   if (l < r) {
 20 |     h := l + m
 21 |     merge_into{dn, aux, x, l, h}      # Left
 22 |     set{aux+h, x+h, r}                # Right
 23 |     merge_into{dn, x, aux, h, h+r}    # Combine
 24 |   } else {
 25 |     h := m + r
 26 |     set{aux, x, l}                    # Left
 27 |     merge_into{dn, aux+l, x+l, m, h}  # Right
 28 |     merge_into{dn, x, aux, l, l+h}    # Combine
 29 |   }
 30 | }
 31 | 
 32 | # Return a generator that gets depth
 33 | def powersort_init{ns} = {
 34 |   n := u64^~ns
 35 |   scale := (n + ((1<<62) - 1)) / n
 36 |   def merge_tree_depth{mid, nl, nr} = {
 37 |     mid2 := 2 * (u64^~mid)
 38 |     u64^~leading_zeros{(scale * (mid2 - u64^~nl)) ^ (scale * (mid2 + u64^~nr))}
 39 |   }
 40 | }
 41 | 
 42 | fn glide_sort{dn, T, min_run, sort}(x:*T, n:U, aux:*void, aux_bytes:U) : void = {
 43 | 
 44 |   def base_sort{x, n} = sort{x, n, aux, *u8~~stack - *u8~~aux}
 45 |   def merge_2{x, l, r}       = merge_pair{dn, x, l, l+r, *T~~aux}
 46 |   def merge_3{x, l, m, r}    = triple_merge{dn, T}(x, l, m, r, *T~~aux)
 47 |   def merge_4{x, a, b, c, d} = quad_merge{dn, x, a, b, c, d, *T~~aux}
 48 | 
 49 |   # A logical run consists of a length (e.g. nl) and sortedness (sl)
 50 |   # The starting position is maintained separately
 51 |   # Sortedness indicates
 52 |   #   0 if unsorted
 53 |   #   mid for two sorted sequences
 54 |   #   ~0 if sorted (length is also viable but this is easier)
 55 |   def SORTED = (1 << width{U}) - 1
 56 | 
 57 |   # "Run" of length at most r starting at x
 58 |   # Modify r to give the length of the run, and return its sortedness
 59 |   # Always searches until a natural run of length >=min_run is found
 60 |   # This coalesces unsorted runs, never returning two in a row
 61 |   saved_run:U = 0
 62 |   def new_run{x, r:U} = {
 63 |     s:U = 0
 64 |     if (saved_run > 0) {
 65 |       r = saved_run; s = SORTED
 66 |       saved_run = 0
 67 |     } else {
 68 |       i := -(U~~min_run)     # Run start
 69 |       l    := undefined{U}   # Run length
 70 |       desc := undefined{u1}  # Run is descending
 71 |       def no_run = makelabel{}
 72 |       # Leave runs with many equal elements to quicksort
 73 |       def reject_run{p,l} = { f := l/4 + l/16; p->f == p->(l-f) }
 74 |       do {
 75 |         i += min_run
 76 |         if (r-i < min_run) goto{no_run}
 77 |         tup{desc,l} = find_run{dn, x+i, r-i}
 78 |       } while (l < min_run or reject_run{x+i, l})
 79 |       # Haven't jumped to no_run, so we have a run
 80 |       # Backtrack to find the beginning, and reverse it if descending
 81 |       def find_begin{cmp} = {
 82 |         i0 := i
 83 |         while (i > 0 and cmp{x->(i-1), x->i}) --i
 84 |         l += i0 - i
 85 |       }
 86 |       if (not desc) find_begin{<={dn}}
 87 |       else        { find_begin{>{dn}}; reverse{x+i, l} }
 88 |       # Next set of values might be the run or the part after
 89 |       if (i == 0) {
 90 |         r = l; s = SORTED
 91 |       } else {
 92 |         saved_run = l; r = i
 93 |       }
 94 |       setlabel{no_run}
 95 |     }
 96 |     s
 97 |   }
 98 | 
 99 |   def sort_run{x, n, s} = {
100 |     if      (s == 0) base_sort{x, n}
101 |     else if (s < n)  merge_2{x, s, n-s}
102 |   }
103 | 
104 |   # Modify xp, nr/sr to merge in nl/sl on the left
105 |   def logical_merge{xp, nl, sl, nr, sr} = {
106 |     xr := xp
107 |     xp -= nl  # Point to new run start
108 |     if ((sl | sr) == 0) { # Both unsorted (TODO check aux_bytes)
109 |       sr = 0
110 |     } else {
111 |       s_min := sl; if (sr < s_min) s_min = sr
112 |       if (s_min == 0) { # One unsorted
113 |         # Sort the unsorted half if it's under two thirds of total size
114 |         if (sr == 0 and nl > nr/2) { base_sort{xr, nr}; sr = SORTED; s_min = sl }
115 |         if (sl == 0 and nr > nl/2) { base_sort{xp, nl}; sl = SORTED; s_min = sr }
116 |       }
117 |       if (s_min == 0) {
118 |         sr = 0
119 |       } else if (s_min == SORTED) { # Both sorted, now split-sorted
120 |         sr = nl
121 |       } else {
122 |         if      (sr == SORTED) merge_3{xp, sl, nl-sl, nr}
123 |         else if (sl == SORTED) merge_3{xp, nl,        sr, nr-sr}
124 |         else                   merge_4{xp, sl, nl-sl, sr, nr-sr}
125 |         sr = SORTED
126 |       }
127 |     }
128 |     nr += nl
129 |   }
130 | 
131 |   def merge_depth = powersort_init{n}
132 | 
133 |   # The merge stack has a frame for each defined run
134 |   stack_top := *U~~(*u8~~aux + aux_bytes)
135 |   stack := stack_top
136 |   def Num = 0; def Sort = 1; def Depth = 2; def Frame = 3
137 |   def has_stack{} = stack < stack_top
138 |   def merge_p{} = {
139 |     logical_merge{xp, stack->Num, clone{stack->Sort}, np, sp}
140 |     stack += Frame
141 |   }
142 | 
143 |   xp := x
144 |   np := n; sp := new_run{x, np}
145 | 
146 |   xi := x + np; end := x + n
147 |   while (xi < end) {
148 |     # New run: now only used to determine if previous should be merged
149 |     # Next round when it's previous it may be merged
150 |     nn := end - xi; sn := new_run{xi, nn}
151 |     target_depth := merge_depth{xi - x, np, nn}
152 |     xi += nn
153 | 
154 |     # Merge previous run with the ones that want to be deeper than it
155 |     while (has_stack{} and stack->Depth >= target_depth) merge_p{}
156 | 
157 |     # Push to stack; shift new to prev
158 |     stack -= Frame
159 |     stack <-{Num} np; stack <-{Sort} sp; stack <-{Depth} target_depth
160 |     xp += np
161 |     np = nn; sp = sn
162 |   }
163 |   while (has_stack{}) merge_p{}
164 |   sort_run{x, np, sp}
165 | }
166 | 
167 | fn glide_sort{dn, T}(x:*T, n:U, aux:*void, aux_bytes:U) : void = {
168 |   if (n < 16) {
169 |     sort_lt32{dn, x, n, 16}
170 |   } else {
171 |     def flux{...a} = flux_sort{dn, T}(...a)
172 |     glide_sort{dn, T, 256, flux}(x, n, aux, aux_bytes)
173 |   }
174 | }
175 | 


--------------------------------------------------------------------------------
/src/ins.singeli:
--------------------------------------------------------------------------------
 1 | # Guarded or unguarded insertion sort of len values at x
 2 | # Unguarded requires x->(-1) to precede all of these values
 3 | def insertion_sort{dn, x:*T, len:U, guard} = {
 4 |   # First value's already in place; insert the others
 5 |   @for (xi in x over i from 1 to len) {
 6 |     # j moves backward along the array until finding the right spot
 7 |     j := i; jn := i
 8 |     xj := xi
 9 |     while ((not guard or 0<j) and xi <{dn} (xj=x->(jn=j-1))) {
10 |       x <-{j} xj; j=jn  # Move previous value forward
11 |     }
12 |     x <-{j} xi
13 |   }
14 | }
15 | # Default to guarded
16 | def insertion_sort{dn, x:*T, len:U} = insertion_sort{dn, x, len, 1}
17 | 
18 | # Sort an array where indices less than start are already sorted
19 | def insertion_finish{dn, dst, src, start, n} = {
20 |   @for (i from start to n) {
21 |     end := dst + i
22 |     def xi = src*?i
23 |     prev := end - 1
24 |     if (*?prev >{dn} xi) {
25 |       def xi = get{xi}
26 |       if (*?dst >{dn} xi) {
27 |         top := i
28 |         do { end <- prev->0; --end; --prev } while (--top != 0)
29 |         end <- xi
30 |       } else {
31 |         do { end <- prev->0; --end; --prev } while (*?prev >{dn} xi)
32 |         end <- xi
33 |       }
34 |     }
35 |   }
36 | }
37 | 


--------------------------------------------------------------------------------
/src/median.singeli:
--------------------------------------------------------------------------------
 1 | local {
 2 | 
 3 | include './xorshift'
 4 | 
 5 | # Given a pointer and an odd-length tuple of indices, return the index
 6 | # of the median value without moving any values
 7 | def locate_median{src:*T, inds} = {
 8 |   def l = length{inds}
 9 |   def k = l >> 1  # Median is greater than exactly k values
10 | 
11 |   # Count number of comparisons
12 |   # Only l-1 counters: if median comes last it's found by elimination
13 |   def sums = each{{_}=>{t:u8=0}, range{l-1}}
14 | 
15 |   def get{i} = src->select{inds, i}
16 |   def s{i} = select{sums, i}
17 | 
18 |   @for_const (i from 0 to l-1) {
19 |     vi := get{i}
20 |     @for_const (j from i+1 to l) {
21 |       c:u1 = vi > get{j}
22 |       s{i} += c
23 |       if (j < l-1) s{j} += ~c
24 |     }
25 |     if (s{i} == k) return{select{inds, i}}
26 |   }
27 |   select{inds, l-1}
28 | }
29 | fn locate_median_3{T,U}(src:*T, i0:U, i1:U, i2:U) : U = {
30 |   locate_median{src, tup{i0,i1,i2}}
31 | }
32 | fn locate_median_5{T,U}(src:*T, i0:U, i1:U, i2:U, i3:U, i4:U) : U = {
33 |   locate_median{src, tup{i0,i1,i2,i3,i4}}
34 | }
35 | def median_from{medfn}{array:*T, U} = {
36 |   def fun = medfn{T,U}
37 |   {...inds} => fun(array, ...inds)
38 | }
39 | def median3_from = median_from{locate_median_3}
40 | def median5_from = median_from{locate_median_5}
41 | 
42 | } # end local
43 | 
44 | def locate_3_median{array:*T, n:U} = {
45 |   median3_from{array,U}{0,n/2,n-1}
46 | }
47 | 
48 | def locate_3of3_pseudomedian{array:*T, n:U} = {
49 |   q1 := n / 4
50 |   q2 := n / 2
51 |   q3 := n - q1
52 |   def med = median3_from{array, U}
53 |   med{
54 |     med{q1-1, q2-1, q3  },  # 136
55 |     med{ 0  , q2  , q3+1},  # 048
56 |     med{q1  , q2+1, n -1}   # 257
57 |   }
58 | }
59 | 
60 | def locate_5of3_pseudomedian{array:*T, n:U} = {
61 |   def xorshift16 = make_split_xorshift{tup{7,9,8}, n, 63}
62 |   div := n / 16
63 |   def med = median3_from{array, U}
64 |   def get3{f} = med{...xorshift16{clone{f * div}, div}}
65 |   median5_from{array,U}{...each{get3, tup{0,3,7,10,13}}}
66 | }
67 | 


--------------------------------------------------------------------------------
/src/merge.singeli:
--------------------------------------------------------------------------------
  1 | # Merge sorting
  2 | 
  3 | # Parity merge: branchless
  4 | # Main data movement
  5 | def parity_pos{dn, left, right, dst, i} = {
  6 |   l := left->0; r := right->i
  7 |   c := l <={dn} r
  8 |   if (c) r=l; dst <-{i} r
  9 |   right -= c
 10 |   left  += c
 11 | }
 12 | def parity_neg{dn, left, right, dst, i} = {
 13 |   l := left->(-i); r := right->0
 14 |   c := l <={dn} r
 15 |   if (c) l=r; dst <-{-i} l
 16 |   right -= c
 17 |   left  += c
 18 | }
 19 | 
 20 | # Merge halves of length-n array with constant n (4 and 8 used)
 21 | def parity_merge_const{dn, n, dst, src} = {
 22 |   def h = n / 2
 23 |   left := src; right := src + h; dstc := dst
 24 |   @for_const (i to h) parity_pos{dn, left, right, dstc, i}
 25 | 
 26 |   left = src + (h-1); right = src + (n-1); dstc = dst + (n-1)
 27 |   @for_const (i to h) parity_neg{dn, left, right, dstc, i}
 28 | }
 29 | 
 30 | # Branchless merge, combining lengths left and n-left from src to dst
 31 | # Not in-place: dst can't overlap src
 32 | # With guard==0, must have left*2 == n
 33 | # With guard==1,           left == n/2 (n can be odd)
 34 | # With guard==2, any lengths handled
 35 | def parity_merge_any{dn, T} = parity_merge{dn, T, 2}
 36 | fn parity_merge{dn, T, guard}(dst:*+T, src:*+T, left:U, n:U) : void = {
 37 |   parity_merge{dn, guard, dst, src, left, n}
 38 | }
 39 | def parity_merge_fn{dn, guard, dst, src, left, n} = {
 40 |   parity_merge{dn, eachrec{scaltype,dst}, guard}(dst, src, left, n)
 41 | }
 42 | def parity_merge{dn, guard, dst, src, left:U, n:U} = {
 43 |   def handle_any = guard >= 2
 44 |   def handle_odd = guard == 1
 45 | 
 46 |   right := n - left
 47 |   lpos := src     ; rpos := src + left  ; dpos := dst
 48 |   lneg := rpos - 1; rneg := lneg + right; dneg := dst + n - 1
 49 | 
 50 |   def done = makelabel{}
 51 |   half := if (handle_any) n/2 else left
 52 |   if (handle_any) {
 53 |     def cut_sides{{short,shneg,shpos}, {long,loneg,lopos}, cmp} = {
 54 |       ov := long - short
 55 |       # This is faster, but also needed for correctness of the
 56 |       # following parity merge!
 57 |       if (ov > short) {
 58 |         unbalanced_merge{shpos,short, lopos,long, dpos,n, cmp}
 59 |         goto{done}
 60 |       }
 61 |       # If there are ov "outside" elements on either side of the long
 62 |       # side, move them to get a balanced merge
 63 |       # If not, there are long-ov=short "inside" elements that will
 64 |       # merge before the last one on the short side, so it's safe to
 65 |       # perform short+short>half merges in that direction: only the
 66 |       # last can finish the short side
 67 |       def setneg{} = { loneg-=ov; dneg-=ov; set{dneg+1, loneg+1, ov} }
 68 |       def setpos{} = { set{dpos, lopos, ov}; lopos+=ov; dpos+=ov }
 69 |       if      (RARE{ cmp{* >shneg, >lopos->short   }}) setneg{}
 70 |       else if (RARE{~cmp{* >shpos, >loneg->(-short)}}) setpos{}
 71 |       else goto{nocut}
 72 |       half = short
 73 |     }
 74 |     def nocut = makelabel{}
 75 |     def ldata = tup{left ,lneg,lpos}
 76 |     def rdata = tup{right,rneg,rpos}
 77 |     if      (left  < half) cut_sides{ldata, rdata, <={dn}}
 78 |     else if (right < half) cut_sides{rdata, ldata, < {dn}}
 79 |     n = 2*half
 80 |     setlabel{nocut}
 81 |   }
 82 | 
 83 |   @for_unroll{2} (i to half) {
 84 |     parity_pos{dn, lpos, rpos, dpos, i}
 85 |     parity_neg{dn, lneg, rneg, dneg, i}
 86 |   }
 87 |   if (handle_odd and n%2 != 0) {
 88 |     l := lpos->0; r := rpos->half
 89 |     if (lpos > lneg-half) l = r
 90 |     dpos <-{half} l
 91 |   }
 92 |   setlabel{done}
 93 | }
 94 | 
 95 | def unbalanced_merge{shp, short, lop, long, dst, n, lt} = {
 96 |   # Step size on the long side
 97 |   def K = 8
 98 |   # Main loop must stop after placing element short-1 or long-K
 99 |   i:u64 = 0; j:u64 = 0; k:u64 = 0
100 |   while (j+K<=long and i<short) {
101 |     # Merge up to 1 element from the short side and K from long
102 |     u := shp->i # Next short value to place
103 |     c:u64 = 0   # Number of elements u passes--its index if <K
104 |     @for_const (v in lop+j, d in dst+k over K) {
105 |       c += ~lt{u,v}
106 |       d = v
107 |     }
108 |     k += c; j += c
109 |     dst <-{k} u
110 |     a := c<K; i += a; k += a
111 |   }
112 |   # Finish up with a branchy backwards loop
113 |   if (i == 0) {
114 |     u := shp->i
115 |     while (lt{u, lop->j}) { dst<-{k}lop->j; ++j; ++k }
116 |     dst <-{k} u; ++i; ++k
117 |   }
118 |   i = short-1; j = long-1
119 |   @for_backwards (d in dst over _ from k to n) {
120 |     u := shp->i; v := lop->j
121 |     if (not lt{u,v}) { d = u; --i }
122 |     else             { d = v; --j }
123 |   }
124 | }
125 | 
126 | # Merge arrays of length l and n-l starting at a, using buffer aux
127 | # Can be done without moving both sides, but this way's easy
128 | def merge_pair{dn, x, left:U, n:U, aux} = {
129 |   set{aux, x, n}
130 |   parity_merge_fn{dn, 2, x, aux, left, n}
131 | }
132 | 
133 | # Merge array x of size n, if units of length block are pre-sorted
134 | def merge_from{dn, x, n:U, aux, block} = {
135 |   src:=x; dst:=aux
136 |   w:U = block; while (w < n) {
137 |     ww:=2*w
138 |     i:U=0; while (i < n-w) {
139 |       l := n-i; if (l>ww) l=ww
140 |       parity_merge_fn{dn, 2, dst+i, src+i, w, l}
141 |       i += ww
142 |     }
143 |     if (i < n) set{dst+i, src+i, n-i}
144 |     src <~> dst
145 |     w = ww
146 |   }
147 |   if (src != x) set{x, src, n}
148 | }
149 | 
150 | # A bottom-down fast but not very adaptive merge sort
151 | # Requires dst != aux; src may be dst or aux
152 | fn pisort{dn, T}(dst:*+T, src:*+T, n:U, aux:*T) : void = {
153 |   pisort{dn, call{pisort{dn,T}, ...}, dst, src, n, aux}
154 | }
155 | def pisort{dn, recur, dst, src, n:U, aux} = {
156 |   if (n < 32) {
157 |     sort_lt32{dn, dst, src, n}
158 |     return{}
159 |   }
160 | 
161 |   h1 := n / 2; h2 := n - h1
162 | 
163 |   recur{aux,      src,      h1, dst}
164 |   recur{aux + h1, src + h1, h2, dst + h1}
165 | 
166 |   if (aux->(h1-1) <={dn} aux->h1) {
167 |     eachrec{set{., ., n}, dst, aux}
168 |   } else {
169 |     parity_merge_fn{dn, 1, dst, aux, h1, n}
170 |   }
171 | }
172 | 
173 | # Allow index part of src to be just a base index
174 | fn pisort{dn, T, I}(dst:tup{*T,*I}, src:tup{*T,I}, n:U, aux:tup{*T,*I}) : void = {
175 |   pisort{dn, call{pisort{dn,T,I}, ...}, dst, src, n, aux}
176 | }
177 | fn pigrade{dn, T, I}(dst:*I, src:*T, n:U, aux:*void) : void = {
178 |   ai := *I~~aux
179 |   dv := *T~~(ai + n)
180 |   av := dv + n
181 |   pisort{dn, T, I}(tup{dv, dst}, tup{src, 0}, n, tup{av, ai})
182 | }
183 | 


--------------------------------------------------------------------------------
/src/network.singeli:
--------------------------------------------------------------------------------
 1 | # Network sorting: sort by ordering pairs of values in a fixed sequence
 2 | # With a branchless swap the entire sort involves no branching
 3 | # These networks also have a small depth, or number of layers
 4 | # Swaps within a layer are subject to instruction-level parallelization
 5 | 
 6 | # Not stable! Stable networks are mergesort-like and much bigger
 7 | 
 8 | # The networks here are all proven optimal in depth, but the 16-value
 9 | # network is not known to be optimal in number of comparisons
10 | 
11 | # The first layer in each network covers all values, so if source and
12 | # destination differ, move from src to dst in that layer (i_move below)
13 | 
14 | # Branchlessly order values at i and j while moving from src to dst
15 | local def order{dn}{dst, src, i, j} = {
16 |   a := src->i; b := src->j
17 |   x := (a^b) & - type{a}^~(a >{dn} b)
18 |   dst <-{i} x^a
19 |   dst <-{j} x^b
20 | }
21 | # Perform all swaps
22 | local def network_sort{i_move, i_inplace}{dn, dst, src} = {
23 |   def run{s, i} = each{order{dn}{dst, s, ...}, ...flip{i}}
24 |   run{src, i_move}    # First round, move src to dst while swapping
25 |   run{dst, i_inplace} # Now just swap
26 | }
27 | 
28 | # Now we can write our index lists out all pretty
29 | local oper <> tup infix none 90
30 | 
31 | # https://bertdobbelaere.github.io/sorting_networks.html#N4L5D3
32 | def network_sort_4 = network_sort{
33 |   tup{0<>2, 1<>3},
34 |   tup{0<>1, 2<>3,
35 |       1<>2}
36 | }
37 | # https://bertdobbelaere.github.io/sorting_networks.html#N8L19D6
38 | def network_sort_8 = network_sort{
39 |  tup{0<>2, 1<>3, 4<>6, 5<>7},
40 |  tup{0<>4, 1<>5, 2<>6, 3<>7,
41 |      0<>1, 2<>3, 4<>5, 6<>7,
42 |      2<>4, 3<>5, 1<>4, 3<>6,
43 |      1<>2, 3<>4, 5<>6}
44 | }
45 | # https://bertdobbelaere.github.io/sorting_networks.html#N12L39D9
46 | def network_sort_12 = network_sort{
47 |   tup{0<>8, 1<>7, 2<>6, 3<>11, 4<>10, 5<>9},
48 |   tup{0<>2, 1<>4, 3<>5, 6<>8, 7<>10, 9<>11,
49 |       0<>1, 2<>9, 4<>7, 5<>6, 10<>11,
50 |       1<>3, 2<>7, 4<>9, 8<>10,
51 |       0<>1, 2<>3, 4<>5, 6<>7, 8<>9, 10<>11,
52 |       1<>2, 3<>5, 6<>8, 9<>10,
53 |       2<>4, 3<>6, 5<>8, 7<>9,
54 |       1<>2, 3<>4, 5<>6, 7<>8, 9<>10}
55 | }
56 | # https://bertdobbelaere.github.io/sorting_networks.html#N16L60D10
57 | def network_sort_16 = network_sort{
58 |   tup{0<>13, 1<>12, 2<>15, 3<>14, 4<>8, 5<>6, 7<>11, 9<>10},
59 |   tup{0<>5, 1<>7, 2<>9, 3<>4, 6<>13, 8<>14, 10<>15, 11<>12,
60 |       0<>1, 2<>3, 4<>5, 6<>8, 7<>9, 10<>11, 12<>13, 14<>15,
61 |       0<>2, 1<>3, 4<>10, 5<>11, 6<>7, 8<>9, 12<>14, 13<>15,
62 |       1<>2, 3<>12, 4<>6, 5<>7, 8<>10, 9<>11, 13<>14,
63 |       1<>4, 2<>6, 5<>8, 7<>10, 9<>13, 11<>14,
64 |       2<>4, 3<>6, 9<>12, 11<>13,
65 |       3<>5, 6<>8, 7<>9, 10<>12,
66 |       3<>4, 5<>6, 7<>8, 9<>10, 11<>12,
67 |       6<>7, 8<>9}
68 | }
69 | 


--------------------------------------------------------------------------------
/src/partition.singeli:
--------------------------------------------------------------------------------
 1 | # Partition algorithms for quicksort
 2 | 
 3 | # Stable partitioning with external memory
 4 | # Place values v into dst_false or dst_true according to cmp{v, piv}
 5 | # Return the number of true comparisons, or length of dst_true
 6 | def flux_partition{src:*T, cmp, piv:T, dst_true:*T, dst_false:*T, n:U} = {
 7 |   # Number of true comparisons, and index into dst_true
 8 |   l:U = 0
 9 |   dst_f := dst_false
10 |   @for_unroll{8} (src over n) {
11 |     c := cmp{src, piv}
12 |     # Write to both destinations: one will be overwritten
13 |     dst_true <-{ l} src
14 |     dst_f    <-{-l} src; ++dst_f
15 |     l += c
16 |   }
17 |   l
18 | }
19 | 
20 | # Unstable in-place partitioning
21 | # Fulcrum partition works from the outside in, alternating sides as
22 | # necessary to maintain a small gap both on the left and the right.
23 | # It's faster than fluxsort partitioning for large arrays. I believe
24 | # this is because there are two active regions
25 | #   src_left/dst_left, src_right/dst_right
26 | # versus the three in flux_partition
27 | #   src, dst_true, dst_false
28 | # leading to less waiting for the cache
29 | def fulcrum_partition{x:*T, cmp, piv:T, aux:*T, n:U} = {
30 |   # Partition bl elements at a time in the main loop
31 |   def bl = 16
32 |   dst_left  := x;      src_left  := dst_left  + bl
33 |   dst_right := x + n;  src_right := dst_right - bl
34 | 
35 |   # Create the initial gap between src and dst
36 |   set{aux,      dst_left,  bl}
37 |   set{aux + bl, src_right, bl}
38 | 
39 |   # Branchless move, as in flux_partition except left and right parts
40 |   # move in opposite directions
41 |   # The shared index l allows us to not increment dst_left,
42 |   # and unconditionally decrement dst_right
43 |   l:U = 0  # Number of values placed on the left so far
44 |   def put{ptr} = {
45 |     v := *ptr
46 |     c := cmp{v, piv}
47 |     dst_left  <-{l} v; --dst_right
48 |     dst_right <-{l} v
49 |     l += c
50 |   }
51 |   # Perform num moves at once, where num <= bl at every call
52 |   # Since the total gap is 2*bl, one side or the other must have room
53 |   # That's the side we don't partition from!
54 |   def part{for, num} = {
55 |     diff := src_left - dst_left - l  # Gap on left side
56 |     if (diff < bl) { @for (num) { put{src_left};  ++src_left  } }
57 |     else           { @for (num) { --src_right; put{src_right} } }
58 |   }
59 | 
60 |   @for (n / bl - 2) part{for_const, bl} # Main loop, unrolled
61 |   part{for, n % bl}                     # Finish up
62 |   @for (2*bl) { put{aux}; ++aux }       # Partition cleared values
63 |   l
64 | }
65 | 


--------------------------------------------------------------------------------
/src/prefix.singeli:
--------------------------------------------------------------------------------
 1 | # Prefix sums
 2 | # Inclusive prefix sum is used everywhere: simpler to write and faster
 3 | # If dn==1, subtract instead of adding
 4 | 
 5 | local {
 6 | 
 7 | # Prefix sum on the size-wi units of word(s) x of width ww
 8 | # Values in x must be registers: the sum is done in place
 9 | def prefix_word{wi, ww, x} = {
10 |   # Shift amounts, e.g. tup{8, 16, 32}
11 |   def shifts{w} = if (w<ww) merge{w, shifts{2*w}} else tup{}
12 |   # Add each of these in succession, interleaving if x is a tuple
13 |   each{{k} => {x += x<<k}, shifts{wi}}
14 | }
15 | # Add incoming values first
16 | def prefix_word{wi, ww, x, add} = {
17 |   x += add
18 |   prefix_word{wi, ww, x}
19 | }
20 | 
21 | # Basic pattern, mutates sum
22 | def psf{dn, cfor}{x, len, sum} = {
23 |   @cfor (x over len) {
24 |     sum +={dn} x
25 |     x = sum
26 |   }
27 | }
28 | 
29 | # SWAR optimization for 1- or 2-byte counts and no SIMD
30 | def ps_words{x, len, sum, out_fn} = {
31 |   # Width of an element and a word
32 |   def we = width{scaltype{x}}; def ww = width{U}
33 |   # Number of elements in a word, and total words
34 |   def l = ww / we
35 |   nw:U = len / l
36 |   # Word pointer(s)
37 |   def xw = eachrec{~~{*U,.}, x}
38 |   @for (xw over nw) {
39 |     prefix_word{we, ww, sum, xw}
40 |     xw = out_fn{sum}
41 |     sum >>= ww-we
42 |   }
43 |   # Return the number of elements summed; caller handles tail
44 |   l * nw
45 | }
46 | 
47 | } # end local
48 | 
49 | # For counting sort
50 | def prefix_sum{dn}{x:*T, len, init:T} = {
51 |   def we = width{T}
52 |   if (we > 16) {
53 |     psf{dn, for_unroll{4}}{x, len, clone{init}}
54 |   } else {
55 |     sum := U^~primtype{'u',we}~~init
56 |     # Adjust for signed overflow and descending with xor
57 |     def fixsum = if (not (issigned{T} or dn)) { {s}=>s } else {
58 |       def off = (1<<(width{T}-1)) - dn
59 |       sum ^= off
60 |       xor:U = off; prefix_word{we, width{U}, xor}
61 |       {s} => xor ^ s
62 |     }
63 |     # Full words
64 |     lenq := ps_words{x, len, sum, fixsum}
65 |     # Last partial word
66 |     psf{dn, for}{x + lenq, len - lenq, clone{T<~fixsum{sum}}}
67 |   }
68 | }
69 | 
70 | # For radix sort: interleaved sums
71 | def radix_prefix_sum{dn, n, {...ptrs}, len} = {
72 |   def T = scaltype{ptrs}
73 |   if (not dn) slice{ptrs, 1} <- 0
74 |   def sum{S,v} = each{{_}=>{s:S=S<~v}, ptrs}
75 |   if (width{T} > 16) {
76 |     psf{dn, for}{ptrs, len, sum{T, if (not dn) 0 else n}}
77 |   } else {
78 |     def fixsum = if (not dn) { {s}=>s } else {
79 |       nw:U = U^~n; prefix_word{width{T}, width{U}, nw}
80 |       {s} => nw - s
81 |     }
82 |     ps_words{ptrs, len, sum{U,0}, fixsum}
83 |   }
84 | }
85 | 


--------------------------------------------------------------------------------
/src/quicksort.singeli:
--------------------------------------------------------------------------------
  1 | include './partition'
  2 | include './median'
  3 | include './xorshift'
  4 | include './arith'  # Logs and square roots
  5 | 
  6 | def base_cases{dn, dst:*T, src:*T, aux:*T, n:U, aux_bytes:U, min:T} = {
  7 |   # Short array
  8 |   if (n < 192) {
  9 |     pisort{dn, T}(dst, src, n, aux)
 10 |     return{}
 11 |   }
 12 | 
 13 |   # Distribution base cases
 14 |   if (isint{T}) {
 15 |     max := dst->n
 16 |     range := dist{dn}{min, max}
 17 |     def nmin = if (not dn) min else max
 18 |     if (U^~(range/4) < n and U^~range < aux_bytes/bytes{U} and range < (1<<18)) {
 19 |       # Always sort in place on dst
 20 |       # Counting sort could have a different src/dst, but if src isn't
 21 |       # dst then it's equal to aux, and we need that space
 22 |       if (src != dst) set{dst, src, n}
 23 |       count_sort{dn, dst, u32<~n, *u32~~aux, nmin, u32<~range + 1}
 24 |       return{}
 25 |     }
 26 |     if (width{T} == 32 and n <= (1<<16) and range < (1<<16)) {
 27 |       radpack32{dn}(*u32~~dst, *u32~~src, u32<~n, *void~~aux, u32~~nmin)
 28 |       return{}
 29 |     }
 30 |   }
 31 | }
 32 | 
 33 | # Robin Hood approval state
 34 | def RH_UNTRIED = 0
 35 | def RH_APPROVED = 2
 36 | 
 37 | # No direction: doesn't affect median; built into sort{} and proc_pivots{}
 38 | def get_pivot{array:*T, n:U, getaux, sort, proc_pivots, rh_state} = {
 39 |   # log_2 of minimum size for sampling
 40 |   sl0:U = 8
 41 |   # Output array and index
 42 |   arr:=array; ind:U = 0
 43 |   if (rh_state!=RH_UNTRIED and n <= 1024) {
 44 |     ind = locate_3of3_pseudomedian{array, n}
 45 |   } else if (rh_state!=RH_UNTRIED and n <= 1 << (sl0 = 14)) {
 46 |     ind = locate_5of3_pseudomedian{array, n}
 47 |   } else {
 48 |     aux := getaux{}
 49 |     # gap is the expected distance between adjacent samples
 50 |     # We'll get about n/gap samples
 51 |     log2:U = floor_log2{n, sl0}
 52 |     gap_min := 1 << (log2 / 2)
 53 |     gap := sqrt_approx{n, gap_min}
 54 | 
 55 |     # Collect samples with split xorshift and add to aux
 56 |     aux1 := aux
 57 |     def add{ind} = { aux1 <- array->ind; ++aux1 }
 58 |     mask := gap_min - 1
 59 |     def add3 = make_split_xorshift{tup{13,17,5}, n, mask, add}
 60 | 
 61 |     i:U = 0; while (i < n - (mask + 2 * gap)) add3{i, gap}
 62 |     ns := aux1 - aux
 63 |     sort{aux, ns, *void~~aux1, ns*bytes{T}}
 64 |     proc_pivots{aux, ns}
 65 |     arr = aux
 66 |     ind = ns / 2
 67 |   }
 68 |   arr -> ind
 69 | }
 70 | 
 71 | # Fluxsort recurrence: partition, handle right then left side
 72 | # src may be equal to dst or aux
 73 | def flux_recur{dn, recur, tailcall, piv:T, src:*T, dst:*T, aux:*T, n:U, aux_bytes:U, min:T, rh_state:(u8)} = {
 74 | 
 75 |   # Partition: left side directly in dst with length l, includes pivots
 76 |   l:U = 0
 77 |   rsrc := aux
 78 |   if (LIKELY{n < 1<<14}) {
 79 |     l = flux_partition{src, <={dn}, piv, dst, aux, n}
 80 |   } else {
 81 |     # Previous partitions must have used fulcrum too, so we haven't yet
 82 |     # touched aux, and src==dst
 83 |     l = fulcrum_partition{dst, <={dn}, piv, aux, n}  # (unstable)
 84 |     rsrc = dst + l
 85 |   }
 86 |   r := n - l  # Values on the right
 87 |   m := l      # Values not on the right
 88 | 
 89 |   # Never recurse on a partition with more than this many values
 90 |   most := n - n/16
 91 | 
 92 |   # If most values end up on the left, they're probably mostly pivots
 93 |   if (l > most) {
 94 |     # Partition again with pivots on the right
 95 |     # This bounds performance by O(k*n) for only k unique values
 96 |     if (can_use_unstable{dst}) {
 97 |       l = filter_neq{dst, dst, m, piv} - dst
 98 |       set{dst+l, piv, m-l}
 99 |     } else {
100 |       l = flux_partition{dst, <{dn}, piv, dst, aux+r, m}
101 |       set{dst+l, aux+r, m-l} # Should probably write a reverse partition to avoid this
102 |     }
103 |   }
104 | 
105 |   # Sort the right-hand side, moving it from rsrc to dst
106 |   rdst := dst + m
107 |   if (r > most) { # Unbalanced
108 |     pisort{dn, T}(rdst, rsrc, r, aux)
109 |   } else {
110 |     recur{rsrc, rdst, aux, r, aux_bytes, piv, rh_state}
111 |   }
112 | 
113 |   if (l > most) { # Unbalanced
114 |     pisort{dn, T}(dst, dst, l, aux)
115 |     return{}
116 |   }
117 | 
118 |   # Left-hand side by tail call
119 |   src = dst
120 |   n = l
121 | }
122 | 
123 | fn flux_loop{dn, T}(src:*T, dst:*T, aux:*T, n:U, aux_bytes:U, min:T, rh_state:u8) : void = {
124 |   while (u1~~1) {
125 |     base_cases{dn, dst, src, aux, n, aux_bytes, min}
126 | 
127 |     # Find pivot and check for RH sorting
128 |     def getaux{} = { a:=aux; if (a==src) a=dst; a }
129 |     def proc_pivots{pivots, num} = {
130 |       if (isint{T} and n<=1<<17 and rh_state!=RH_APPROVED) {
131 |         # 1 if tried and not approved, 2 for RH_APPROVED
132 |         rh_state = 1 + u8^~checkdist{dn, pivots, num, min, n, dist{dn}{U, min, dst->n}}
133 |       }
134 |     }
135 |     def fun{gen} = call{gen{dn, T}, ...}
136 |     piv := get_pivot{src, n, getaux, fun{flux_sort}, proc_pivots, rh_state}
137 |     if (isint{T} and rh_state == RH_APPROVED) {
138 |       try_rh{dn, dst, src, n, aux, aux_bytes, min, dst->n}
139 |     }
140 | 
141 |     # Then finish sorting; loop around for tail call
142 |     flux_recur{dn, fun{flux_loop},1, piv, src,dst,aux,n,aux_bytes, min,rh_state}
143 |   }
144 | }
145 | 
146 | fn flux_sort{dn, T}(x:*T, n:U, aux:*void, aux_bytes:U) : void = {
147 |   if (n <= 192) {
148 |     pisort{dn, T}(x, x, n, *T~~aux)
149 |     return{}
150 |   }
151 | 
152 |   # Find the minimum value and index of last maximum
153 |   def block = 1024
154 |   min := x->0; max := min
155 |   i:U = 0; imax := i
156 |   do {
157 |     i0 := i
158 |     i += block; if (i > n) i = n
159 |     blm := x->i0
160 |     @for (x over j from i0 to i) {
161 |       if (x <{dn} min) min=x
162 |       if (x >{dn} blm) blm=x
163 |     }
164 |     if (blm >={dn} max) { max=blm; imax=i; } # Save block index; refine later
165 |   } while (i < n)
166 | 
167 |   do { --imax } while (x->imax <{dn} max)
168 |   x <-{imax} x->(n-1)  # (unstable)
169 |   x <-{n-1} max
170 | 
171 |   # Now sort
172 |   flux_loop{dn, T}(x, x, *T~~aux, n-1, aux_bytes, min, 0)
173 | }
174 | 


--------------------------------------------------------------------------------
/src/radix.singeli:
--------------------------------------------------------------------------------
  1 | # LSD Radix sort (includes bucket sort as the 1-step case)
  2 | # Sorts the array according to the least significant byte, then the
  3 | # next higher, and so on.
  4 | # It has the best "generic" performance of any algorithm here, but has
  5 | # no adaptivity and in fact slows down on some fairly common patterns
  6 | # due to cache associativity. So while the implementation here is
  7 | # general, only the 1- and 2-byte forms are used for hybrid sorting.
  8 | 
  9 | local {
 10 | 
 11 | include './prefix'
 12 | 
 13 | # Fix the radix at 1 byte because other widths take too much computation
 14 | def radix_bits = 8
 15 | def count_len  = 1<<radix_bits
 16 | 
 17 | # Number of steps required
 18 | def getsteps{x} = width{scaltype{x}} / radix_bits
 19 | 
 20 | # Apply f to the last component of the list if T is signed
 21 | def where_signed{T, f, list} = {
 22 |   if (not issigned{T}) list
 23 |   else { def {...head, tail} = list; tup{...head, f{tail}} }
 24 | }
 25 | 
 26 | # Given a tuple, get corresponding key generators: bytes starting from
 27 | # the bottom
 28 | def keyfns{len, T} = {
 29 |   each{
 30 |     {sh, K} => { {v} => K<~(v>>sh) },
 31 |     range{len} * radix_bits,                # Shifts
 32 |     where_signed{T, {_}=>i8, copy{len, u8}} # Types
 33 |   }
 34 | }
 35 | 
 36 | # Sort the n values in x with a number of radix passes equal to steps
 37 | # Store counts in count, which must hold steps*count_len counts
 38 | # The result for step i is stored at select{dsts,i}
 39 | def radix_main{dn, src, n:U, dsts, count:*C} = {
 40 |   def T = scaltype{src}
 41 |   def steps = length{dsts}
 42 |   # Tuple of zeroed count arrays, and offset for signed ints
 43 |   def counts = init_counts{count, steps}
 44 |   def counts_off = where_signed{T, {c}=>c+128, counts}
 45 |   # Count frequency of all bytes simultaneously
 46 |   def keys = keyfns{steps, T}
 47 |   radix_counts{dn, single{src}, n, counts_off, keys}
 48 |   # Exclusive sum of each count array (interleaved for speed)
 49 |   radix_prefix_sum{dn, n, counts, count_len}
 50 |   # And do the radix sorting through the successive dsts
 51 |   def srcs = shiftright{tup{src}, dsts}
 52 |   each{radix_move{n}, srcs, dsts, counts_off, keys}
 53 | }
 54 | 
 55 | # Zero and split into num arrays with count_len values each
 56 | def init_counts{space:*T, num} = {
 57 |   set{space, 0, num*count_len}
 58 |   scan{+, merge{space, copy{num-1, count_len}}}
 59 | }
 60 | 
 61 | # Perform all counts in a single pass
 62 | def radix_counts{dn, x, n, counts, keys} = {
 63 |   # Rather than take an exclusive prefix sum in the ascending case, we
 64 |   # write counts at an offset of 1. The descending case subtracts
 65 |   # cumulative counts from n, so they should be inclusive with no offset
 66 |   def counts_shift = if (not dn) counts + 1 else counts
 67 |   @for (x over n) {
 68 |     def incr{c, k} = incrp{c + k{x}}
 69 |     each{incr, counts_shift, keys}
 70 |   }
 71 | }
 72 | 
 73 | # One step of radix sorting
 74 | def radix_move{n}{src, dst, count, key} = {
 75 |   @for (src over n) {
 76 |     def k = key{single{src}}
 77 |     c := count->k
 78 |     dst <-{c} src
 79 |     count <-{k} c+1
 80 |   }
 81 | }
 82 | 
 83 | } # end local
 84 | 
 85 | # Swap back and forth with aux, ending with the elements in x again
 86 | def radix_inplace{dn, x:*T, n:U, aux:*T, count:*C} = {
 87 |   def steps = getsteps{x}
 88 |   radix_main{dn, x, n, cycle{steps,tup{aux,x}}, count}
 89 |   if (steps % 2) set{x, aux, n}
 90 | }
 91 | 
 92 | def radix_grade_inplace{dn, x:*T, xa, g:*I, ga, n:U, count:*C} = {
 93 |   def steps = getsteps{x}
 94 |   def c{...p} = cycle{steps, p}
 95 |   def dsts = flip{tup{
 96 |     shiftleft{c{x,xa}, 'sink'}, # Start at xa; ignore last
 97 |     reverse  {c{g,ga}}          # End at g
 98 |   }}
 99 |   radix_main{dn, tup{x,scaltype{*I}~~0}, n, dsts, count}
100 | }
101 | 
102 | # Radix sorting works even for a count one larger than the type maximum:
103 | # the count array can overflow, but after taking an exclusive sum it
104 | # only affects portions after the last element (that is, target indices
105 | # can't overflow)
106 | def radix_fits{dn}{n, T if isint{T} and not issigned{T}} = {
107 |   # But the descending SWAR code in radix_prefix_sum fails at that length
108 |   def adjust = dn and width{T}<=16
109 |   n <= 1<<width{T} - adjust
110 | }
111 | 
112 | # Radix sort on arbitrary types
113 | def radix{dn, x:*T, n:U, aux0} = {
114 |   def aux = *T~~aux0
115 |   count := aux+n
116 |   def rad{C} = radix_inplace{dn, x, n, aux, *C~~count}
117 |   index_options{rad, n, radix_fits{dn}, tup{u8, u32}}
118 | }
119 | 
120 | # Sort 32-bit data with a 16-bit range
121 | # In place except the counts in aux
122 | fn radpack32{dn}(dst:*u32, src:*u32, n:u32, aux:*void, min:u32) : void = {
123 |   x16 := *u16 ~~ dst
124 |   @for (x16, src over n) x16 = u16<~(src - min)
125 |   def rad{C} = radix_inplace{dn, x16, n, x16 + n, *C~~aux}
126 |   index_options{rad, n, radix_fits{dn}, tup{u8, u32}}
127 |   @for_backwards (x16, dst over n) dst = min + u32^~x16
128 | }
129 | 


--------------------------------------------------------------------------------
/src/rh.singeli:
--------------------------------------------------------------------------------
  1 | # Robin Hood Sort
  2 | # https://github.com/mlochbaum/rhsort
  3 | # An algorithm for data with an approximately uniform distribution,
  4 | # which is very fast on uniform random data and even faster on more even
  5 | # distributions such as shuffled arithmetic progressions
  6 | 
  7 | # Minimum size to steal from buffer
  8 | def BLOCK = 16
  9 | # Initial barrier before stealing anything
 10 | def thresh_init = 2*BLOCK
 11 | 
 12 | # Main Robin Hood insertion algorithm
 13 | # Fill and empty the buffer. Doesn't include merges.
 14 | def rh_insert{dn, x, len:U, buf, buflen, pos, empty:T} = {
 15 |   set{>buf, empty, buflen}
 16 | 
 17 |   # Stolen blocks go to xb
 18 |   xb := x
 19 |   threshold:U = thresh_init
 20 | 
 21 |   # Main loop: insert array entries into buffer
 22 |   @for (x over i from 0 to len) {
 23 |     j:U = pos{x}
 24 |     def h = buf*?j
 25 |     if (LIKELY{>h==empty}) {
 26 |       # Easy insert
 27 |       buf <-{j} x
 28 |     } else {
 29 |       # Collision
 30 |       end := insertval{dn, j, x, h, buf, empty}
 31 |       # Big collision
 32 |       if (RARE{end-j >= threshold}) {
 33 |         threshold = BLOCK
 34 |         stealblocks{j, end, buf, xb, pos, empty}
 35 |       }
 36 |     }
 37 |   }
 38 | 
 39 |   # Move all values from the buffer back to the array
 40 |   xt := filter_neq{xb, buf, buflen, empty}
 41 |   # Recover sentinel elements based on total count
 42 |   if (can_use_unstable{xt}) set{xt, empty, (x+len)-xt}
 43 | 
 44 |   xb
 45 | }
 46 | 
 47 | # Insert an element val to buf at init (cur := buf->init)
 48 | # Return the location after the last value moved
 49 | def insertval{dn, init:U, val, h, buf, empty:T} = {
 50 |   ins:=init; end:=init
 51 |   cur := get{h}
 52 |   # Reposition elements after val branchlessly during the search
 53 |   do {
 54 |     ++end; n := buf->end   # Might write over this
 55 |     def c = val >={dn} cur # If we have to move past that entry
 56 |     buf <-{end-c} cur      # Reposition cur
 57 |     ins += c               # Increments until val's final location found
 58 |     cur = n
 59 |   } while (>cur != empty)  # Until the end of the chain
 60 |   buf <-{ins} val
 61 |   1+end   # Account for just-inserted val
 62 | }
 63 | 
 64 | def stealblocks{start:U, end:U, buf, dst, pos, empty:T} = {
 65 |   def j = start
 66 |   # Find the beginning of the chain (required for stability)
 67 |   while (j>0 and >buf->(j-1)!=empty) --j
 68 |   # Move as many blocks from it as possible
 69 |   hj := buf+j; hf := buf+end
 70 |   while (hj <= hf-BLOCK) {
 71 |     @for (hj, dst over BLOCK) { dst = hj; >hj = empty }
 72 |     hj += BLOCK; dst += BLOCK
 73 |   }
 74 |   # Leftover elements might have to move backwards
 75 |   pr:U = j
 76 |   while (hj < hf) {
 77 |     e := *hj; hj <- empty; ++hj
 78 |     p:=pos{e}; if (p>pr) pr = p
 79 |     buf <-{pr} e; ++pr
 80 |   }
 81 | }
 82 | 
 83 | # Get value to index conversion
 84 | # Modifies r!
 85 | def getposfn{dn, minv, n:U, r} = {
 86 |   sh:U = 0                              # Contract to fit range
 87 |   while (r>5*n) { ++sh; r=r>>1 }        # Shrink to stay at O(n) memory
 88 |   {v} => dist{dn}{U,minv,>v} >> sh
 89 | }
 90 | 
 91 | # Statistical check of samples to make sure it's not too clumpy
 92 | def checkdist{dn, sample, num:U, pos} = {
 93 |   prev := pos{sample->0}
 94 |   score:U = 0
 95 |   threshold := 60 + num/6
 96 |   good:u1 = 0  # result
 97 |   def bad = makelabel{}
 98 |   @for (sample over _ from 1 to num) {
 99 |     next:=pos{sample}; d:=next-{dn}prev; prev=next
100 |     if (d<16) { score+=16-d; if (score >= threshold) goto{bad} }
101 |   }
102 |   good = 1
103 |   setlabel{bad}
104 |   good
105 | }
106 | def checkdist{dn, sample, num:U, minv:T, n:U, r} = {
107 |   def pos = getposfn{dn, minv, n, clone{r}}
108 |   checkdist{dn, sample, num, pos}
109 | }
110 | 
111 | def rh_main{dn, alloc, check}{x, n:U, minv:T, maxv, r} = {
112 |   def pos = getposfn{dn, minv, n, r}
113 | 
114 |   # Goes down to BLOCK once we know we have to merge
115 |   sz:U = r + thresh_init                # Buffer size
116 |   aux := alloc{sz}
117 |   check{pos}
118 | 
119 |   # Treat maxv as "empty": the buffer will swallow these,
120 |   # but they can be recovered by counting
121 |   empty := maxv
122 |   xb := rh_insert{dn, x, n, aux, sz, pos, empty}
123 | 
124 |   # Merge stolen blocks back in if necessary
125 |   l:U = U<~(>xb - >x)   # Size of those blocks
126 |   if (l > 0) {
127 |     # Sort x[0..l]
128 |     merge_from{dn, x, l, aux, BLOCK}
129 |     # And merge with the rest of x
130 |     merge_pair{dn, x, l, n, aux}
131 |   }
132 | }
133 | 
134 | # Sort array of ints with length n.
135 | # Assume there's enough aux space.
136 | fn rh_sort{dn, T, range}(x:*T, n:U, aux:*T) : void = {
137 |   # Find the range.
138 |   {minv, maxv} := range{dn, x, n}
139 |   r:U = dist{dn}{U, minv, maxv}        # Size of range
140 |   if (r/4 < n) {
141 |     count_sort{dn, x, n, *U~~aux, if (not dn) minv else maxv, r+1}
142 |   } else {
143 |     rh_main{dn, {sz}=>aux, {pos}=>1}{x, n, minv, maxv, r}
144 |   }
145 | }
146 | 
147 | def try_rh{dn, dst:*T, src:*T, n:U, aux:*T, aux_bytes:U, minv, maxv} = {
148 |   na := aux_bytes / bytes{T}
149 |   def exit = makelabel{}
150 |   def req{cond} = { if (not cond) goto{exit} }
151 |   req{n <= 1<<16}              # Partitioning is better
152 |   req{2*n+n/2 <= na}           # Not enough space, quick rejection
153 | 
154 |   def alloc{sz} = {
155 |     req{sz<=na}
156 |     if (dst != src) set{dst, src, n} # Finally we commit and clear up aux
157 |     aux
158 |   }
159 |   r:U = dist{dn}{U, minv, maxv}
160 |   rh_main{dn, alloc, {pos}=>1}{dst, n, minv, maxv, r}
161 |   
162 |   return{}
163 |   setlabel{exit}
164 | }
165 | 


--------------------------------------------------------------------------------
/src/small.singeli:
--------------------------------------------------------------------------------
 1 | # Sorting small arrays, less than 32 elements
 2 | # Follows quadsort: https://github.com/scandum/quadsort
 3 | 
 4 | # The general strategy is to sort an initial portion (at least half the
 5 | # array) with a power-of-two length using merge sort, then add the rest
 6 | # of the elements with insertion sort.
 7 | # These methods are mostly branchless, with some shortcuts that make
 8 | # them adapt slightly to already-sorted data
 9 | 
10 | local include './network'
11 | 
12 | # Sort exactly 2 elements branchlessly
13 | def sort_2{dn, dst, src} = {
14 |   c := src >*{dn} src+1
15 |   t := src->(~c)
16 |   dst <-{0} src->c
17 |   dst <-{1} t
18 |   c
19 | }
20 | def sort_2{dn, ptr} = sort_2{dn, ptr, ptr}
21 | 
22 | # 0 to 3 elements, could be considered a bubble sort or insertion sort
23 | def sort_lt4{dn, dst, src, n} = {
24 |   def mov = not is{dst,src}
25 |   if (n > 1) {
26 |     sort_2{dn, dst, src}
27 |     if (n > 2) {
28 |       if (mov) dst <-{2} src->2
29 |       sort_2{dn, dst+1}
30 |       sort_2{dn, dst}
31 |     }
32 |   } else if (mov and n == 1) {
33 |     dst <- src->0
34 |   }
35 | }
36 | 
37 | # The adaptive quad swap
38 | def sort_4_quad{dn, dst, src} = {
39 |   def s22{dst, src} = @for_const (i to 2) sort_2{dn, dst+2*i, src+2*i}
40 |   s22{dst, src}
41 |   if (sort_2{dn, dst+1}) {
42 |     s22{dst, dst}
43 |     sort_2{dn, dst+1}
44 |   }
45 | }
46 | 
47 | # Specialized parity merging for 8 or 16 elements
48 | # Always use 2 rounds of merges to get from ptr to swap and back
49 | def sort_8_16_parity{n, sort_q, merge_h}{dn, dst, src} = {
50 |   def T = scaltype{dst}
51 |   # n elements of swap space
52 |   def swap = eachrec{{p:*T}=>{s:*T = copy{n,0}}, dst}
53 | 
54 |   # Sort groups of 2 or 4 elements
55 |   def q = n/4
56 |   @for_const (i to 4) sort_q{dn, dst + q*i, src + q*i}
57 | 
58 |   # Check to see if these groups need to be merged at all
59 |   def chk{i} = dst+(i*q - 1) >*{dn} dst+(i*q)
60 |   if (chk{1} or chk{2} or chk{3}) {
61 |     # Two rounds of merging: dst to swap in two parts, then back
62 |     def h = n/2
63 |     @for_const (i to 2) parity_merge_const{dn, h, swap + h*i, dst + h*i}
64 |     merge_h{dn, dst, swap}
65 |   }
66 | }
67 | def sort_8_parity = sort_8_16_parity{
68 |   8, sort_2,
69 |   {dn,d,s}=>parity_merge_const{dn, 8, d, s}
70 | }
71 | def sort_16_parity = sort_8_16_parity{
72 |   16, sort_4_quad,
73 |   {dn,d,s}=>parity_merge_fn{dn, 0, d, s, 8, 16}
74 | }
75 | 
76 | def sort_lt32{dn, dst, src, n:U, max} = {
77 |   def un = can_use_unstable{src} and isint{scaltype{src}}
78 |   def use{l, sort_un, sort_stable} = {
79 |     (if (un) sort_un else sort_stable){dn, dst, src}
80 |     def cpy{dst,src} = {
81 |       if (isid{src} or (not dst===src and dst!=src)) {
82 |         @for (dst,src over _ from l to n) dst=src
83 |       }
84 |     }
85 |     eachrec{cpy, dst, src}
86 |     insertion_finish{dn, dst, dst, l, n}
87 |   }
88 |   if      (max <=  4 or n <  4) sort_lt4{dn, dst, src, n}
89 |   else if (max <=  8 or n <  8) use{ 4, network_sort_4,  sort_4_quad}
90 |   else if (max <= 16 or n < 16) use{ 8, network_sort_8,  sort_8_parity}
91 |   else                          use{16, network_sort_16, sort_16_parity}
92 | }
93 | def sort_lt32{dn, dst, src, n:U} = sort_lt32{dn, dst, src, n, 32}
94 | def sort_lt32{dn, ptr, n:U, max if knum{max}} = sort_lt32{dn, ptr, ptr, n, max}
95 | def sort_lt32{dn, ptr, n:U} = sort_lt32{dn, ptr, ptr, n}
96 | 


--------------------------------------------------------------------------------
/src/sort.singeli:
--------------------------------------------------------------------------------
 1 | # Main file that defines the exported sorting algorithms
 2 | 
 3 | include './base'
 4 | include './common'
 5 | include './ins'
 6 | include './merge'
 7 | include './count'
 8 | include './radix'
 9 | include './rh'
10 | include './small'
11 | include './quicksort'
12 | include './glide'
13 | 
14 | # For 1-byte inputs, counting sort is best except at small sizes
15 | fn sort8{dn, T}(x:*T, n:U, aux:*void) : void = {
16 |   if (n < 16) {
17 |     sort_lt32{dn, x, n, 16}
18 |   } else if (radix_fits{dn}{n,u8}) {
19 |     radix_inplace{dn, x, n, *T~~aux, *u8~~aux + n}
20 |   } else {
21 |     count_sort{dn, x, n, aux}
22 |   }
23 | }
24 | 
25 | # For 2-byte inputs, use radix or counting sort
26 | fn sort16{dn, T}(x:*T, n:U, aux:*void) : void = {
27 |   if (n < 24) {
28 |     sort_lt32{dn, x, n}
29 |   } else if (n < 1<<15) {
30 |     radix{dn, x, n, aux}
31 |   } else {
32 |     count_sort{dn, x, n, aux}
33 |   }
34 | }
35 | 
36 | # Reasonable for both 1-byte and 2-byte inputs
37 | fn grade8_16{dn, T,I}(g:*I, x:*T, n:U, aux:*void) : void = {
38 |   if (n < 16) {
39 |     sort_lt32{dn, tup{x,g}, tup{x,I~~0}, n, 16}
40 |   } else if (width{T} == 8) {
41 |     # Bucket sort; final x isn't needed and initial g is implicit
42 |     radix_grade_inplace{dn, x, 0, g, 0, n, *U~~aux}
43 |   } else {
44 |     xa := *T~~aux
45 |     ga := *I~~(xa+n)
46 |     aa := *U~~(ga+n)
47 |     radix_grade_inplace{dn, x, xa, g, ga, n, aa}
48 |   }
49 | }
50 | 
51 | local def up = 0
52 | local def down = 1
53 | 
54 | export{'sort8',  sort8 {up, i8}}
55 | export{'sort16', sort16{up, i16}}
56 | 
57 | export{'grade8_64',  grade8_16{up, i8 , u64}}
58 | export{'grade16_64', grade8_16{up, i16, u64}}
59 | 
60 | export{'rhsort32', rh_sort{up, i32, findrange}}
61 | 
62 | export{'sort32',   glide_sort{up, i32}}
63 | export{'sort_u64', glide_sort{up, u64}}
64 | export{'sort_f64', glide_sort{up, f64}}
65 | 


--------------------------------------------------------------------------------
/src/xorshift.singeli:
--------------------------------------------------------------------------------
 1 | # Xorshift generator, e.g. s^=s<<13; s^=s>>17; s^=s<<5;
 2 | # All three steps for one value is expensive, so use intermediates too.
 3 | # One run of the result calls action{} on and outputs 3 values
 4 | def make_split_xorshift{shifts, s, m, action} = {
 5 |   seed := u32<~s; def doseed{op,a}{} = {seed ^= op{seed,a}}
 6 |   mask := u32<~m
 7 |   def updates = each{doseed, tup{<<,>>,<<}, shifts}
 8 |   {start:U, inc:U} => {
 9 |     def run{upd} = {
10 |       def r = action{clone{start + U^~(seed & mask)}}
11 |       upd{}
12 |       start += inc
13 |       r
14 |     }
15 |     each{run, updates}
16 |   }
17 | }
18 | def make_split_xorshift{sh,s,m} = make_split_xorshift{sh,s,m,{v}=>v}
19 | 


--------------------------------------------------------------------------------