├── LICENSE ├── README.md ├── bench.c ├── compiled └── sort.c └── src ├── arith.singeli ├── base.singeli ├── common.singeli ├── count.singeli ├── dropsort.singeli ├── glide.singeli ├── ins.singeli ├── median.singeli ├── merge.singeli ├── network.singeli ├── partition.singeli ├── prefix.singeli ├── quicksort.singeli ├── radix.singeli ├── rh.singeli ├── small.singeli ├── sort.singeli └── xorshift.singeli /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2022 Marshall Lochbaum 2 | 3 | Permission to use, copy, modify, and/or distribute this software for any 4 | purpose with or without fee is hereby granted. 5 | 6 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH 7 | REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY 8 | AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, 9 | INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM 10 | LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR 11 | OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR 12 | PERFORMANCE OF THIS SOFTWARE. 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Singeli sort 2 | 3 | Algorithms in [Singeli](https://github.com/mlochbaum/Singeli), aiming for high performance and broad adaptivity for sorting CPU-native types (integers and floats). A secondary goal is a well-commented and readable codebase that explains what various methods are good for and how they're implemented to use hardware to its full potential. Will probably end up using SIMD if available to speed up a few things, but this is primarily intended to be a portable rather than SIMD-first sort. 4 | 5 | Compile with (add `-t cpp` for C++ compatibility): 6 | 7 | singeli src/sort.singeli -o sort.c 8 | 9 | Or the following without a Singeli install. CBQN builds on Unix-like systems (including macOS and WSL) in under a minute; see [docker build](https://github.com/vylsaz/cbqn-win-docker-build) for Windows. There's also a pre-compiled copy of sort.c in `compiled/sort.c`. It may not always be up to date. 10 | 11 | git clone https://github.com/dzaima/CBQN.git 12 | cd CBQN && make && cd - 13 | git clone https://github.com/mlochbaum/Singeli.git 14 | CBQN/BQN Singeli/singeli src/sort.singeli -o sort.c 15 | 16 | To benchmark: 17 | 18 | gcc -O3 bench.c 19 | ./a.out 20 | 21 | Exported functions are defined in src/sort.singeli, and their C prototypes appear at the bottom of sort.c: the arguments for sorts are array, length, aux (or scratch buffer), and possibly aux length in bytes. These are likely to change over time. 22 | 23 | ## Overview 24 | 25 | Singeli sort currently hybridizes the following algorithms; all are used for `sort32` and other functions use subsets. The overall structure is that the glidesort layer may call quicksort, which calls the different base cases in various situations. 
26 | 27 | - Quicksort partitioning from [fluxsort](https://github.com/scandum/fluxsort) and [crumsort](https://github.com/scandum/crumsort) 28 | - Outer merge layer: modified [glidesort](https://github.com/orlp/glidesort) ([powersort](https://github.com/sebawild/powersort) rules made lazy to defer to quicksort if runs aren't found) 29 | - Merging: based on [piposort](https://github.com/scandum/piposort) 30 | - Small arrays: sorting networks as in [ipn_unstable](https://github.com/Voultapher/sort-research-rs/blob/main/src/unstable/rust_ipn.rs), extra merging and insertion following [quadsort](https://github.com/scandum/quadsort) 31 | - Radix sort: mostly like [ska_sort_copy](https://github.com/skarupke/ska_sort) 32 | - Counting sort: see [section](https://github.com/mlochbaum/rhsort#counting-sort) in rhsort 33 | - [Robin Hood](https://github.com/mlochbaum/rhsort) sort 34 | 35 | In progress, still has various issues: 36 | 37 | - Drapesort, similar to [drop-Merge sort](https://github.com/emilk/drop-merge-sort) 38 | 39 | Other methods to consider later: 40 | 41 | - In-place partitioning with [pdqsort](https://github.com/orlp/pdqsort). Slower than crumsort but it does adapt to mostly-sorted data well. 42 | - Interleaved merges and bidirectional partitioning from glidesort. These have not yet been demonstrated to improve performance relative to fluxsort, and there are indications that they slow things down on older processors in addition to bumping up code size. I'll wait for the paper explaining choices made before looking into them further. 43 | 44 | ## Guide to the source 45 | 46 | The source code is supposed to be the place to go to get full descriptions and details. I am certain it fails in this role—particularly in places where I don't expect anyone's reading, so please complain if a part you've chosen to read is not well explained! 47 | 48 | The general-use files: 49 | 50 | - sort.singeli Main file: include statements, and the sorting function definitions. 51 | - base.singeli Basic definitions to be used elsewhere. This includes operators, which are all user-defined in Singeli. 52 | - common.singeli Other definitions that are more specific than base but may be used in multiple places. 53 | - arith.singeli Some log and square root code to keep it out of the way. 54 | 55 | And specific algorithms: 56 | 57 | - quicksort.singeli 58 | - partition.singeli Partitioning 59 | - median.singeli Medians and pseudomedians for picking candidates 60 | - xorshift.singeli Pseudo-random number generator (PRNG) avoids patterns 61 | - merge.singeli Merging utilities and pisort 62 | - glide.singeli Glidesort strategy: use merges for natural runs 63 | - (merge.singeli) 64 | - small.singeli Small array sorting 65 | - network.singeli Sorting networks for some fixed sizes 66 | - ins.singeli Insertion sorting 67 | - (merge.singeli) 68 | - radix.singeli Radix sorts 69 | - prefix.singeli Prefix sums 70 | - count.singeli Counting sort 71 | - (prefix.singeli) 72 | - rh.singeli Robin Hood sort, for evenly distributed data 73 | - dropsort.singeli (unused) Dropsorts for nearly-sorted arrays 74 | 75 | Some quick notes on Singeli. Everything in brackets `{}` is run at compile time, so a call like `dist{dn}{U, minv, maxv}` is all inlined (`dist` is called a generator). Functions are called with parentheses and are used rarely, for things that are exported, used in many places, or recursive. 76 | 77 | All operators are user-defined, with many picked up from standard includes `skin/c` and `skin/cext`. 
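For example, src/base.singeli declares the variable-swap operator `<~>` and binds it to an ordinary generator:

    oper <~> swap infix none 20
    def swap{a, b} = { t:=a; a=b; b=t }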
Some of the ones that are unfamiliar relative to C are listed below.
78 | 
79 | | Syntax | Meaning
80 | |-------------|---------
81 | | `x <{dn} y` | Compare `x` and `y`, flipping ordering if `dn` is `1`
82 | | `x -> i` | Value at offset `i` from pointer `x`
83 | | `x <- v` | Store `v` at pointer `x`
84 | | `x <-{i} v` | Store `v` at offset `i` from pointer `x`
85 | | `x <-> y` | Swap values at pointers `x` and `y`
86 | | `a <~> b` | Swap values of variables `a` and `b`
87 | | `T~~v` | Cast `v` to same-width type `T`
88 | | `T^~v` | Promote `v` to supertype `T`
89 | | `T<~v` | Narrowing conversion of `v` to type `T`
90 | 
91 | Singeli sort uses a fair amount of compile-time trickery to support lots of sorting functions while keeping the code reasonably clean. Functions all support sorting in both directions (`dn` is `0` for up and `1` for down), and many of them support a sort-by operation that actually passes around a tuple of pointers: one to be sorted and others to be moved in the same pattern. A related operation is "grade", which reorders indices as the data should be ordered, and leaves the data intact (it may partially or completely sort it in aux space). A few funny operators are used to support sort-by: for example `*+T` to turn tuple type `T` into a tuple of pointer types, `*?` to avoid loading from extra pointers until the values are needed, and `>*` to compare pointer values.
92 | 
93 | | Syntax | Meaning
94 | |--------------|---------
95 | | `>p` | Get first pointer only from multi-pointer
96 | | `p >*{dn} q` | Compare `p` and `q` by value at first pointer
97 | | `*?p` | Lazy load at `p`
98 | | `p *? i` | Lazy load at offset `i` from `p`
99 | | `*+T` | Tuple of pointer types
100 | 
--------------------------------------------------------------------------------
/bench.c:
--------------------------------------------------------------------------------
1 | #include <stdio.h>
2 | #include <stdlib.h>
3 | #include <string.h>
4 | #include <time.h>
5 | 
6 | #define sortname "singelisort"
7 | 
8 | // Options for test to perform:
9 | #if RANGES // Small range
10 | #define datadesc "10,000 small-range 4-byte integers"
11 | #elif WORST // RH worst case
12 | #define datadesc "small-range plus outlier"
13 | #else // Random
14 | #define datadesc "random 4-byte integers"
15 | #endif
16 | 
17 | #if WORST
18 | #define MODIFY(arr) arr[0] = 3<<28
19 | #else
20 | #define MODIFY(arr) (void)0
21 | #endif
22 | 
23 | typedef int T;
24 | typedef size_t U;
25 | static U monoclock(void) {
26 | struct timespec ts;
27 | clock_gettime(CLOCK_MONOTONIC, &ts);
28 | return 1000000000*ts.tv_sec + ts.tv_nsec;
29 | }
30 | 
31 | #include "sort.c"
32 | 
33 | static void sort32_alloc(T *x, U n) {
34 | U a = (n + 4*(n<1<<16 ? n : 1<<16))*sizeof(T);
35 | T *aux = (T*)malloc(a);
36 | sort32(x, n, aux, a);
37 | free(aux);
38 | }
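// (Sizing note: the aux passed to sort32 is n elements of merge space plus
// 4*min(n, 1<<16) more for the distribution base cases; try_rh in
// src/rh.singeli requires 2*n + n/2 aux elements before it will attempt
// Robin Hood sorting, and the counting-sort path in src/quicksort.singeli
// is limited to ranges under 1<<18.)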
atoi(argv[2]) : 48;
63 | } else {
64 | max=atoi(argv[argc-1]);
65 | if (argc>2) min=atoi(argv[argc-2]);
66 | }
67 | }
68 | 
69 | U sizes[max+1];
70 | if (!ls) { for (U k=0,n=1 ; k<=max; k++,n*=10) sizes[k]=n; }
71 | else { for (U k=0,n=34; k<=max; k++,n=n*1.3+(n<70)) sizes[k]=n; if(max==48)sizes[max]=10000000; }
72 | 
73 | #ifndef RANGES
74 | U s=sizes[max]; s+=n_iter(s)-1;
75 | U q=sizes[min]; q+=n_iter(q)-1;
--------------------------------------------------------------------------------
/src/arith.singeli:
--------------------------------------------------------------------------------
6 | def floor_log2{n:T, base} = {
7 | l:U = base
8 | nt := n>>l; while (nt!=0) { ++l; nt >>= 1 }
9 | l
10 | }
11 | def floor_log2{n:T} = floor_log2{n, 0}
12 | 
13 | def sqrt_approx{n:T, init} = {
14 | s:T = init
15 | @for_const (i to 5) s = (s + n/s) / 2
16 | s
17 | }
18 | 
--------------------------------------------------------------------------------
/src/base.singeli:
--------------------------------------------------------------------------------
1 | # Base definitions that determine how "our version" of Singeli works
2 | 
3 | # Mostly it looks like C
4 | include 'arch/c' # Backend
5 | include 'skin/c' # Operators
6 | include 'skin/cext'
7 | 
8 | include 'util/tup' # List utilities: scan and so on
9 | include 'util/kind' # Kind-test functions like ktup
10 | 
11 | oper <~> swap infix none 20
12 | oper <-> swap_ptr infix none 20
13 | oper >* ptr_gt infix none 20
14 | oper *? lazy_load prefix 60
15 | oper *? lazy_load infix right 50
16 | oper *+ pnt_each prefix 60
17 | oper > single prefix 80
18 | 
19 | def tptr{T} = 'pointer'===typekind{T}
20 | 
21 | def single = match { {{x,..._}}=>x; {x}=>x }
22 | 
23 | def can_use_unstable{x} = (not ktup{x}) or 1 == length{x}
24 | 
25 | def swap{a, b} = { t:=a; a=b; b=t }
26 | def swap_ptr{a:*T, b:*T} = { t := *a; a <- *b; b <- t }
27 | 
28 | def __pnt{{f,...r}} = __pnt{f}
29 | def ptr_gt{a,b} = load{a,0} > load{b,0}
30 | def ptr_lt{a,b} = load{a,0} < load{b,0}
31 | 
32 | def usize = primtype{'u',width{*void}}
33 | def U = usize
34 | def bytes{t if ktyp{t}} = width{t} / 8
35 | 
36 | # Change to unsigned: here we always subtract smaller from greater
37 | def __sub{a:*P, b:*P} = emit{usize, 'op -', a, b}
38 | 
39 | def addtype{T} = if (tptr{T}) usize else T
40 | def __add{a:T, b:(u1)} = __add{a, addtype{T}^~b}
41 | def __sub{a:T, b:(u1)} = __sub{a, addtype{T}^~b}
42 | 
43 | def clone{old} = { new:=old }
44 | 
45 | def leading_zeros{x:(u64)} = u8 <~ emit{i32, '__builtin_clzll', x}
46 | 
47 | def expect{e}{X} = emit{type{X}, '__builtin_expect', X,e}
48 | def RARE = expect{0}
49 | def LIKELY = expect{1}
50 | 
51 | # Pervasion
52 | local include 'util/perv'
53 | extend perv1{clone}
54 | extend perv2{load}
55 | extend (perv{3}){store}
56 | extend perv2{__add}
57 | extend perv2{__sub}
58 | extend perv2{__or }
59 | extend perv2{__shl}
60 | extend perv2{__shr}
61 | extend perv1{__decr}
62 | def eachrec{f,...a} = { (extend (perv{length{a}}){f}){...a} }
63 | 
64 | local def extend ecmp{f,g} = {
65 | def f{{a,..._},{b,..._}} = f{a,b}
66 | def f{dn==0} = f
67 | def f{dn==1} = g
68 | }
69 | local def dn_ind = tup{0,1, 3,2, 5,4, 7,6}
70 | extend ({...f}=>each{ecmp,f,select{f,dn_ind}}){
71 | __eq, __ne, __lt, __gt, __le, __ge,
72 | ptr_gt, ptr_lt
73 | }
74 | 
75 | def pnt_each{...T} = eachrec{__pnt, ...T}
76 | 
77 | # ++{1} is -- and +={1} is -=
78 | def __incr{dn==0} = __incr
79 | def __incr{dn==1} = __decr
80 | 
81 | # -{1} swaps arguments
82 | def __sub{dn==0} = __sub
83 | def __sub{dn==1}{a,b} = b-a
84 | 
85 | def type{{s,..._}} = type{s}
86 | def scaltype = match {
87 | {*T} => scaltype{T}
88 | {T if ktyp{T}} => T
89 | {x} => scaltype{type{x}}
90 | }
91 | 
92 | def isid{ptr} = not tptr{type{ptr}}
93 | def load{ptr, i if isid{ptr}} = 
ptr + i 94 | def store{ptr=='sink', i, val} = val 95 | def __add{ptr=='sink', i} = 'sink' 96 | 97 | def lazy_load{p} = lazy_load{p,0} 98 | def lazy_load{p,i} = { 99 | def s = load{single{p},single{i}} 100 | if (not ktup{p}) s; else { 101 | def ll{p,i}{} = load{p,i} 102 | def pr = slice{p,1} 103 | merge{ 104 | tup{s}, 105 | if (ktup{i}) each{ll, pr, slice{i,1}} else each{ll{.,i},pr} 106 | } 107 | } 108 | } 109 | def store{p, i, val if kgen{val}} = store{p, i, val{}} 110 | def get{a} = a 111 | def get{a if kgen{a}} = a{} 112 | extend perv1{get} 113 | 114 | def for{vars,begin,end,iter} = { 115 | def e = usize^~end 116 | i := usize^~begin 117 | while (i < e) { 118 | iter{i, vars} 119 | i += 1 120 | } 121 | } 122 | def for_backwards{vars,begin,end,iter} = { 123 | i := usize^~end 124 | def e = usize^~begin 125 | while (i > e) { 126 | i -= 1 127 | iter{i, vars} 128 | } 129 | } 130 | def for_const{vars,begin,end,iter} = { 131 | if (begin < end) { 132 | for_const{vars,begin, end-1, iter} 133 | iter{end-1, vars} 134 | } 135 | } 136 | def for_unroll{unr}{vars,begin,end,iter} = { 137 | def e = usize^~end 138 | i:usize = begin 139 | eu := e & ~ usize~~(unr-1) 140 | while (i < eu) { 141 | @for_const (j to unr) iter{i+j, vars} 142 | i += unr 143 | } 144 | while (i < e) { 145 | iter{i, vars} 146 | i += 1 147 | } 148 | } 149 | -------------------------------------------------------------------------------- /src/common.singeli: -------------------------------------------------------------------------------- 1 | # Utilities likely to be useful for multiple sorting algorithms 2 | 3 | def map{op, a, b, n} = @for (a over n) op{a,b} 4 | def map{op, a, b:*T, n} = @for (a, b over n) op{a,b} 5 | 6 | def map{op, a , {...b}, n} = each{map{op, ., b, n}, a} 7 | def map{op, {...a}, {...b}, n} = each{map{op, ., ., n}, a, b} 8 | 9 | # memset/memcpy 10 | def set = map{=, ...} 11 | 12 | def reverse{x:*T, n} = { 13 | xr := x + n-1 14 | @for (i to n/2) x+i <-> xr-i 15 | } 16 | 17 | def filter_neq{dst, src, len, v} = { 18 | d:=dst; l:=len 19 | s1:= >src-1; while (l > 0 and s1->l == v) --l 20 | @for_unroll{8} (src over i to l) { 21 | d <- src; d += v != >src # Branchless update 22 | } 23 | d 24 | } 25 | 26 | def findrange{dn, arr, len} = { 27 | minv:=arr->0; maxv:=minv; 28 | @for (arr over i from 1 to len) { 29 | if (arr <{dn} minv) minv=arr 30 | if (arr >{dn} maxv) maxv=arr; 31 | } 32 | tup{minv,maxv} 33 | } 34 | def readrange{arr, len} = tup{arr->(-1), arr->len} 35 | 36 | def dist{dn} = { 37 | def dsub{a:T, b:T} = primtype{'u',width{T}} ~~ (b -{dn} a) 38 | def dsub{U,a,b} = U ^~ dsub{a,b} 39 | } 40 | def dist{...as if 1 < length{as}} = dist{0}{...as} 41 | 42 | # Try the given types to lower constant overhead on small arrays 43 | def index_options{sort, n:U, test, itypes} = { 44 | def done = makelabel{} 45 | def try{C} = { if (C0 >{dn} x->1 # Run is descending 54 | l:U = 2 # Run length 55 | def follow_run{cmp} = { 56 | while (l < n and cmp{x->(l-1), x->l}) ++l 57 | } 58 | if (not desc) follow_run{<={dn}} else follow_run{>{dn}} 59 | tup{desc, l} 60 | } 61 | -------------------------------------------------------------------------------- /src/count.singeli: -------------------------------------------------------------------------------- 1 | local include './prefix' 2 | 3 | def incrp{p} = p <- p->0 + 1 4 | def widen{v} = promote{width{*void}, v} 5 | 6 | def do_count{val:*T, len:U, count:*U, min:T, zero} = { 7 | c := count - widen{min} 8 | @for (val over len) { 9 | incrp{c + widen{val}} 10 | if (zero) val=0 11 | } 12 | } 13 | 14 | 
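# (Two strategies follow: count_fill rewrites the array directly from the
# counts, emitting a run of each value in order, so it branches once per
# bucket and wins when the range is much smaller than n; count_sum instead
# scatters value differences at the positions where the output changes and
# rebuilds the data with a branchless prefix sum.)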
def count_fill{dn, x, n, count, min:T, range:U} = { 15 | # Count the values 16 | do_count{x, n, count, min, 0} 17 | # Write based on the counts 18 | dst := x; v := if (not dn) min else min + T<~(range-1) 19 | def for_dn = if (not dn) for else for_backwards 20 | @for_dn (count over range) { 21 | set{dst, v, count}; dst += count; ++{dn}v 22 | } 23 | } 24 | 25 | def count_sum{dn, x, n, count, min:T, range:U} = { 26 | # Count, and zero, the array 27 | do_count{x, n, count, min, 1} 28 | 29 | # Write differences to x 30 | j:U = if (not dn) 0 else range-1 # Index in count 31 | r := undefined{U} # Running total 32 | while ((r=count->j) == 0) ++{dn}j # Skip leading 0 counts quickly 33 | x0 := min + T<~j # First result 34 | while (r < n) { incrp{x+r}; ++{dn}j; r += count->j } 35 | 36 | prefix_sum{dn}{x, n, x0} 37 | } 38 | 39 | # Counting sort of the n values starting at x 40 | def count_sort{dn, x:*T, n:U, aux:*U, min:T, range:U} = { 41 | set{aux, 0, range} 42 | if (range < n/8) { # Short range: branching on count is cheap 43 | count_fill{dn, x, n, aux, min, range} 44 | } else { 45 | count_sum{dn, x, n, aux, min, range} 46 | } 47 | } 48 | 49 | # Assume full range 50 | def count_sort{dn, x:*T, n:U, aux} = { 51 | def range = 1<i # Read next value 27 | u := x->keep # Latest kept value 28 | if (u <={dn} v) { # In order? 29 | ++i; ++keep; x <-{keep} v 30 | } else { 31 | # Drop on left or right? 32 | # We'll choose the side that lets us drop the fewest 33 | j:U = 1 34 | vp := v 35 | def done = makelabel{} 36 | while (1) { 37 | vj := undefined{T} 38 | if (j == n-i or u <={dn} (vj = x->(i+j)) or vp >{dn} vj) { # Drop right 39 | @for (j) { aux <-{lows} x->i; ++i; ++lows } 40 | goto{done} 41 | } 42 | if (j > keep or x->(keep-j) <={dn} v) { # Drop left 43 | d := x + (keep-j+1) # Drop from here and replace with x+i 44 | @for (d over j) { --highs; aux <-{highs} d; d = x->i; ++i } 45 | goto{done} 46 | } 47 | vp = vj 48 | ++j 49 | } 50 | setlabel{done} 51 | } 52 | } 53 | 54 | # Sort values and merge back into x 55 | def merge_back{old_end, new_end, drops, drop_len, rev, cmp} = { 56 | # For stability, reverse and sort instead of sorting down 57 | if (rev) reverse{drops, drop_len} 58 | # Values past j are no longer needed, so use them as aux space 59 | sort{drops, drop_len, x + old_end} 60 | 61 | # Now merge 62 | j := old_end; i := new_end 63 | # Guarding index j below takes time, so stop before it's needed 64 | stop:U = 0 65 | while (stopstop, x->0}) ++stop 66 | # Main loop 67 | @for_backwards (d in drops over _ from stop to drop_len) { 68 | v:=undefined{T}; jn:=undefined{U} 69 | while (cmp{d, (v = x->(jn=j-1))}) { --i; x <-{i} v; j=jn } 70 | --i; x <-{i} d 71 | } 72 | # Take care of the part that couldn't be guarded 73 | if (stop > 0) { 74 | while (j>0) { --j; --i; x <-{i} x->j } 75 | set{x, drops, stop} 76 | } 77 | } 78 | 79 | j := keep + 1 # Number of kept values 80 | jh := n - lows # Plus high values once those are merged 81 | if (j < jh) merge_back{j, jh, aux+highs, n-highs, 1, <={dn}} 82 | if (lows > 0) merge_back{jh, n, aux, lows, 0, < {dn}} 83 | } 84 | -------------------------------------------------------------------------------- /src/glide.singeli: -------------------------------------------------------------------------------- 1 | # Merging methods 2 | # Merging 4 segments at a time is most efficient, but 3 is used for 3 | # unbalanced natural runs, and 2 at the end 4 | # Should probably implement versions where the left or right half is 5 | # already in place, to avoid the set{} calls below 6 | def 
merge_into{dn, dst:*T, src:*T, left:U, n:U} = { 7 | parity_merge_any{dn, T}(dst, src, left, n) 8 | } 9 | # Merge of 4 10 | def quad_merge{dn, x:*T, a, b, c, d, aux:*T} = { 11 | l := a + b 12 | r := c + d 13 | merge_into{dn, aux, x, a, l} # Left half 14 | merge_into{dn, aux+l, x+l, c, r} # Right half 15 | merge_into{dn, x, aux, l, l+r} # Combine 16 | } 17 | # Merge of 3, as a function since it's used twice 18 | fn triple_merge{dn, T}(x:*T, l:U, m:U, r:U, aux:*T) : void = { 19 | if (l < r) { 20 | h := l + m 21 | merge_into{dn, aux, x, l, h} # Left 22 | set{aux+h, x+h, r} # Right 23 | merge_into{dn, x, aux, h, h+r} # Combine 24 | } else { 25 | h := m + r 26 | set{aux, x, l} # Left 27 | merge_into{dn, aux+l, x+l, m, h} # Right 28 | merge_into{dn, x, aux, l, l+h} # Combine 29 | } 30 | } 31 | 32 | # Return a generator that gets depth 33 | def powersort_init{ns} = { 34 | n := u64^~ns 35 | scale := (n + ((1<<62) - 1)) / n 36 | def merge_tree_depth{mid, nl, nr} = { 37 | mid2 := 2 * (u64^~mid) 38 | u64^~leading_zeros{(scale * (mid2 - u64^~nl)) ^ (scale * (mid2 + u64^~nr))} 39 | } 40 | } 41 | 42 | fn glide_sort{dn, T, min_run, sort}(x:*T, n:U, aux:*void, aux_bytes:U) : void = { 43 | 44 | def base_sort{x, n} = sort{x, n, aux, *u8~~stack - *u8~~aux} 45 | def merge_2{x, l, r} = merge_pair{dn, x, l, l+r, *T~~aux} 46 | def merge_3{x, l, m, r} = triple_merge{dn, T}(x, l, m, r, *T~~aux) 47 | def merge_4{x, a, b, c, d} = quad_merge{dn, x, a, b, c, d, *T~~aux} 48 | 49 | # A logical run consists of a length (e.g. nl) and sortedness (sl) 50 | # The starting position is maintained separately 51 | # Sortedness indicates 52 | # 0 if unsorted 53 | # mid for two sorted sequences 54 | # ~0 if sorted (length is also viable but this is easier) 55 | def SORTED = (1 << width{U}) - 1 56 | 57 | # "Run" of length at most r starting at x 58 | # Modify r to give the length of the run, and return its sortedness 59 | # Always searches until a natural run of length >=min_run is found 60 | # This coalesces unsorted runs, never returning two in a row 61 | saved_run:U = 0 62 | def new_run{x, r:U} = { 63 | s:U = 0 64 | if (saved_run > 0) { 65 | r = saved_run; s = SORTED 66 | saved_run = 0 67 | } else { 68 | i := -(U~~min_run) # Run start 69 | l := undefined{U} # Run length 70 | desc := undefined{u1} # Run is descending 71 | def no_run = makelabel{} 72 | # Leave runs with many equal elements to quicksort 73 | def reject_run{p,l} = { f := l/4 + l/16; p->f == p->(l-f) } 74 | do { 75 | i += min_run 76 | if (r-i < min_run) goto{no_run} 77 | tup{desc,l} = find_run{dn, x+i, r-i} 78 | } while (l < min_run or reject_run{x+i, l}) 79 | # Haven't jumped to no_run, so we have a run 80 | # Backtrack to find the beginning, and reverse it if descending 81 | def find_begin{cmp} = { 82 | i0 := i 83 | while (i > 0 and cmp{x->(i-1), x->i}) --i 84 | l += i0 - i 85 | } 86 | if (not desc) find_begin{<={dn}} 87 | else { find_begin{>{dn}}; reverse{x+i, l} } 88 | # Next set of values might be the run or the part after 89 | if (i == 0) { 90 | r = l; s = SORTED 91 | } else { 92 | saved_run = l; r = i 93 | } 94 | setlabel{no_run} 95 | } 96 | s 97 | } 98 | 99 | def sort_run{x, n, s} = { 100 | if (s == 0) base_sort{x, n} 101 | else if (s < n) merge_2{x, s, n-s} 102 | } 103 | 104 | # Modify xp, nr/sr to merge in nl/sl on the left 105 | def logical_merge{xp, nl, sl, nr, sr} = { 106 | xr := xp 107 | xp -= nl # Point to new run start 108 | if ((sl | sr) == 0) { # Both unsorted (TODO check aux_bytes) 109 | sr = 0 110 | } else { 111 | s_min := sl; if (sr < s_min) s_min 
= sr
112 | if (s_min == 0) { # One unsorted
113 | # Sort the unsorted half if it's under two thirds of total size
114 | if (sr == 0 and nl > nr/2) { base_sort{xr, nr}; sr = SORTED; s_min = sl }
115 | if (sl == 0 and nr > nl/2) { base_sort{xp, nl}; sl = SORTED; s_min = sr }
116 | }
117 | if (s_min == 0) {
118 | sr = 0
119 | } else if (s_min == SORTED) { # Both sorted, now split-sorted
120 | sr = nl
121 | } else {
122 | if (sr == SORTED) merge_3{xp, sl, nl-sl, nr}
123 | else if (sl == SORTED) merge_3{xp, nl, sr, nr-sr}
124 | else merge_4{xp, sl, nl-sl, sr, nr-sr}
125 | sr = SORTED
126 | }
127 | }
128 | nr += nl
129 | }
130 | 
131 | def merge_depth = powersort_init{n}
132 | 
133 | # The merge stack has a frame for each defined run
134 | stack_top := *U~~(*u8~~aux + aux_bytes)
135 | stack := stack_top
136 | def Num = 0; def Sort = 1; def Depth = 2; def Frame = 3
137 | def has_stack{} = stack < stack_top
138 | def merge_p{} = {
139 | logical_merge{xp, stack->Num, clone{stack->Sort}, np, sp}
140 | stack += Frame
141 | }
142 | 
143 | xp := x
144 | np := n; sp := new_run{x, np}
145 | 
146 | xi := x + np; end := x + n
147 | while (xi < end) {
148 | # New run: now only used to determine if previous should be merged
149 | # Next round when it's previous it may be merged
150 | nn := end - xi; sn := new_run{xi, nn}
151 | target_depth := merge_depth{xi - x, np, nn}
152 | xi += nn
153 | 
154 | # Merge previous run with the ones that want to be deeper than it
155 | while (has_stack{} and stack->Depth >= target_depth) merge_p{}
156 | 
157 | # Push to stack; shift new to prev
158 | stack -= Frame
159 | stack <-{Num} np; stack <-{Sort} sp; stack <-{Depth} target_depth
160 | xp += np
161 | np = nn; sp = sn
162 | }
163 | while (has_stack{}) merge_p{}
164 | sort_run{x, np, sp}
165 | }
166 | 
167 | fn glide_sort{dn, T}(x:*T, n:U, aux:*void, aux_bytes:U) : void = {
168 | if (n < 16) {
169 | sort_lt32{dn, x, n, 16}
170 | } else {
171 | def flux{...a} = flux_sort{dn, T}(...a)
172 | glide_sort{dn, T, 256, flux}(x, n, aux, aux_bytes)
173 | }
174 | }
175 | 
--------------------------------------------------------------------------------
/src/ins.singeli:
--------------------------------------------------------------------------------
1 | # Guarded or unguarded insertion sort of len values at x
2 | # Unguarded requires x->(-1) to precede all of these values
3 | def insertion_sort{dn, x:*T, len:U, guard} = {
4 | # First value's already in place; insert the others
5 | @for (xi in x over i from 1 to len) {
6 | # j moves backward along the array until finding the right spot
7 | j := i; jn := i
8 | xj := xi
9 | while ((not guard or 0<j) and xi <{dn} (xj = x->(jn=j-1))) {
10 | x <-{j} xj; j=jn # Move previous value forward
11 | }
12 | x <-{j} xi
13 | }
14 | }
15 | # Default to guarded
16 | def insertion_sort{dn, x:*T, len:U} = insertion_sort{dn, x, len, 1}
17 | 
18 | # Sort an array where indices less than start are already sorted
19 | def insertion_finish{dn, dst, src, start, n} = {
20 | @for (i from start to n) {
21 | end := dst + i
22 | def xi = src*?i
23 | prev := end - 1
24 | if (*?prev >{dn} xi) {
25 | def xi = get{xi}
26 | if (*?dst >{dn} xi) {
27 | top := i
28 | do { end <- prev->0; --end; --prev } while (--top != 0)
29 | end <- xi
30 | } else {
31 | do { end <- prev->0; --end; --prev } while (*?prev >{dn} xi)
32 | end <- xi
33 | }
34 | }
35 | }
36 | }
37 | 
--------------------------------------------------------------------------------
/src/median.singeli:
--------------------------------------------------------------------------------
1 | local {
2 
| 3 | include './xorshift' 4 | 5 | # Given a pointer and an odd-length tuple of indices, return the index 6 | # of the median value without moving any values 7 | def locate_median{src:*T, inds} = { 8 | def l = length{inds} 9 | def k = l >> 1 # Median is greater than exactly k values 10 | 11 | # Count number of comparisons 12 | # Only l-1 counters: if median comes last it's found by elimination 13 | def sums = each{{_}=>{t:u8=0}, range{l-1}} 14 | 15 | def get{i} = src->select{inds, i} 16 | def s{i} = select{sums, i} 17 | 18 | @for_const (i from 0 to l-1) { 19 | vi := get{i} 20 | @for_const (j from i+1 to l) { 21 | c:u1 = vi > get{j} 22 | s{i} += c 23 | if (j < l-1) s{j} += ~c 24 | } 25 | if (s{i} == k) return{select{inds, i}} 26 | } 27 | select{inds, l-1} 28 | } 29 | fn locate_median_3{T,U}(src:*T, i0:U, i1:U, i2:U) : U = { 30 | locate_median{src, tup{i0,i1,i2}} 31 | } 32 | fn locate_median_5{T,U}(src:*T, i0:U, i1:U, i2:U, i3:U, i4:U) : U = { 33 | locate_median{src, tup{i0,i1,i2,i3,i4}} 34 | } 35 | def median_from{medfn}{array:*T, U} = { 36 | def fun = medfn{T,U} 37 | {...inds} => fun(array, ...inds) 38 | } 39 | def median3_from = median_from{locate_median_3} 40 | def median5_from = median_from{locate_median_5} 41 | 42 | } # end local 43 | 44 | def locate_3_median{array:*T, n:U} = { 45 | median3_from{array,U}{0,n/2,n-1} 46 | } 47 | 48 | def locate_3of3_pseudomedian{array:*T, n:U} = { 49 | q1 := n / 4 50 | q2 := n / 2 51 | q3 := n - q1 52 | def med = median3_from{array, U} 53 | med{ 54 | med{q1-1, q2-1, q3 }, # 136 55 | med{ 0 , q2 , q3+1}, # 048 56 | med{q1 , q2+1, n -1} # 257 57 | } 58 | } 59 | 60 | def locate_5of3_pseudomedian{array:*T, n:U} = { 61 | def xorshift16 = make_split_xorshift{tup{7,9,8}, n, 63} 62 | div := n / 16 63 | def med = median3_from{array, U} 64 | def get3{f} = med{...xorshift16{clone{f * div}, div}} 65 | median5_from{array,U}{...each{get3, tup{0,3,7,10,13}}} 66 | } 67 | -------------------------------------------------------------------------------- /src/merge.singeli: -------------------------------------------------------------------------------- 1 | # Merge sorting 2 | 3 | # Parity merge: branchless 4 | # Main data movement 5 | def parity_pos{dn, left, right, dst, i} = { 6 | l := left->0; r := right->i 7 | c := l <={dn} r 8 | if (c) r=l; dst <-{i} r 9 | right -= c 10 | left += c 11 | } 12 | def parity_neg{dn, left, right, dst, i} = { 13 | l := left->(-i); r := right->0 14 | c := l <={dn} r 15 | if (c) l=r; dst <-{-i} l 16 | right -= c 17 | left += c 18 | } 19 | 20 | # Merge halves of length-n array with constant n (4 and 8 used) 21 | def parity_merge_const{dn, n, dst, src} = { 22 | def h = n / 2 23 | left := src; right := src + h; dstc := dst 24 | @for_const (i to h) parity_pos{dn, left, right, dstc, i} 25 | 26 | left = src + (h-1); right = src + (n-1); dstc = dst + (n-1) 27 | @for_const (i to h) parity_neg{dn, left, right, dstc, i} 28 | } 29 | 30 | # Branchless merge, combining lengths left and n-left from src to dst 31 | # Not in-place: dst can't overlap src 32 | # With guard==0, must have left*2 == n 33 | # With guard==1, left == n/2 (n can be odd) 34 | # With guard==2, any lengths handled 35 | def parity_merge_any{dn, T} = parity_merge{dn, T, 2} 36 | fn parity_merge{dn, T, guard}(dst:*+T, src:*+T, left:U, n:U) : void = { 37 | parity_merge{dn, guard, dst, src, left, n} 38 | } 39 | def parity_merge_fn{dn, guard, dst, src, left, n} = { 40 | parity_merge{dn, eachrec{scaltype,dst}, guard}(dst, src, left, n) 41 | } 42 | def parity_merge{dn, guard, dst, src, left:U, 
n:U} = { 43 | def handle_any = guard >= 2 44 | def handle_odd = guard == 1 45 | 46 | right := n - left 47 | lpos := src ; rpos := src + left ; dpos := dst 48 | lneg := rpos - 1; rneg := lneg + right; dneg := dst + n - 1 49 | 50 | def done = makelabel{} 51 | half := if (handle_any) n/2 else left 52 | if (handle_any) { 53 | def cut_sides{{short,shneg,shpos}, {long,loneg,lopos}, cmp} = { 54 | ov := long - short 55 | # This is faster, but also needed for correctness of the 56 | # following parity merge! 57 | if (ov > short) { 58 | unbalanced_merge{shpos,short, lopos,long, dpos,n, cmp} 59 | goto{done} 60 | } 61 | # If there are ov "outside" elements on either side of the long 62 | # side, move them to get a balanced merge 63 | # If not, there are long-ov=short "inside" elements that will 64 | # merge before the last one on the short side, so it's safe to 65 | # perform short+short>half merges in that direction: only the 66 | # last can finish the short side 67 | def setneg{} = { loneg-=ov; dneg-=ov; set{dneg+1, loneg+1, ov} } 68 | def setpos{} = { set{dpos, lopos, ov}; lopos+=ov; dpos+=ov } 69 | if (RARE{ cmp{* >shneg, >lopos->short }}) setneg{} 70 | else if (RARE{~cmp{* >shpos, >loneg->(-short)}}) setpos{} 71 | else goto{nocut} 72 | half = short 73 | } 74 | def nocut = makelabel{} 75 | def ldata = tup{left ,lneg,lpos} 76 | def rdata = tup{right,rneg,rpos} 77 | if (left < half) cut_sides{ldata, rdata, <={dn}} 78 | else if (right < half) cut_sides{rdata, ldata, < {dn}} 79 | n = 2*half 80 | setlabel{nocut} 81 | } 82 | 83 | @for_unroll{2} (i to half) { 84 | parity_pos{dn, lpos, rpos, dpos, i} 85 | parity_neg{dn, lneg, rneg, dneg, i} 86 | } 87 | if (handle_odd and n%2 != 0) { 88 | l := lpos->0; r := rpos->half 89 | if (lpos > lneg-half) l = r 90 | dpos <-{half} l 91 | } 92 | setlabel{done} 93 | } 94 | 95 | def unbalanced_merge{shp, short, lop, long, dst, n, lt} = { 96 | # Step size on the long side 97 | def K = 8 98 | # Main loop must stop after placing element short-1 or long-K 99 | i:u64 = 0; j:u64 = 0; k:u64 = 0 100 | while (j+K<=long and ii # Next short value to place 103 | c:u64 = 0 # Number of elements u passes--its index if i 115 | while (lt{u, lop->j}) { dst<-{k}lop->j; ++j; ++k } 116 | dst <-{k} u; ++i; ++k 117 | } 118 | i = short-1; j = long-1 119 | @for_backwards (d in dst over _ from k to n) { 120 | u := shp->i; v := lop->j 121 | if (not lt{u,v}) { d = u; --i } 122 | else { d = v; --j } 123 | } 124 | } 125 | 126 | # Merge arrays of length l and n-l starting at a, using buffer aux 127 | # Can be done without moving both sides, but this way's easy 128 | def merge_pair{dn, x, left:U, n:U, aux} = { 129 | set{aux, x, n} 130 | parity_merge_fn{dn, 2, x, aux, left, n} 131 | } 132 | 133 | # Merge array x of size n, if units of length block are pre-sorted 134 | def merge_from{dn, x, n:U, aux, block} = { 135 | src:=x; dst:=aux 136 | w:U = block; while (w < n) { 137 | ww:=2*w 138 | i:U=0; while (i < n-w) { 139 | l := n-i; if (l>ww) l=ww 140 | parity_merge_fn{dn, 2, dst+i, src+i, w, l} 141 | i += ww 142 | } 143 | if (i < n) set{dst+i, src+i, n-i} 144 | src <~> dst 145 | w = ww 146 | } 147 | if (src != x) set{x, src, n} 148 | } 149 | 150 | # A bottom-down fast but not very adaptive merge sort 151 | # Requires dst != aux; src may be dst or aux 152 | fn pisort{dn, T}(dst:*+T, src:*+T, n:U, aux:*T) : void = { 153 | pisort{dn, call{pisort{dn,T}, ...}, dst, src, n, aux} 154 | } 155 | def pisort{dn, recur, dst, src, n:U, aux} = { 156 | if (n < 32) { 157 | sort_lt32{dn, dst, src, n} 158 | return{} 159 | } 
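# (Recursion step: each half of src is sorted into aux, using the matching
# half of dst as scratch, then aux is merged back into dst; the
# aux->(h1-1) <={dn} aux->h1 test skips the merge entirely when the two
# sorted halves are already in order, so sorted input costs only copies.)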
160 | 161 | h1 := n / 2; h2 := n - h1 162 | 163 | recur{aux, src, h1, dst} 164 | recur{aux + h1, src + h1, h2, dst + h1} 165 | 166 | if (aux->(h1-1) <={dn} aux->h1) { 167 | eachrec{set{., ., n}, dst, aux} 168 | } else { 169 | parity_merge_fn{dn, 1, dst, aux, h1, n} 170 | } 171 | } 172 | 173 | # Allow index part of src to be just a base index 174 | fn pisort{dn, T, I}(dst:tup{*T,*I}, src:tup{*T,I}, n:U, aux:tup{*T,*I}) : void = { 175 | pisort{dn, call{pisort{dn,T,I}, ...}, dst, src, n, aux} 176 | } 177 | fn pigrade{dn, T, I}(dst:*I, src:*T, n:U, aux:*void) : void = { 178 | ai := *I~~aux 179 | dv := *T~~(ai + n) 180 | av := dv + n 181 | pisort{dn, T, I}(tup{dv, dst}, tup{src, 0}, n, tup{av, ai}) 182 | } 183 | -------------------------------------------------------------------------------- /src/network.singeli: -------------------------------------------------------------------------------- 1 | # Network sorting: sort by ordering pairs of values in a fixed sequence 2 | # With a branchless swap the entire sort involves no branching 3 | # These networks also have a small depth, or number of layers 4 | # Swaps within a layer are subject to instruction-level parallelization 5 | 6 | # Not stable! Stable networks are mergesort-like and much bigger 7 | 8 | # The networks here are all proven optimal in depth, but the 16-value 9 | # network is not known to be optimal in number of comparisons 10 | 11 | # The first layer in each network covers all values, so if source and 12 | # destination differ, move from src to dst in that layer (i_move below) 13 | 14 | # Branchlessly order values at i and j while moving from src to dst 15 | local def order{dn}{dst, src, i, j} = { 16 | a := src->i; b := src->j 17 | x := (a^b) & - type{a}^~(a >{dn} b) 18 | dst <-{i} x^a 19 | dst <-{j} x^b 20 | } 21 | # Perform all swaps 22 | local def network_sort{i_move, i_inplace}{dn, dst, src} = { 23 | def run{s, i} = each{order{dn}{dst, s, ...}, ...flip{i}} 24 | run{src, i_move} # First round, move src to dst while swapping 25 | run{dst, i_inplace} # Now just swap 26 | } 27 | 28 | # Now we can write our index lists out all pretty 29 | local oper <> tup infix none 90 30 | 31 | # https://bertdobbelaere.github.io/sorting_networks.html#N4L5D3 32 | def network_sort_4 = network_sort{ 33 | tup{0<>2, 1<>3}, 34 | tup{0<>1, 2<>3, 35 | 1<>2} 36 | } 37 | # https://bertdobbelaere.github.io/sorting_networks.html#N8L19D6 38 | def network_sort_8 = network_sort{ 39 | tup{0<>2, 1<>3, 4<>6, 5<>7}, 40 | tup{0<>4, 1<>5, 2<>6, 3<>7, 41 | 0<>1, 2<>3, 4<>5, 6<>7, 42 | 2<>4, 3<>5, 1<>4, 3<>6, 43 | 1<>2, 3<>4, 5<>6} 44 | } 45 | # https://bertdobbelaere.github.io/sorting_networks.html#N12L39D9 46 | def network_sort_12 = network_sort{ 47 | tup{0<>8, 1<>7, 2<>6, 3<>11, 4<>10, 5<>9}, 48 | tup{0<>2, 1<>4, 3<>5, 6<>8, 7<>10, 9<>11, 49 | 0<>1, 2<>9, 4<>7, 5<>6, 10<>11, 50 | 1<>3, 2<>7, 4<>9, 8<>10, 51 | 0<>1, 2<>3, 4<>5, 6<>7, 8<>9, 10<>11, 52 | 1<>2, 3<>5, 6<>8, 9<>10, 53 | 2<>4, 3<>6, 5<>8, 7<>9, 54 | 1<>2, 3<>4, 5<>6, 7<>8, 9<>10} 55 | } 56 | # https://bertdobbelaere.github.io/sorting_networks.html#N16L60D10 57 | def network_sort_16 = network_sort{ 58 | tup{0<>13, 1<>12, 2<>15, 3<>14, 4<>8, 5<>6, 7<>11, 9<>10}, 59 | tup{0<>5, 1<>7, 2<>9, 3<>4, 6<>13, 8<>14, 10<>15, 11<>12, 60 | 0<>1, 2<>3, 4<>5, 6<>8, 7<>9, 10<>11, 12<>13, 14<>15, 61 | 0<>2, 1<>3, 4<>10, 5<>11, 6<>7, 8<>9, 12<>14, 13<>15, 62 | 1<>2, 3<>12, 4<>6, 5<>7, 8<>10, 9<>11, 13<>14, 63 | 1<>4, 2<>6, 5<>8, 7<>10, 9<>13, 11<>14, 64 | 2<>4, 3<>6, 9<>12, 11<>13, 65 | 3<>5, 6<>8, 7<>9, 10<>12, 66 | 3<>4, 
5<>6, 7<>8, 9<>10, 11<>12, 67 | 6<>7, 8<>9} 68 | } 69 | -------------------------------------------------------------------------------- /src/partition.singeli: -------------------------------------------------------------------------------- 1 | # Partition algorithms for quicksort 2 | 3 | # Stable partitioning with external memory 4 | # Place values v into dst_false or dst_true according to cmp{v, piv} 5 | # Return the number of true comparisons, or length of dst_true 6 | def flux_partition{src:*T, cmp, piv:T, dst_true:*T, dst_false:*T, n:U} = { 7 | # Number of true comparisons, and index into dst_true 8 | l:U = 0 9 | dst_f := dst_false 10 | @for_unroll{8} (src over n) { 11 | c := cmp{src, piv} 12 | # Write to both destinations: one will be overwritten 13 | dst_true <-{ l} src 14 | dst_f <-{-l} src; ++dst_f 15 | l += c 16 | } 17 | l 18 | } 19 | 20 | # Unstable in-place partitioning 21 | # Fulcrum partition works from the outside in, alternating sides as 22 | # necessary to maintain a small gap both on the left and the right. 23 | # It's faster than fluxsort partitioning for large arrays. I believe 24 | # this is because there are two active regions 25 | # src_left/dst_left, src_right/dst_right 26 | # versus the three in flux_partition 27 | # src, dst_true, dst_false 28 | # leading to less waiting for the cache 29 | def fulcrum_partition{x:*T, cmp, piv:T, aux:*T, n:U} = { 30 | # Partition bl elements at a time in the main loop 31 | def bl = 16 32 | dst_left := x; src_left := dst_left + bl 33 | dst_right := x + n; src_right := dst_right - bl 34 | 35 | # Create the initial gap between src and dst 36 | set{aux, dst_left, bl} 37 | set{aux + bl, src_right, bl} 38 | 39 | # Branchless move, as in flux_partition except left and right parts 40 | # move in opposite directions 41 | # The shared index l allows us to not increment dst_left, 42 | # and unconditionally decrement dst_right 43 | l:U = 0 # Number of values placed on the left so far 44 | def put{ptr} = { 45 | v := *ptr 46 | c := cmp{v, piv} 47 | dst_left <-{l} v; --dst_right 48 | dst_right <-{l} v 49 | l += c 50 | } 51 | # Perform num moves at once, where num <= bl at every call 52 | # Since the total gap is 2*bl, one side or the other must have room 53 | # That's the side we don't partition from! 54 | def part{for, num} = { 55 | diff := src_left - dst_left - l # Gap on left side 56 | if (diff < bl) { @for (num) { put{src_left}; ++src_left } } 57 | else { @for (num) { --src_right; put{src_right} } } 58 | } 59 | 60 | @for (n / bl - 2) part{for_const, bl} # Main loop, unrolled 61 | part{for, n % bl} # Finish up 62 | @for (2*bl) { put{aux}; ++aux } # Partition cleared values 63 | l 64 | } 65 | -------------------------------------------------------------------------------- /src/prefix.singeli: -------------------------------------------------------------------------------- 1 | # Prefix sums 2 | # Inclusive prefix sum is used everywhere: simpler to write and faster 3 | # If dn==1, subtract instead of adding 4 | 5 | local { 6 | 7 | # Prefix sum on the size-wi units of word(s) x of width ww 8 | # Values in x must be registers: the sum is done in place 9 | def prefix_word{wi, ww, x} = { 10 | # Shift amounts, e.g. 
tup{8, 16, 32} 11 | def shifts{w} = if (w {x += x<>= ww-we 42 | } 43 | # Return the number of elements summed; caller handles tail 44 | l * nw 45 | } 46 | 47 | } # end local 48 | 49 | # For counting sort 50 | def prefix_sum{dn}{x:*T, len, init:T} = { 51 | def we = width{T} 52 | if (we > 16) { 53 | psf{dn, for_unroll{4}}{x, len, clone{init}} 54 | } else { 55 | sum := U^~primtype{'u',we}~~init 56 | # Adjust for signed overflow and descending with xor 57 | def fixsum = if (not (issigned{T} or dn)) { {s}=>s } else { 58 | def off = (1<<(width{T}-1)) - dn 59 | sum ^= off 60 | xor:U = off; prefix_word{we, width{U}, xor} 61 | {s} => xor ^ s 62 | } 63 | # Full words 64 | lenq := ps_words{x, len, sum, fixsum} 65 | # Last partial word 66 | psf{dn, for}{x + lenq, len - lenq, clone{T<~fixsum{sum}}} 67 | } 68 | } 69 | 70 | # For radix sort: interleaved sums 71 | def radix_prefix_sum{dn, n, {...ptrs}, len} = { 72 | def T = scaltype{ptrs} 73 | if (not dn) slice{ptrs, 1} <- 0 74 | def sum{S,v} = each{{_}=>{s:S=S<~v}, ptrs} 75 | if (width{T} > 16) { 76 | psf{dn, for}{ptrs, len, sum{T, if (not dn) 0 else n}} 77 | } else { 78 | def fixsum = if (not dn) { {s}=>s } else { 79 | nw:U = U^~n; prefix_word{width{T}, width{U}, nw} 80 | {s} => nw - s 81 | } 82 | ps_words{ptrs, len, sum{U,0}, fixsum} 83 | } 84 | } 85 | -------------------------------------------------------------------------------- /src/quicksort.singeli: -------------------------------------------------------------------------------- 1 | include './partition' 2 | include './median' 3 | include './xorshift' 4 | include './arith' # Logs and square roots 5 | 6 | def base_cases{dn, dst:*T, src:*T, aux:*T, n:U, aux_bytes:U, min:T} = { 7 | # Short array 8 | if (n < 192) { 9 | pisort{dn, T}(dst, src, n, aux) 10 | return{} 11 | } 12 | 13 | # Distribution base cases 14 | if (isint{T}) { 15 | max := dst->n 16 | range := dist{dn}{min, max} 17 | def nmin = if (not dn) min else max 18 | if (U^~(range/4) < n and U^~range < aux_bytes/bytes{U} and range < (1<<18)) { 19 | # Always sort in place on dst 20 | # Counting sort could have a different src/dst, but if src isn't 21 | # dst then it's equal to aux, and we need that space 22 | if (src != dst) set{dst, src, n} 23 | count_sort{dn, dst, u32<~n, *u32~~aux, nmin, u32<~range + 1} 24 | return{} 25 | } 26 | if (width{T} == 32 and n <= (1<<16) and range < (1<<16)) { 27 | radpack32{dn}(*u32~~dst, *u32~~src, u32<~n, *void~~aux, u32~~nmin) 28 | return{} 29 | } 30 | } 31 | } 32 | 33 | # Robin Hood approval state 34 | def RH_UNTRIED = 0 35 | def RH_APPROVED = 2 36 | 37 | # No direction: doesn't affect median; built into sort{} and proc_pivots{} 38 | def get_pivot{array:*T, n:U, getaux, sort, proc_pivots, rh_state} = { 39 | # log_2 of minimum size for sampling 40 | sl0:U = 8 41 | # Output array and index 42 | arr:=array; ind:U = 0 43 | if (rh_state!=RH_UNTRIED and n <= 1024) { 44 | ind = locate_3of3_pseudomedian{array, n} 45 | } else if (rh_state!=RH_UNTRIED and n <= 1 << (sl0 = 14)) { 46 | ind = locate_5of3_pseudomedian{array, n} 47 | } else { 48 | aux := getaux{} 49 | # gap is the expected distance between adjacent samples 50 | # We'll get about n/gap samples 51 | log2:U = floor_log2{n, sl0} 52 | gap_min := 1 << (log2 / 2) 53 | gap := sqrt_approx{n, gap_min} 54 | 55 | # Collect samples with split xorshift and add to aux 56 | aux1 := aux 57 | def add{ind} = { aux1 <- array->ind; ++aux1 } 58 | mask := gap_min - 1 59 | def add3 = make_split_xorshift{tup{13,17,5}, n, mask, add} 60 | 61 | i:U = 0; while (i < n - (mask + 2 * gap)) 
add3{i, gap} 62 | ns := aux1 - aux 63 | sort{aux, ns, *void~~aux1, ns*bytes{T}} 64 | proc_pivots{aux, ns} 65 | arr = aux 66 | ind = ns / 2 67 | } 68 | arr -> ind 69 | } 70 | 71 | # Fluxsort recurrence: partition, handle right then left side 72 | # src may be equal to dst or aux 73 | def flux_recur{dn, recur, tailcall, piv:T, src:*T, dst:*T, aux:*T, n:U, aux_bytes:U, min:T, rh_state:(u8)} = { 74 | 75 | # Partition: left side directly in dst with length l, includes pivots 76 | l:U = 0 77 | rsrc := aux 78 | if (LIKELY{n < 1<<14}) { 79 | l = flux_partition{src, <={dn}, piv, dst, aux, n} 80 | } else { 81 | # Previous partitions must have used fulcrum too, so we haven't yet 82 | # touched aux, and src==dst 83 | l = fulcrum_partition{dst, <={dn}, piv, aux, n} # (unstable) 84 | rsrc = dst + l 85 | } 86 | r := n - l # Values on the right 87 | m := l # Values not on the right 88 | 89 | # Never recurse on a partition with more than this many values 90 | most := n - n/16 91 | 92 | # If most values end up on the left, they're probably mostly pivots 93 | if (l > most) { 94 | # Partition again with pivots on the right 95 | # This bounds performance by O(k*n) for only k unique values 96 | if (can_use_unstable{dst}) { 97 | l = filter_neq{dst, dst, m, piv} - dst 98 | set{dst+l, piv, m-l} 99 | } else { 100 | l = flux_partition{dst, <{dn}, piv, dst, aux+r, m} 101 | set{dst+l, aux+r, m-l} # Should probably write a reverse partition to avoid this 102 | } 103 | } 104 | 105 | # Sort the right-hand side, moving it from rsrc to dst 106 | rdst := dst + m 107 | if (r > most) { # Unbalanced 108 | pisort{dn, T}(rdst, rsrc, r, aux) 109 | } else { 110 | recur{rsrc, rdst, aux, r, aux_bytes, piv, rh_state} 111 | } 112 | 113 | if (l > most) { # Unbalanced 114 | pisort{dn, T}(dst, dst, l, aux) 115 | return{} 116 | } 117 | 118 | # Left-hand side by tail call 119 | src = dst 120 | n = l 121 | } 122 | 123 | fn flux_loop{dn, T}(src:*T, dst:*T, aux:*T, n:U, aux_bytes:U, min:T, rh_state:u8) : void = { 124 | while (u1~~1) { 125 | base_cases{dn, dst, src, aux, n, aux_bytes, min} 126 | 127 | # Find pivot and check for RH sorting 128 | def getaux{} = { a:=aux; if (a==src) a=dst; a } 129 | def proc_pivots{pivots, num} = { 130 | if (isint{T} and n<=1<<17 and rh_state!=RH_APPROVED) { 131 | # 1 if tried and not approved, 2 for RH_APPROVED 132 | rh_state = 1 + u8^~checkdist{dn, pivots, num, min, n, dist{dn}{U, min, dst->n}} 133 | } 134 | } 135 | def fun{gen} = call{gen{dn, T}, ...} 136 | piv := get_pivot{src, n, getaux, fun{flux_sort}, proc_pivots, rh_state} 137 | if (isint{T} and rh_state == RH_APPROVED) { 138 | try_rh{dn, dst, src, n, aux, aux_bytes, min, dst->n} 139 | } 140 | 141 | # Then finish sorting; loop around for tail call 142 | flux_recur{dn, fun{flux_loop},1, piv, src,dst,aux,n,aux_bytes, min,rh_state} 143 | } 144 | } 145 | 146 | fn flux_sort{dn, T}(x:*T, n:U, aux:*void, aux_bytes:U) : void = { 147 | if (n <= 192) { 148 | pisort{dn, T}(x, x, n, *T~~aux) 149 | return{} 150 | } 151 | 152 | # Find the minimum value and index of last maximum 153 | def block = 1024 154 | min := x->0; max := min 155 | i:U = 0; imax := i 156 | do { 157 | i0 := i 158 | i += block; if (i > n) i = n 159 | blm := x->i0 160 | @for (x over j from i0 to i) { 161 | if (x <{dn} min) min=x 162 | if (x >{dn} blm) blm=x 163 | } 164 | if (blm >={dn} max) { max=blm; imax=i; } # Save block index; refine later 165 | } while (i < n) 166 | 167 | do { --imax } while (x->imax <{dn} max) 168 | x <-{imax} x->(n-1) # (unstable) 169 | x <-{n-1} max 170 | 171 | # Now sort 172 
| flux_loop{dn, T}(x, x, *T~~aux, n-1, aux_bytes, min, 0) 173 | } 174 | -------------------------------------------------------------------------------- /src/radix.singeli: -------------------------------------------------------------------------------- 1 | # LSD Radix sort (includes bucket sort as the 1-step case) 2 | # Sorts the array according to the least significant byte, then the 3 | # next higher, and so on. 4 | # It has the best "generic" performance of any algorithm here, but has 5 | # no adaptivity and in fact slows down on some fairly common patterns 6 | # due to cache associativity. So while the implementation here is 7 | # general, only the 1- and 2-byte forms are used for hybrid sorting. 8 | 9 | local { 10 | 11 | include './prefix' 12 | 13 | # Fix the radix at 1 byte because other widths take too much computation 14 | def radix_bits = 8 15 | def count_len = 1< { {v} => K<~(v>>sh) }, 31 | range{len} * radix_bits, # Shifts 32 | where_signed{T, {_}=>i8, copy{len, u8}} # Types 33 | } 34 | } 35 | 36 | # Sort the n values in x with a number of radix passes equal to steps 37 | # Store counts in count, which must hold steps*count_len counts 38 | # The result for step i is stored at select{dsts,i} 39 | def radix_main{dn, src, n:U, dsts, count:*C} = { 40 | def T = scaltype{src} 41 | def steps = length{dsts} 42 | # Tuple of zeroed count arrays, and offset for signed ints 43 | def counts = init_counts{count, steps} 44 | def counts_off = where_signed{T, {c}=>c+128, counts} 45 | # Count frequency of all bytes simultaneously 46 | def keys = keyfns{steps, T} 47 | radix_counts{dn, single{src}, n, counts_off, keys} 48 | # Exclusive sum of each count array (interleaved for speed) 49 | radix_prefix_sum{dn, n, counts, count_len} 50 | # And do the radix sorting through the successive dsts 51 | def srcs = shiftright{tup{src}, dsts} 52 | each{radix_move{n}, srcs, dsts, counts_off, keys} 53 | } 54 | 55 | # Zero and split into num arrays with count_len values each 56 | def init_counts{space:*T, num} = { 57 | set{space, 0, num*count_len} 58 | scan{+, merge{space, copy{num-1, count_len}}} 59 | } 60 | 61 | # Perform all counts in a single pass 62 | def radix_counts{dn, x, n, counts, keys} = { 63 | # Rather than take an exclusive prefix sum in the ascending case, we 64 | # write counts at an offset of 1. 
The descending case subtracts 65 | # cumulative counts from n, so they should be inclusive with no offset 66 | def counts_shift = if (not dn) counts + 1 else counts 67 | @for (x over n) { 68 | def incr{c, k} = incrp{c + k{x}} 69 | each{incr, counts_shift, keys} 70 | } 71 | } 72 | 73 | # One step of radix sorting 74 | def radix_move{n}{src, dst, count, key} = { 75 | @for (src over n) { 76 | def k = key{single{src}} 77 | c := count->k 78 | dst <-{c} src 79 | count <-{k} c+1 80 | } 81 | } 82 | 83 | } # end local 84 | 85 | # Swap back and forth with aux, ending with the elements in x again 86 | def radix_inplace{dn, x:*T, n:U, aux:*T, count:*C} = { 87 | def steps = getsteps{x} 88 | radix_main{dn, x, n, cycle{steps,tup{aux,x}}, count} 89 | if (steps % 2) set{x, aux, n} 90 | } 91 | 92 | def radix_grade_inplace{dn, x:*T, xa, g:*I, ga, n:U, count:*C} = { 93 | def steps = getsteps{x} 94 | def c{...p} = cycle{steps, p} 95 | def dsts = flip{tup{ 96 | shiftleft{c{x,xa}, 'sink'}, # Start at xa; ignore last 97 | reverse {c{g,ga}} # End at g 98 | }} 99 | radix_main{dn, tup{x,scaltype{*I}~~0}, n, dsts, count} 100 | } 101 | 102 | # Radix sorting works even for a count one larger than the type maximum: 103 | # the count array can overflow, but after taking an exclusive sum it 104 | # only affects portions after the last element (that is, target indices 105 | # can't overflow) 106 | def radix_fits{dn}{n, T if isint{T} and not issigned{T}} = { 107 | # But the descending SWAR code in radix_prefix_sum fails at that length 108 | def adjust = dn and width{T}<=16 109 | n <= 1<buf, empty, buflen} 16 | 17 | # Stolen blocks go to xb 18 | xb := x 19 | threshold:U = thresh_init 20 | 21 | # Main loop: insert array entries into buffer 22 | @for (x over i from 0 to len) { 23 | j:U = pos{x} 24 | def h = buf*?j 25 | if (LIKELY{>h==empty}) { 26 | # Easy insert 27 | buf <-{j} x 28 | } else { 29 | # Collision 30 | end := insertval{dn, j, x, h, buf, empty} 31 | # Big collision 32 | if (RARE{end-j >= threshold}) { 33 | threshold = BLOCK 34 | stealblocks{j, end, buf, xb, pos, empty} 35 | } 36 | } 37 | } 38 | 39 | # Move all values from the buffer back to the array 40 | xt := filter_neq{xb, buf, buflen, empty} 41 | # Recover sentinel elements based on total count 42 | if (can_use_unstable{xt}) set{xt, empty, (x+len)-xt} 43 | 44 | xb 45 | } 46 | 47 | # Insert an element val to buf at init (cur := buf->init) 48 | # Return the location after the last value moved 49 | def insertval{dn, init:U, val, h, buf, empty:T} = { 50 | ins:=init; end:=init 51 | cur := get{h} 52 | # Reposition elements after val branchlessly during the search 53 | do { 54 | ++end; n := buf->end # Might write over this 55 | def c = val >={dn} cur # If we have to move past that entry 56 | buf <-{end-c} cur # Reposition cur 57 | ins += c # Increments until val's final location found 58 | cur = n 59 | } while (>cur != empty) # Until the end of the chain 60 | buf <-{ins} val 61 | 1+end # Account for just-inserted val 62 | } 63 | 64 | def stealblocks{start:U, end:U, buf, dst, pos, empty:T} = { 65 | def j = start 66 | # Find the beginning of the chain (required for stability) 67 | while (j>0 and >buf->(j-1)!=empty) --j 68 | # Move as many blocks from it as possible 69 | hj := buf+j; hf := buf+end 70 | while (hj <= hf-BLOCK) { 71 | @for (hj, dst over BLOCK) { dst = hj; >hj = empty } 72 | hj += BLOCK; dst += BLOCK 73 | } 74 | # Leftover elements might have to move backwards 75 | pr:U = j 76 | while (hj < hf) { 77 | e := *hj; hj <- empty; ++hj 78 | p:=pos{e}; if (p>pr) pr = 
p 79 | buf <-{pr} e; ++pr 80 | } 81 | } 82 | 83 | # Get value to index conversion 84 | # Modifies r! 85 | def getposfn{dn, minv, n:U, r} = { 86 | sh:U = 0 # Contract to fit range 87 | while (r>5*n) { ++sh; r=r>>1 } # Shrink to stay at O(n) memory 88 | {v} => dist{dn}{U,minv,>v} >> sh 89 | } 90 | 91 | # Statistical check of samples to make sure it's not too clumpy 92 | def checkdist{dn, sample, num:U, pos} = { 93 | prev := pos{sample->0} 94 | score:U = 0 95 | threshold := 60 + num/6 96 | good:u1 = 0 # result 97 | def bad = makelabel{} 98 | @for (sample over _ from 1 to num) { 99 | next:=pos{sample}; d:=next-{dn}prev; prev=next 100 | if (d<16) { score+=16-d; if (score >= threshold) goto{bad} } 101 | } 102 | good = 1 103 | setlabel{bad} 104 | good 105 | } 106 | def checkdist{dn, sample, num:U, minv:T, n:U, r} = { 107 | def pos = getposfn{dn, minv, n, clone{r}} 108 | checkdist{dn, sample, num, pos} 109 | } 110 | 111 | def rh_main{dn, alloc, check}{x, n:U, minv:T, maxv, r} = { 112 | def pos = getposfn{dn, minv, n, r} 113 | 114 | # Goes down to BLOCK once we know we have to merge 115 | sz:U = r + thresh_init # Buffer size 116 | aux := alloc{sz} 117 | check{pos} 118 | 119 | # Treat maxv as "empty": the buffer will swallow these, 120 | # but they can be recovered by counting 121 | empty := maxv 122 | xb := rh_insert{dn, x, n, aux, sz, pos, empty} 123 | 124 | # Merge stolen blocks back in if necessary 125 | l:U = U<~(>xb - >x) # Size of those blocks 126 | if (l > 0) { 127 | # Sort x[0..l] 128 | merge_from{dn, x, l, aux, BLOCK} 129 | # And merge with the rest of x 130 | merge_pair{dn, x, l, n, aux} 131 | } 132 | } 133 | 134 | # Sort array of ints with length n. 135 | # Assume there's enough aux space. 136 | fn rh_sort{dn, T, range}(x:*T, n:U, aux:*T) : void = { 137 | # Find the range. 138 | {minv, maxv} := range{dn, x, n} 139 | r:U = dist{dn}{U, minv, maxv} # Size of range 140 | if (r/4 < n) { 141 | count_sort{dn, x, n, *U~~aux, if (not dn) minv else maxv, r+1} 142 | } else { 143 | rh_main{dn, {sz}=>aux, {pos}=>1}{x, n, minv, maxv, r} 144 | } 145 | } 146 | 147 | def try_rh{dn, dst:*T, src:*T, n:U, aux:*T, aux_bytes:U, minv, maxv} = { 148 | na := aux_bytes / bytes{T} 149 | def exit = makelabel{} 150 | def req{cond} = { if (not cond) goto{exit} } 151 | req{n <= 1<<16} # Partitioning is better 152 | req{2*n+n/2 <= na} # Not enough space, quick rejection 153 | 154 | def alloc{sz} = { 155 | req{sz<=na} 156 | if (dst != src) set{dst, src, n} # Finally we commit and clear up aux 157 | aux 158 | } 159 | r:U = dist{dn}{U, minv, maxv} 160 | rh_main{dn, alloc, {pos}=>1}{dst, n, minv, maxv, r} 161 | 162 | return{} 163 | setlabel{exit} 164 | } 165 | -------------------------------------------------------------------------------- /src/small.singeli: -------------------------------------------------------------------------------- 1 | # Sorting small arrays, less than 32 elements 2 | # Follows quadsort: https://github.com/scandum/quadsort 3 | 4 | # The general strategy is to sort an initial portion (at least half the 5 | # array) with a power-of-two length using merge sort, then add the rest 6 | # of the elements with insertion sort. 
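# (For example, n=27 sorts x[0..16) with the 16-element network or parity
# merge, then insertion_finish inserts the remaining 11 values one at a time.)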
7 | # These methods are mostly branchless, with some shortcuts that make 8 | # them adapt slightly to already-sorted data 9 | 10 | local include './network' 11 | 12 | # Sort exactly 2 elements branchlessly 13 | def sort_2{dn, dst, src} = { 14 | c := src >*{dn} src+1 15 | t := src->(~c) 16 | dst <-{0} src->c 17 | dst <-{1} t 18 | c 19 | } 20 | def sort_2{dn, ptr} = sort_2{dn, ptr, ptr} 21 | 22 | # 0 to 3 elements, could be considered a bubble sort or insertion sort 23 | def sort_lt4{dn, dst, src, n} = { 24 | def mov = not is{dst,src} 25 | if (n > 1) { 26 | sort_2{dn, dst, src} 27 | if (n > 2) { 28 | if (mov) dst <-{2} src->2 29 | sort_2{dn, dst+1} 30 | sort_2{dn, dst} 31 | } 32 | } else if (mov and n == 1) { 33 | dst <- src->0 34 | } 35 | } 36 | 37 | # The adaptive quad swap 38 | def sort_4_quad{dn, dst, src} = { 39 | def s22{dst, src} = @for_const (i to 2) sort_2{dn, dst+2*i, src+2*i} 40 | s22{dst, src} 41 | if (sort_2{dn, dst+1}) { 42 | s22{dst, dst} 43 | sort_2{dn, dst+1} 44 | } 45 | } 46 | 47 | # Specialized parity merging for 8 or 16 elements 48 | # Always use 2 rounds of merges to get from ptr to swap and back 49 | def sort_8_16_parity{n, sort_q, merge_h}{dn, dst, src} = { 50 | def T = scaltype{dst} 51 | # n elements of swap space 52 | def swap = eachrec{{p:*T}=>{s:*T = copy{n,0}}, dst} 53 | 54 | # Sort groups of 2 or 4 elements 55 | def q = n/4 56 | @for_const (i to 4) sort_q{dn, dst + q*i, src + q*i} 57 | 58 | # Check to see if these groups need to be merged at all 59 | def chk{i} = dst+(i*q - 1) >*{dn} dst+(i*q) 60 | if (chk{1} or chk{2} or chk{3}) { 61 | # Two rounds of merging: dst to swap in two parts, then back 62 | def h = n/2 63 | @for_const (i to 2) parity_merge_const{dn, h, swap + h*i, dst + h*i} 64 | merge_h{dn, dst, swap} 65 | } 66 | } 67 | def sort_8_parity = sort_8_16_parity{ 68 | 8, sort_2, 69 | {dn,d,s}=>parity_merge_const{dn, 8, d, s} 70 | } 71 | def sort_16_parity = sort_8_16_parity{ 72 | 16, sort_4_quad, 73 | {dn,d,s}=>parity_merge_fn{dn, 0, d, s, 8, 16} 74 | } 75 | 76 | def sort_lt32{dn, dst, src, n:U, max} = { 77 | def un = can_use_unstable{src} and isint{scaltype{src}} 78 | def use{l, sort_un, sort_stable} = { 79 | (if (un) sort_un else sort_stable){dn, dst, src} 80 | def cpy{dst,src} = { 81 | if (isid{src} or (not dst===src and dst!=src)) { 82 | @for (dst,src over _ from l to n) dst=src 83 | } 84 | } 85 | eachrec{cpy, dst, src} 86 | insertion_finish{dn, dst, dst, l, n} 87 | } 88 | if (max <= 4 or n < 4) sort_lt4{dn, dst, src, n} 89 | else if (max <= 8 or n < 8) use{ 4, network_sort_4, sort_4_quad} 90 | else if (max <= 16 or n < 16) use{ 8, network_sort_8, sort_8_parity} 91 | else use{16, network_sort_16, sort_16_parity} 92 | } 93 | def sort_lt32{dn, dst, src, n:U} = sort_lt32{dn, dst, src, n, 32} 94 | def sort_lt32{dn, ptr, n:U, max if knum{max}} = sort_lt32{dn, ptr, ptr, n, max} 95 | def sort_lt32{dn, ptr, n:U} = sort_lt32{dn, ptr, ptr, n} 96 | -------------------------------------------------------------------------------- /src/sort.singeli: -------------------------------------------------------------------------------- 1 | # Main file that defines the exported sorting algorithms 2 | 3 | include './base' 4 | include './common' 5 | include './ins' 6 | include './merge' 7 | include './count' 8 | include './radix' 9 | include './rh' 10 | include './small' 11 | include './quicksort' 12 | include './glide' 13 | 14 | # For 1-byte inputs, counting sort is best except at small sizes 15 | fn sort8{dn, T}(x:*T, n:U, aux:*void) : void = { 16 | if (n < 16) { 17 | 
sort_lt32{dn, x, n, 16} 18 | } else if (radix_fits{dn}{n,u8}) { 19 | radix_inplace{dn, x, n, *T~~aux, *u8~~aux + n} 20 | } else { 21 | count_sort{dn, x, n, aux} 22 | } 23 | } 24 | 25 | # For 2-byte inputs, use radix or counting sort 26 | fn sort16{dn, T}(x:*T, n:U, aux:*void) : void = { 27 | if (n < 24) { 28 | sort_lt32{dn, x, n} 29 | } else if (n < 1<<15) { 30 | radix{dn, x, n, aux} 31 | } else { 32 | count_sort{dn, x, n, aux} 33 | } 34 | } 35 | 36 | # Reasonable for both 1-byte and 2-byte inputs 37 | fn grade8_16{dn, T,I}(g:*I, x:*T, n:U, aux:*void) : void = { 38 | if (n < 16) { 39 | sort_lt32{dn, tup{x,g}, tup{x,I~~0}, n, 16} 40 | } else if (width{T} == 8) { 41 | # Bucket sort; final x isn't needed and initial g is implicit 42 | radix_grade_inplace{dn, x, 0, g, 0, n, *U~~aux} 43 | } else { 44 | xa := *T~~aux 45 | ga := *I~~(xa+n) 46 | aa := *U~~(ga+n) 47 | radix_grade_inplace{dn, x, xa, g, ga, n, aa} 48 | } 49 | } 50 | 51 | local def up = 0 52 | local def down = 1 53 | 54 | export{'sort8', sort8 {up, i8}} 55 | export{'sort16', sort16{up, i16}} 56 | 57 | export{'grade8_64', grade8_16{up, i8 , u64}} 58 | export{'grade16_64', grade8_16{up, i16, u64}} 59 | 60 | export{'rhsort32', rh_sort{up, i32, findrange}} 61 | 62 | export{'sort32', glide_sort{up, i32}} 63 | export{'sort_u64', glide_sort{up, u64}} 64 | export{'sort_f64', glide_sort{up, f64}} 65 | -------------------------------------------------------------------------------- /src/xorshift.singeli: -------------------------------------------------------------------------------- 1 | # Xorshift generator, e.g. s^=s<<13; s^=s>>17; s^=s<<5; 2 | # All three steps for one value is expensive, so use intermediates too. 3 | # One run of the result calls action{} on and outputs 3 values 4 | def make_split_xorshift{shifts, s, m, action} = { 5 | seed := u32<~s; def doseed{op,a}{} = {seed ^= op{seed,a}} 6 | mask := u32<~m 7 | def updates = each{doseed, tup{<<,>>,<<}, shifts} 8 | {start:U, inc:U} => { 9 | def run{upd} = { 10 | def r = action{clone{start + U^~(seed & mask)}} 11 | upd{} 12 | start += inc 13 | r 14 | } 15 | each{run, updates} 16 | } 17 | } 18 | def make_split_xorshift{sh,s,m} = make_split_xorshift{sh,s,m,{v}=>v} 19 | --------------------------------------------------------------------------------
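A quick usage sketch for the exported sorts (hypothetical wrapper name; it mirrors bench.c's sort32_alloc, and the prototypes at the bottom of sort.c are the authority on signatures):

    #include <stdlib.h>
    #include "sort.c"

    // Sort n 4-byte integers: sort32 takes the array, its length,
    // an aux buffer, and the aux buffer's length in bytes.
    static void sort_ints(int *x, size_t n) {
      size_t a = (n + 4*(n < 1<<16 ? n : 1<<16)) * sizeof(int);
      int *aux = malloc(a);
      sort32(x, n, aux, a);
      free(aux);
    }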