├── README.md
├── shelfsort.cpp
└── times.md


/README.md:
--------------------------------------------------------------------------------
## Shelfsort: a new sorting algorithm

#### What basic properties does the Shelfsort algorithm have?
- O(N * log(N)) time
- O(sqrt(N)) memory (with constant pointer size)
- adaptive (fast for mostly pre-sorted inputs)
- stable (order of equal elements is unchanged)

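As a rough illustration of the adaptive property, the check below mirrors the test the included code runs before merging two adjacent sorted runs: if the last element of the first run is not greater than the first element of the second, the merge can be skipped. The function name `RunsAlreadyOrdered` and the use of `std::vector` are assumptions made for this sketch, not part of shelfsort.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in shelfsort.cpp): returns true when two adjacent
// sorted runs, data[0, mid) and data[mid, size), are already in order,
// so merging them would change nothing.
bool RunsAlreadyOrdered(const std::vector<int64_t>& data, std::size_t mid) {
	if (mid == 0 || mid >= data.size()) { return true; }  // one run is empty
	// using <= (not <) means runs that meet at equal elements are left untouched
	return data[mid - 1] <= data[mid];
}
```

For fully sorted input this check succeeds at every level, which is why the presorted case costs little more than one comparison per pair of runs.
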
#### What's novel about the Shelfsort algorithm?
The core idea is:
1. Divide the data into blocks of sqrt(N) size, and sort each block.
2. Use sqrt(N) indices to track which block is where.
3. Clear some blocks by merging data into scratch memory.
4. Merge blocks of data from where they are into whichever blocks are empty.

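The sizes used in steps 1 and 2 can be sketched as follows. The arithmetic mirrors the top of `ShelfSort` in shelfsort.cpp; the struct `ShelfLayout` and the function `ComputeLayout` are names invented for this sketch. For N = 2^20 it gives 4096 scratch elements, 2048-element blocks, and 512 blocks, so both the block size and the block count stay within a small constant factor of sqrt(N) = 1024.

```cpp
#include <cstdint>

// Hypothetical names for this sketch; the arithmetic follows ShelfSort().
struct ShelfLayout {
	uint32_t scratch_size;  // elements of scratch memory
	uint32_t block_size;    // elements per block (half of scratch)
	uint32_t total_blocks;  // number of blocks, also the number of indices tracked
};

// size is assumed to be a power of 2, as the implementation requires
ShelfLayout ComputeLayout(uint32_t size) {
	uint32_t log_size = 0;
	for (uint32_t v = size; v >>= 1; ) { log_size++; }  // floor(log2(size))
	ShelfLayout layout;
	layout.scratch_size = 1u << (2 + (log_size + 1) / 2);
	layout.block_size = layout.scratch_size / 2;
	layout.total_blocks = size / layout.block_size;
	return layout;
}
```
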
Here's an example of merging blocks:

1. The data is divided into blocks.  
``aaaabbbb``

2. Merging of {a, b} into {C} starts. The blocks of {a, b} are merged into a buffer, starting with the first ones. When blocks are used completely, they're cleared in the main array. Here, {b} is first in the sort, so blocks of {b} are cleared first.  
``aaaa__bb``

3. The merged run data {C} is written into the space left by cleared {b} blocks. The location indices are stored in a locations array. Here, locations[i] is the index where the ith output data block can be found.  
``aaaaCC_b``  
``45______``

4. Here, {a, b} are fully merged into {C} and the locations are stored. If another merge is done, blocks are used from their current positions.  
``CCCCCCCC``  
``45670123``

5. Once all data has been merged, the blocks are sorted according to their location indices.

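Step 5 can be sketched as a permutation applied one cycle at a time, with a single block of scratch space. This follows the same idea as `FinalBlockSorting` in shelfsort.cpp but is rewritten with `std::vector` for readability; the function name `SortBlocksByLocation` is an assumption for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of step 5, assuming locations[i] is the block that currently holds
// the i-th block of sorted output. Blocks are moved along each cycle of the
// permutation, using one block of scratch space.
void SortBlocksByLocation(std::vector<int64_t>& data, std::vector<uint32_t>& locations, std::size_t block_size) {
	std::vector<int64_t> scratch(block_size);
	auto block = [&](std::size_t b) { return data.begin() + b * block_size; };
	for (std::size_t b = 0; b < locations.size(); b++) {
		std::size_t from = locations[b];
		if (from == b) { continue; }  // block b is already in place
		std::copy(block(b), block(b) + block_size, scratch.begin());  // free up slot b
		std::size_t empty = b;
		while (from != b) {
			std::copy(block(from), block(from) + block_size, block(empty));
			locations[empty] = empty;  // this slot now holds its final block
			empty = from;
			from = locations[from];
		}
		std::copy(scratch.begin(), scratch.end(), block(empty));  // close the cycle
		locations[empty] = empty;
	}
}
```

With the locations from the example above (``45670123``), every block sits in a two-block cycle, so each block is simply swapped with the one four positions away.
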
#### How fast is Shelfsort?
For large arrays, compared to std::stable_sort with unlimited memory, this implementation takes:
- ~1.1x the time for random inputs
- ~0.1x the time for sorted inputs

[Here are some times](times.md) I got from running the included code.

#### Why name it "Shelfsort"?
The block locations are fixed while elements move between them, like books being moved between shelves. Also, "block sort" was taken.

#### Why a new sorting algorithm?
Most sorting algorithms are either unstable, slow, or require O(N) memory. There are merge sorts using binary search to partition blocks to reduce memory consumption, such as [blitsort](https://github.com/scandum/blitsort/), but this is a different approach that doesn't use binary search.

#### No, I meant, why did you work on this?
- It's an intellectual challenge that's "pure" in a way that I like.
- It's a way to test yourself against some of the best minds of history.
- If you can develop something situationally useful, you can leave some small mark on the world.
- I'm currently unemployed, and the weather's been bad, so I had some free time.
- It will probably get HR people to not throw out my resume. <- this is sarcasm

#### How optimized is this implementation?
It's written in low-level C++ and does some branchless operations, but apart from that, it's not very optimized. Things I *haven't* done include:
- trying initial sorts for small segments that handle >4 elements
- looking at the assembly code
- manual vectorization

Micro-optimization isn't my specialty or my interest, but feel free to try doing some.

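For readers unfamiliar with the term, a branchless merge picks the next element and advances the read positions using arithmetic on a comparison result instead of an `if`. The sketch below follows the same back-to-front pattern as `MergePair` in shelfsort.cpp, rewritten with `std::vector`; the function name `MergeBackward` is an assumption for this example, and whether the generated code is actually branch-free depends on the compiler and flags.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: stable merge of two sorted runs, filling the output from the back.
std::vector<int64_t> MergeBackward(const std::vector<int64_t>& left, const std::vector<int64_t>& right) {
	std::vector<int64_t> out(left.size() + right.size());
	std::ptrdiff_t i_1 = (std::ptrdiff_t)left.size() - 1;
	std::ptrdiff_t i_2 = (std::ptrdiff_t)right.size() - 1;
	std::ptrdiff_t i = (std::ptrdiff_t)out.size() - 1;
	while (i_1 >= 0 && i_2 >= 0) {
		int64_t x = left[i_1];
		int64_t y = right[i_2];
		bool take_left = y < x;        // on ties, the right run's element goes later (keeps the merge stable)
		out[i--] = take_left ? x : y;  // a select rather than an if/else
		i_1 -= take_left;              // exactly one of the two read positions moves
		i_2 -= !take_left;
	}
	// one run is exhausted; copy whatever remains of the other
	while (i_1 >= 0) { out[i--] = left[i_1--]; }
	while (i_2 >= 0) { out[i--] = right[i_2--]; }
	return out;
}
```

The loop body has no data-dependent branch besides the loop condition itself, which is the property the README refers to.
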
#### What limitations does this implementation have?
- To simplify implementation and discourage people from using this code in production, it currently only works on power-of-2 array sizes.
- It hasn't been tested thoroughly, and could have bugs on some inputs.

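Because of the first limitation, a caller would need to check the array length before sorting. A minimal sketch of such a check, assuming a helper named `IsPowerOfTwo` (not part of the included code):

```cpp
#include <cstdint>

// Hypothetical helper: true when n has exactly one bit set.
// n & (n - 1) clears the lowest set bit, so the result is 0 only for powers of 2.
bool IsPowerOfTwo(uint32_t n) {
	return n != 0 && (n & (n - 1)) == 0;
}
```

For example, `IsPowerOfTwo(1u << 16)` is true and `IsPowerOfTwo(100000)` is false. Note that the included test program only exercises sizes from 2^16 to 2^24.
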
#### Have you made other open-source stuff?
I also programmed:
- a [fork of a roguelike game](https://github.com/b-crawl/bcrawl/)
- a [proof of concept](https://github.com/bhauth/multilevel-ternary-hash-table) for a new hashtable algorithm

#### How can I contact you?
b αt bhauth dot com


--------------------------------------------------------------------------------
/shelfsort.cpp:
--------------------------------------------------------------------------------
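#include <iostream>
#include <time.h>
#include <algorithm>
#include <cstdint>  // int64_t, uint32_t
#include <cstdlib>  // malloc, free, rand, srand
#include <cstring>  // memcpy

// copies back to front; arguments are parenthesized so expressions can be passed safely
#define COPY(dest,src,elements) for (int copy_index = (elements) - 1; copy_index >= 0; copy_index--) { (dest)[copy_index] = (src)[copy_index]; }

#define ELEMENT int64_t
#define LESSEQ(a,b) ((a) <= (b))
#define LESS(a,b) ((a) < (b))
#define SMALLSORT_SIZE 4
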
// sorts SMALLSORT_SIZE (4) elements in place using selects instead of swaps
inline void SmallSort(ELEMENT* arr) {
	auto a = arr[0];
	auto b = arr[1];
	auto c = arr[2];
	auto d = arr[3];

	// order each pair: a2 <= b2 and c2 <= d2
	bool less1 = LESS(b,a);
	auto a2 = less1 ? b : a;
	auto b2 = less1 ? a : b;
	bool less2 = LESS(d,c);
	auto c2 = less2 ? d : c;
	auto d2 = less2 ? c : d;

	// the two pairs are already in order
	if (LESSEQ(b2, c2)) {
		arr[0] = a2;
		arr[1] = b2;
		arr[2] = c2;
		arr[3] = d2;
		return;
		}

	// otherwise merge the two sorted pairs
	bool x = LESSEQ(a2, c2);
	auto b3 = x ? c2 : a2;
	bool y = LESSEQ(b2, d2);
	auto c3 = y ? b2 : d2;
	bool z = LESSEQ(a2, d2);
	arr[0] = x ? a2 : c2;
	arr[1] = z ? b3 : c3;
	arr[2] = z ? c3 : b3;
	arr[3] = y ? d2 : b2;
	}

// merges two sorted runs of (n + 1) elements each into output, working from the back
void MergePair(ELEMENT* p1, ELEMENT* p2, ELEMENT* output, int n) { // n = size-1
	int i_1 = n;
	int i_2 = n;

	for (int i = n * 2 + 1; i_1 >= 0 && i_2 >= 0; i--) {
		auto x = p1[i_1];
		auto y = p2[i_2];
		bool higher = LESS(y, x);
		bool lower = !higher;
		i_1 -= higher;
		i_2 -= lower;
		output[i] = higher ? x : y;
		}

	// one run is exhausted; the rest of the other goes to the front of output
	if (i_1 >= 0) { COPY(output, p1, (i_1 + 1)); }
	else { COPY(output, p2, (i_2 + 1)); }
	}


// Merges two adjacent runs of blocks, working from the back.
// indices[k] locates the k-th block of run 1 (relative to start_p);
// indices[block_count_1 + k] locates the k-th block of run 2 (relative to the start of run 2).
// indices_out receives the locations of the merged run's blocks, relative to start_p.
void BlockMerge(ELEMENT* start_p, ELEMENT* scratch, uint32_t* indices, uint32_t* indices_out, int block_count_1, int block_count_2, int block_size) {
	int index_index_1 = block_count_1 - 1;
	int index_index_2 = block_count_2 - 1;
	int block_id_1 = indices[index_index_1];
	int block_id_2 = indices[block_count_1 + index_index_2];
	ELEMENT* p1 = &start_p[block_id_1 * block_size];
	ELEMENT* p2 = &start_p[(block_count_1 + block_id_2) * block_size];

	// merge first blocks into scratch
	ELEMENT* output = scratch;
	int output_block_counter = block_count_1 + block_count_2 - 2;
	int clear_block_id = 0;
	int next_clear_block_id = 0;

	int i = block_size * 2 - 1;  // scratch is 2 blocks in size
	int i_1 = block_size - 1;
	int i_2 = i_1;

	// adaptive sort: when in order, just adjust indices
	// p1 already points at the last block of run 1, so its last element is the run's maximum
	ELEMENT last_of_first = p1[block_size - 1];
	ELEMENT first_of_last = start_p[(block_count_1 + indices[block_count_1]) * block_size];
	if (LESSEQ(last_of_first, first_of_last)) {
		for (int k = 0; k < block_count_1; k++) {
			indices_out[k] = indices[k];
			}
		for (int k = block_count_1; k < block_count_1 + block_count_2; k++) {
			indices_out[k] = indices[k] + block_count_1;
			}
		return;
		}

block_merging:
		while (i_1 >= 0 && i_2 >= 0 && i >= 0) {
			auto x = p1[i_1];
			auto y = p2[i_2];
			bool higher = LESS(y, x);
			bool lower = !higher;
			output[i] = higher ? x : y;
			i_1 -= higher;
			i_2 -= lower;
			i--;
			}

		// output block is full: continue in the most recently cleared block
		if (i < 0) {
			output = &start_p[next_clear_block_id * block_size];
			output_block_counter--;
			indices_out[output_block_counter] = next_clear_block_id;
			i = block_size - 1;
			}
		// current block of run 1 is used up
		if (i_1 < 0) {
			next_clear_block_id = block_id_1;
			index_index_1--;
			if (index_index_1 < 0) {
				goto finish_left;
				}
			block_id_1 = indices[index_index_1];
			p1 = &start_p[block_id_1 * block_size];
			i_1 = block_size - 1;
			}
		// current block of run 2 is used up
		if (i_2 < 0) {
			next_clear_block_id = block_count_1 + block_id_2;
			index_index_2--;
			if (index_index_2 < 0) {
				goto finish_right;
				}
			block_id_2 = indices[block_count_1 + index_index_2];
			p2 = &start_p[(block_count_1 + block_id_2) * block_size];
			i_2 = block_size - 1;
			}

		goto block_merging;


// run 1 is exhausted: copy the rest of run 2
finish_left:
		while (i_2 >= 0 && i >= 0) {
			output[i] = p2[i_2];
			i_2--; i--;
			}

		// here the output block and the input block always run out together
		if (i < 0) {
			clear_block_id = next_clear_block_id;
			output = &start_p[next_clear_block_id * block_size];
			output_block_counter--;
			// record the new output block only if another block of run 2 will be copied into it;
			// otherwise it receives the first half of scratch in unload_scratch
			if (index_index_2 > 0) {
				indices_out[output_block_counter] = next_clear_block_id;
				}
			i = block_size - 1;
			}
		if (i_2 < 0) {
			next_clear_block_id = block_count_1 + block_id_2;
			index_index_2--;
			if (index_index_2 < 0) {
				goto unload_scratch;
				}
			block_id_2 = indices[block_count_1 + index_index_2];
			p2 = &start_p[(block_count_1 + block_id_2) * block_size];
			i_2 = block_size - 1;
			}

		goto finish_left;


// run 2 is exhausted: copy the rest of run 1
finish_right:
		while (i_1 >= 0 && i >= 0) {
			output[i] = p1[i_1];
			i_1--; i--;
			}

		if (i < 0) {
			clear_block_id = next_clear_block_id;
			output = &start_p[next_clear_block_id * block_size];
			output_block_counter--;
			// same rule as in finish_left, for run 1
			if (index_index_1 > 0) {
				indices_out[output_block_counter] = next_clear_block_id;
				}
			i = block_size - 1;
			}
		if (i_1 < 0) {
			next_clear_block_id = block_id_1;
			index_index_1--;
			if (index_index_1 < 0) {
				goto unload_scratch;
				}
			block_id_1 = indices[index_index_1];
			p1 = &start_p[block_id_1 * block_size];
			i_1 = block_size - 1;
			}

		goto finish_right;


// the two highest output blocks are still in scratch; move them into the last two cleared blocks
unload_scratch: {
		int block_bytes = block_size * sizeof(ELEMENT);
		memcpy(output, scratch, block_bytes);
		memcpy(&start_p[next_clear_block_id * block_size], &scratch[block_size], block_bytes);
		int last = block_count_1 + block_count_2 - 1;
		indices_out[last - 1] = clear_block_id;
		indices_out[last] = next_clear_block_id;
		}
	}


// moves every block to its final position, following each cycle of the index permutation
void FinalBlockSorting(ELEMENT* p1, ELEMENT* scratch, uint32_t* indices, int blocks, int block_size) {
	for (int b = 0; b < blocks; b++) {
		int ix = indices[b];
		if (ix != b) {
			int block_bytes = block_size * sizeof(ELEMENT);
			memcpy(scratch, &p1[b * block_size], block_bytes);  // free up slot b
			int empty_block = b;

			while (ix != b) {
				memcpy(&p1[empty_block * block_size], &p1[ix * block_size], block_bytes);
				indices[empty_block] = empty_block;
				empty_block = ix;
				ix = indices[ix];
				}
			memcpy(&p1[empty_block * block_size], scratch, block_bytes);
			indices[empty_block] = empty_block;
			}
		}
	}


// size must be a power of 2 (see README); the included test uses 2^16 to 2^24
void ShelfSort(ELEMENT* arr, unsigned int size) {
	// determine memory size
	unsigned int v = size;
	unsigned int log_size = 0;
	while (v >>= 1) { log_size++; }
	unsigned int scratch_size = 1 << (2 + (log_size + 1) / 2);

	// allocate memory: scratch, followed by two index arrays
	// (the allocation also covers 2 * scratch_size elements, which the last
	// pingpong pass can use before the indices are initialized)
	char* allocated_memory = reinterpret_cast<char*> (malloc(scratch_size *
			(sizeof(ELEMENT)
			+ std::max(sizeof(ELEMENT), 2 * sizeof(uint32_t)))));
	ELEMENT* scratch = reinterpret_cast<ELEMENT*> (allocated_memory);
	uint32_t* indices_a = reinterpret_cast<uint32_t*> (&allocated_memory[scratch_size * (sizeof(ELEMENT))]);
	uint32_t* indices_b = &indices_a[scratch_size];

	// test other small sorts?
	for (unsigned int i = 0; i < size; i += SMALLSORT_SIZE) {
		SmallSort(&arr[i]);
		}
	unsigned int sorted_zone_size = SMALLSORT_SIZE;

	// pingpong quad merge: each pass merges four runs into one (array -> scratch -> array)
	while (sorted_zone_size <= (scratch_size / 2)) {
		unsigned int run_len = sorted_zone_size;
		sorted_zone_size *= 2;
		for (unsigned int i = 0; i < size; i += sorted_zone_size * 2) {
			ELEMENT* p1 = &arr[i];
			ELEMENT* p2 = &arr[i + run_len];
			bool less1 = LESSEQ(p1[run_len-1], p2[0]);
			if (!less1) {  // skip if already sorted
				MergePair(p1, p2, scratch, run_len-1);
				}

			ELEMENT* p3 = &p2[run_len];
			ELEMENT* p4 = &p3[run_len];
			ELEMENT* scratch2 = &scratch[sorted_zone_size];
			bool less2 = LESSEQ(p3[run_len-1], p4[0]);
			if (!less2) {
				MergePair(p3, p4, scratch2, run_len-1);
				}

			// any pair that was skipped still has to be copied into scratch
			if (less1 || less2) {
				if (!less1) {COPY(scratch2, p3, sorted_zone_size);}
				else if (!less2) {COPY(scratch, p1, sorted_zone_size);}
				else {
					bool less3 = LESSEQ(p1[sorted_zone_size-1], p3[0]);
					if (less3) { continue; } else {  // all four runs already in order
						COPY(scratch, p1, sorted_zone_size * 2);
						}
					}
				}

			MergePair(scratch, scratch2, p1, sorted_zone_size-1);
			}
		sorted_zone_size *= 2;
		}

	// initialize block indices
	unsigned int block_size = scratch_size / 2;
	int total_blocks = size / block_size;
	int blocks_per_run = sorted_zone_size / block_size;
	for (int i = 0; i < total_blocks; i += blocks_per_run)
		for (int j = 0; j < blocks_per_run; j += 1)
			indices_a[i + j] = j;

	// do the block sorting runs
	while (sorted_zone_size <= (size / 2)) {
		int blocks1 = (sorted_zone_size / block_size);
		int blocks2 = blocks1;
		sorted_zone_size *= 2;

		for (int i = 0; i < total_blocks; i += (blocks1 + blocks2)) {
			BlockMerge(&arr[i * block_size], scratch, &indices_a[i], &indices_b[i], blocks1, blocks2, block_size);
			}

		std::swap(indices_a, indices_b);
		}

	FinalBlockSorting(arr, scratch, indices_a, (sorted_zone_size / block_size), block_size);
	free(allocated_memory);
	}

#define OUT(data) std::cout << data

double TimedTest_Shelfsort(ELEMENT* test_data, int test_size, int cycles) {
	double startTime = (double)clock() / CLOCKS_PER_SEC;
	for (int i=0; i<cycles; i++) {
		ShelfSort(&test_data[i * test_size], test_size);
		}
	double endTime = (double)clock() / CLOCKS_PER_SEC;
	return endTime - startTime;
	}

double TimedTest_stable_sort(ELEMENT* test_data, int test_size, int cycles) {
	double startTime = (double)clock() / CLOCKS_PER_SEC;
	for (int i=0; i<cycles; i++) {
		std::stable_sort(&test_data[i * test_size], &test_data[(i + 1) * test_size]);
		}
	double endTime = (double)clock() / CLOCKS_PER_SEC;
	return endTime - startTime;
	}

// prints a time in ns/item with two decimal places
void WriteTime(double time, int total_size) {
	double ns = (time * 1000000000) / total_size;
	int hundredths = ((int)(ns * 100)) % 100;
	OUT((int)ns); OUT(".");
	if (hundredths < 10) { OUT("0"); }  // zero-pad so 24.08 isn't printed as 24.8
	OUT(hundredths);
	}

void RunTest(int log_size, int total_size) {
	int test_size = 1 << log_size;
	total_size = std::max(test_size, total_size);
	int cycles = total_size / test_size;
	total_size = cycles * test_size;
	ELEMENT* test_data = reinterpret_cast<ELEMENT*> (malloc(total_size * (sizeof(ELEMENT))));
	ELEMENT* test_data2 = reinterpret_cast<ELEMENT*> (malloc(total_size * (sizeof(ELEMENT))));

	OUT("beginning test: 2^"); OUT(log_size); OUT(" size blocks, ");
	OUT(cycles);  OUT(" cycles\n");

	srand((unsigned int)time(0));
	// fill the data for every cycle, not only the first test_size elements
	for (int i = 0; i < total_size; i++) {
		ELEMENT x = rand();
		test_data[i] = x;
		test_data2[i] = x;
		}

	// the second call of each pair sorts the already-sorted output of the first
	double time1 = TimedTest_Shelfsort(test_data, test_size, cycles);
	double time1_sorted = -1;
	time1_sorted = TimedTest_Shelfsort(test_data, test_size, cycles);

	double time_std = TimedTest_stable_sort(test_data2, test_size, cycles);
	double time_std_sorted = -1;
	time_std_sorted = TimedTest_stable_sort(test_data2, test_size, cycles);

	bool correct_sort = true;
	for (int i=0; i < total_size; i++) {
		if (test_data[i] != test_data2[i]) {
			correct_sort = false;
			}
		}

	OUT("shelfsort time: "); WriteTime(time1, total_size);
	OUT(" / presorted: "); WriteTime(time1_sorted, total_size); OUT("\n");
	OUT("std::stable_sort time: "); WriteTime(time_std, total_size);
	OUT(" / presorted: "); WriteTime(time_std_sorted, total_size); OUT("\n");
	OUT("matching sort = "); OUT(correct_sort); OUT("\n\n");

	free(test_data);
	free(test_data2);
}

int main() {
	OUT("Shelfsort speed test\n");
	OUT("times are in ns/item\n\n");
	int total_size = 1 << 22;
	for (int size = 16; size <= 24; size++)
		{ RunTest(size, total_size); }
}


--------------------------------------------------------------------------------
/times.md:
--------------------------------------------------------------------------------
Shelfsort speed test  
times are in ns/item

**test: 2^16 size blocks, 64 cycles**  
shelfsort time: 5.24 / presorted: 1.90  
std::stable_sort time: 12.39 / presorted: 9.29

**test: 2^17 size blocks, 32 cycles**  
shelfsort time: 6.67 / presorted: 1.90  
std::stable_sort time: 15.97 / presorted: 12.15

**test: 2^18 size blocks, 16 cycles**  
shelfsort time: 9.53 / presorted: 1.90  
std::stable_sort time: 18.83 / presorted: 12.63

**test: 2^19 size blocks, 8 cycles**  
shelfsort time: 15.25 / presorted: 2.14  
std::stable_sort time: 24.08 / presorted: 14.06

**test: 2^20 size blocks, 4 cycles**  
shelfsort time: 28.37 / presorted: 2.14  
std::stable_sort time: 33.85 / presorted: 15.49

**test: 2^21 size blocks, 2 cycles**  
shelfsort time: 51.02 / presorted: 2.14  
std::stable_sort time: 53.64 / presorted: 19.07

**test: 2^22 size blocks, 1 cycles**  
shelfsort time: 100.13 / presorted: 2.14  
std::stable_sort time: 88.69 / presorted: 21.21

**test: 2^23 size blocks, 1 cycles**  
shelfsort time: 103.35 / presorted: 2.26  
std::stable_sort time: 92.02 / presorted: 23.00

**test: 2^24 size blocks, 1 cycles**  
shelfsort time: 108.65 / presorted: 2.26  
std::stable_sort time: 92.26 / presorted: 23.54


--------------------------------------------------------------------------------