├── AESRand_Paper ├── .gitignore ├── README.md ├── Makefile ├── Makefile.power9 ├── Makefile.arm ├── UnitTest.cpp ├── AESRand.h ├── test.c ├── AESRand.hpp └── AESRand.cpp ├── AESRand ├── AESRand │ ├── pch.h │ ├── pch.cpp │ ├── AESRand.cpp │ ├── AESRand.vcxproj.filters │ ├── others.cpp │ └── AESRand.vcxproj ├── FloatTest │ ├── pch.h │ ├── pch.cpp │ ├── FloatTest.cpp │ ├── FloatTest.vcxproj.filters │ └── FloatTest.vcxproj ├── IntegerRangeTest │ ├── pch.h │ ├── pch.cpp │ ├── IntegerRangeTest.cpp │ ├── IntegerRangeTest.vcxproj.filters │ └── IntegerRangeTest.vcxproj └── AESRand.sln ├── AESRand_Linux ├── .gitignore ├── Makefile ├── AESRand.cpp ├── AESRand_BigCrush.cpp ├── README └── AESRand_BigCrush2.cpp ├── LICENSE ├── BenchmarkResults.md ├── .gitignore ├── PractRand.md └── README.md /AESRand_Paper/.gitignore: -------------------------------------------------------------------------------- 1 | AESRand 2 | AESRand.o 3 | AESRand.s 4 | *.swp 5 | UnitTest 6 | -------------------------------------------------------------------------------- /AESRand/AESRand/pch.h: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/AESRand/pch.h -------------------------------------------------------------------------------- /AESRand/AESRand/pch.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/AESRand/pch.cpp -------------------------------------------------------------------------------- /AESRand/FloatTest/pch.h: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/FloatTest/pch.h -------------------------------------------------------------------------------- /AESRand_Linux/.gitignore: -------------------------------------------------------------------------------- 1 | AESRand_BigCrush 2 | 
AESRand_BigCrush2 3 | AESRand 4 | AESRand.s 5 | *.swp 6 | -------------------------------------------------------------------------------- /AESRand/FloatTest/pch.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/FloatTest/pch.cpp -------------------------------------------------------------------------------- /AESRand/AESRand/AESRand.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/AESRand/AESRand.cpp -------------------------------------------------------------------------------- /AESRand/FloatTest/FloatTest.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/FloatTest/FloatTest.cpp -------------------------------------------------------------------------------- /AESRand/IntegerRangeTest/pch.h: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/IntegerRangeTest/pch.h -------------------------------------------------------------------------------- /AESRand/IntegerRangeTest/pch.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/IntegerRangeTest/pch.cpp -------------------------------------------------------------------------------- /AESRand/IntegerRangeTest/IntegerRangeTest.cpp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dragontamer/AESRand/HEAD/AESRand/IntegerRangeTest/IntegerRangeTest.cpp -------------------------------------------------------------------------------- /AESRand_Paper/README.md: -------------------------------------------------------------------------------- 1 | While the other 
directories were made for testing or prototyping, this directory will be the final 2 | version for the expected paper. 3 | 4 | The hope is for this single directory to be portable across ARM, x86, and Power9. 5 | 6 | -------------------------------------------------------------------------------- /AESRand_Paper/Makefile: -------------------------------------------------------------------------------- 1 | all: AESRand.s AESRand.o UnitTest 2 | 3 | clean: 4 | rm -f AESRand.s AESRand AESRand.o UnitTest 5 | 6 | AESRand.s: AESRand.cpp 7 | g++ -maes -S AESRand.cpp -o AESRand.s 8 | 9 | AESRand.o: AESRand.cpp 10 | g++ -g -c -maes AESRand.cpp -o AESRand.o 11 | 12 | UnitTest: UnitTest.cpp AESRand.o 13 | g++ -g -maes UnitTest.cpp AESRand.o -o UnitTest 14 | -------------------------------------------------------------------------------- /AESRand_Paper/Makefile.power9: -------------------------------------------------------------------------------- 1 | all: AESRand.s AESRand.o UnitTest 2 | 3 | clean: 4 | rm -f AESRand.s AESRand AESRand.o UnitTest 5 | 6 | AESRand.s: AESRand.cpp 7 | g++ -mcrypto -O0 -S AESRand.cpp -o AESRand.s 8 | 9 | AESRand.o: AESRand.cpp 10 | g++ -g -c -mcrypto -O0 AESRand.cpp -o AESRand.o 11 | 12 | UnitTest: UnitTest.cpp AESRand.o 13 | g++ -g -mcrypto -O0 UnitTest.cpp AESRand.o -o UnitTest 14 | -------------------------------------------------------------------------------- /AESRand_Paper/Makefile.arm: -------------------------------------------------------------------------------- 1 | GPP=g++-8.3 2 | 3 | all: AESRand.s AESRand.o UnitTest 4 | 5 | clean: 6 | rm -f AESRand.s AESRand AESRand.o UnitTest 7 | 8 | AESRand.s: AESRand.cpp 9 | $(GPP) -std=c++11 -march=armv8-a+simd+crypto -S AESRand.cpp -o AESRand.s 10 | 11 | AESRand.o: AESRand.cpp 12 | $(GPP) -g -c -std=c++11 -march=armv8-a+simd+crypto AESRand.cpp -o AESRand.o 13 | 14 | UnitTest: UnitTest.cpp AESRand.o 15 | $(GPP) -g -std=c++11 -march=armv8-a+simd+crypto UnitTest.cpp AESRand.o -o UnitTest 16 | 
-------------------------------------------------------------------------------- /AESRand_Linux/Makefile: -------------------------------------------------------------------------------- 1 | all: AESRand.s AESRand AESRand_BigCrush2 2 | 3 | clean: 4 | rm -f AESRand.s AESRand AESRand_BigCrush AESRand_BigCrush2 5 | 6 | AESRand.s: AESRand.cpp 7 | g++ -march=westmere -O2 -S AESRand.cpp -o AESRand.s 8 | 9 | AESRand: AESRand.cpp 10 | g++ -march=westmere -O2 AESRand.cpp -o AESRand 11 | 12 | AESRand_BigCrush: AESRand_BigCrush.cpp 13 | g++ -march=westmere -O2 AESRand_BigCrush.cpp -o AESRand_BigCrush -ltestu01 14 | 15 | AESRand_BigCrush2: AESRand_BigCrush2.cpp 16 | g++ -march=westmere -O2 AESRand_BigCrush2.cpp -o AESRand_BigCrush2 -ltestu01 17 | -------------------------------------------------------------------------------- /AESRand_Paper/UnitTest.cpp: -------------------------------------------------------------------------------- 1 | #include <iostream> 2 | #include "AESRand.h" 3 | #include <cstring> 4 | 5 | int main(){ 6 | simd128 state = AESRand_init(); 7 | AESRand_increment(state); 8 | std::array<uint32_t, 8> ints = AESRand_rand_uint32(state); 9 | std::array<uint32_t, 8> matches = 10 | { 11 | 0x12e826e6, 12 | 0x6c302fd5, 13 | 0x83155f50, 14 | 0xc33a3964, 15 | 0x337eacb1, 16 | 0xe74bf1c4, 17 | 0xbf8be05e, 18 | 0x5068aca6, 19 | }; 20 | 21 | if(memcmp((void*)&ints[0], (void*)&matches[0], sizeof(ints)) == 0){ 22 | std::cout << "Unit Test passed" << std::endl; 23 | } else { 24 | std::cout << "Unit Test failed" << std::endl; 25 | for(int i=0; i<8; i++){ 26 | std::cout << std::hex << ints[i] << " " << matches[i] << std::endl; 27 | } 28 | } 29 | 30 | 31 | } 32 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 dragontamer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to 
deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /AESRand_Linux/AESRand.cpp: -------------------------------------------------------------------------------- 1 | #include <iostream> 2 | #include <array> 3 | #include <immintrin.h> 4 | 5 | 6 | __m128i AESRand_init(){ 7 | return _mm_setzero_si128(); 8 | } 9 | 10 | __m128i increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13, 11 | 0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01); 12 | 13 | void AESRand_increment(__m128i& state){ 14 | state += increment; 15 | } 16 | 17 | std::array<__m128i, 2> AESRand_rand(const __m128i state){ 18 | __m128i penultimate = _mm_aesenc_si128(state, increment); 19 | return {_mm_aesenc_si128(penultimate, increment), _mm_aesdec_si128(penultimate, increment)}; 20 | } 21 | 22 | int main(){ 23 | std::cout << "Running 5-billion iterations (160 Billion-bytes of Random Data)" << std::endl; 24 | __m128i state = AESRand_init(); 25 | __m128i total = _mm_setzero_si128(); 26 | __m128i total2 = _mm_setzero_si128(); 27 | 28 | for(long long i=0; i<5000000000; 
i++){ 29 | AESRand_increment(state); 30 | auto rands = AESRand_rand(state); 31 | total += rands[0]; 32 | total2 += rands[1]; 33 | } 34 | 35 | total += total2; 36 | std::cout << "Dummy print to negate optimizer: " << total[0] << std::endl; 37 | } 38 | -------------------------------------------------------------------------------- /AESRand_Linux/AESRand_BigCrush.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | extern "C"{ 7 | #include 8 | #include 9 | } 10 | 11 | 12 | __m128i AESRand_init(){ 13 | return _mm_setzero_si128(); 14 | } 15 | 16 | __m128i increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13, 17 | 0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01); 18 | 19 | void AESRand_increment(__m128i& state){ 20 | state += increment; 21 | } 22 | 23 | std::array<__m128i, 2> AESRand_rand(const __m128i state){ 24 | __m128i penultimate = _mm_aesenc_si128(state, increment); 25 | return {_mm_aesenc_si128(penultimate, increment), _mm_aesdec_si128(penultimate, increment)}; 26 | } 27 | 28 | __m128i state = _mm_setzero_si128(); 29 | 30 | unsigned int AESRand_gen(void){ 31 | AESRand_increment(state); 32 | auto rands = AESRand_rand(state); 33 | return _mm_extract_epi32(rands[0], 0); 34 | } 35 | 36 | int main(){ 37 | // Thanks to http://www.pcg-random.org/posts/how-to-test-with-testu01.html 38 | unif01_Gen* gen = unif01_CreateExternGenBits("AESRand Bottom32", AESRand_gen); 39 | bbattery_BigCrush(gen); 40 | unif01_DeleteExternGenBits(gen); 41 | return 0; 42 | } 43 | -------------------------------------------------------------------------------- /AESRand/FloatTest/FloatTest.vcxproj.filters: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {4FC737F1-C7A5-4376-A066-2A32D752A2FF} 6 | cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx 7 | 8 | 9 | {93995380-89BD-4b04-88EB-625FBE52EBFB} 10 | h;hh;hpp;hxx;hm;inl;inc;ipp;xsd 11 | 12 | 13 
| {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} 14 | rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms 15 | 16 | 17 | 18 | 19 | Header Files 20 | 21 | 22 | 23 | 24 | Source Files 25 | 26 | 27 | Source Files 28 | 29 | 30 | -------------------------------------------------------------------------------- /AESRand/IntegerRangeTest/IntegerRangeTest.vcxproj.filters: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {4FC737F1-C7A5-4376-A066-2A32D752A2FF} 6 | cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx 7 | 8 | 9 | {93995380-89BD-4b04-88EB-625FBE52EBFB} 10 | h;hh;hpp;hxx;hm;inl;inc;ipp;xsd 11 | 12 | 13 | {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} 14 | rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms 15 | 16 | 17 | 18 | 19 | Header Files 20 | 21 | 22 | 23 | 24 | Source Files 25 | 26 | 27 | Source Files 28 | 29 | 30 | -------------------------------------------------------------------------------- /AESRand_Linux/README: -------------------------------------------------------------------------------- 1 | This Linux version serves as a compact and simplified proof-of-concept 2 | with far less verbosity than the Windows implementation. 3 | 4 | For more notes on why certain values were chosen, read the Windows 5 | implementation. 6 | 7 | This Linux version is designed with maximum compatibility, with 8 | -march=westmere in the Makefile. -march=skylake (or any other AVX 9 | computer) generates cleaner AVX instructions, but doesn't seem 10 | to affect this test very much. Westmere CPUs were first released 11 | on January 7, 2010, so I expect that most people's computers today 12 | would be able to run this RNGAES code. 13 | 14 | Try running with the "time" command. 
An example run on my machine is: 15 | 16 | time ./AESRand 17 | Running 5-billion iterations (160 Billion-bytes of Random Data) 18 | Dummy print to negate optimizer: -535139616294573357 19 | 20 | real 0m4.818s 21 | user 0m4.797s 22 | sys 0m0.000s 23 | 24 | This gives a speed of 1.04 billion iterations/sec, or 33.2 GBps of 25 | random data. My computer varies between 3.4GHz and 4GHz, so the code 26 | runs somewhere between roughly 3.3 cycles per iteration and 3.85 cycles per 27 | iteration. 28 | 29 | -------- 30 | 31 | AESRandGenerator is a 2nd program I wrote to be tested with TestU01's 32 | "BigCrush". It's simply the generator that pipes its output to stdout. 33 | -------------------------------------------------------------------------------- /AESRand_Paper/AESRand.h: -------------------------------------------------------------------------------- 1 | #ifndef AESRAND_H 2 | #define AESRAND_H 3 | 4 | // I expect ifdefs galore in this file 5 | 6 | #if __amd64__ 7 | #include 8 | typedef __m128i simd128; 9 | typedef __m128 simd128_float; 10 | typedef __m128i simd128_uint32; 11 | #endif 12 | 13 | #if _ARCH_PPC64 14 | #include 15 | typedef vector unsigned long long simd128; 16 | typedef vector float simd128_float; 17 | typedef vector unsigned int simd128_uint32; 18 | #endif 19 | 20 | #if __aarch64__ 21 | #include 22 | typedef uint8x16_t simd128; 23 | typedef float32x4_t simd128_float; 24 | typedef uint32x4_t simd128_uint32; 25 | #endif 26 | 27 | #include 28 | #include 29 | 30 | simd128 AESRand_init(); 31 | void AESRand_increment(simd128& state); 32 | std::array AESRand_rand(const simd128 state); 33 | 34 | std::array AESRand_rand_float(const simd128 state); 35 | std::array AESRand_rand_uint32(const simd128 state); 36 | 37 | /* 38 | std::array AESRand_randInt_range16(const simd128 state, uint16_t lower_bound, uint16_t upper_bound); 39 | std::array AESRand_randInt_range32(const simd128 state, uint32_t lower_bound, uint32_t upper_bound); 40 | uint64_t AESRand_randInt_range64(const 
simd128 state, uint64_t lower_bound, uint64_t upper_bound); 41 | */ 42 | 43 | #endif 44 | -------------------------------------------------------------------------------- /AESRand/AESRand/AESRand.vcxproj.filters: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | {4FC737F1-C7A5-4376-A066-2A32D752A2FF} 6 | cpp;c;cc;cxx;def;odl;idl;hpj;bat;asm;asmx 7 | 8 | 9 | {93995380-89BD-4b04-88EB-625FBE52EBFB} 10 | h;hh;hpp;hxx;hm;inl;inc;ipp;xsd 11 | 12 | 13 | {67DA6AB6-F800-4c08-8B7A-83BB121AAD01} 14 | rc;ico;cur;bmp;dlg;rc2;rct;bin;rgs;gif;jpg;jpeg;jpe;resx;tiff;tif;png;wav;mfcribbon-ms 15 | 16 | 17 | 18 | 19 | Header Files 20 | 21 | 22 | 23 | 24 | Source Files 25 | 26 | 27 | Source Files 28 | 29 | 30 | Source Files 31 | 32 | 33 | -------------------------------------------------------------------------------- /AESRand_Linux/AESRand_BigCrush2.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #include 3 | #include 4 | #include 5 | 6 | extern "C"{ 7 | #include 8 | #include 9 | } 10 | 11 | 12 | __m128i AESRand_init(){ 13 | return _mm_setzero_si128(); 14 | } 15 | 16 | __m128i increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13, 17 | 0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01); 18 | 19 | void AESRand_increment(__m128i& state){ 20 | state += increment; 21 | } 22 | 23 | std::array<__m128i, 2> AESRand_rand(const __m128i state){ 24 | __m128i penultimate = _mm_aesenc_si128(state, increment); 25 | return {_mm_aesenc_si128(penultimate, increment), _mm_aesdec_si128(penultimate, increment)}; 26 | } 27 | 28 | __m128i state = _mm_setzero_si128(); 29 | uint32_t buffer[8] __attribute__ ((aligned (16))); 30 | int buffer_state=8; 31 | 32 | // This 2nd test, will test all 8 numbers that comes through 33 | unsigned int AESRand_gen(void){ 34 | if(buffer_state>=8){ 35 | AESRand_increment(state); 36 | auto rands = AESRand_rand(state); 37 | 
_mm_storeu_si128((__m128i*)&buffer[0], rands[0]); 38 | _mm_storeu_si128((__m128i*)&buffer[4], rands[1]); 39 | buffer_state = 0; 40 | } 41 | 42 | return buffer[buffer_state++]; 43 | } 44 | 45 | int main(){ 46 | // Thanks to http://www.pcg-random.org/posts/how-to-test-with-testu01.html 47 | unif01_Gen* gen = unif01_CreateExternGenBits("AESRand All 8xint32", AESRand_gen); 48 | bbattery_BigCrush(gen); 49 | unif01_DeleteExternGenBits(gen); 50 | return 0; 51 | } 52 | -------------------------------------------------------------------------------- /AESRand_Paper/test.c: -------------------------------------------------------------------------------- 1 | #if 0 2 | #include 3 | #include 4 | 5 | void printArray(char array[16]){ 6 | for(int i=0; i<16; i++){ 7 | printf("%02x ", array[i]); 8 | } 9 | printf("\n"); 10 | } 11 | 12 | int main(){ 13 | char array[16]; 14 | vector unsigned long long simd128 = {0, 1}; 15 | memcpy(array, &simd128, 16); 16 | printArray(array); 17 | simd128 = vec_perm(simd128, simd128, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}); 18 | simd128 = __builtin_crypto_vcipher(simd128, (vector unsigned long long){0,0}); 19 | simd128 = vec_perm(simd128, simd128, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}); 20 | memcpy(array, &simd128, 16); 21 | printArray(array); 22 | } 23 | #endif 24 | 25 | #include 26 | #include 27 | #include 28 | 29 | void printArray(uint8_t array[16]){ 30 | for(int i=0; i<16; i++){ 31 | printf("%02x ", array[i]); 32 | } 33 | printf("\n"); 34 | } 35 | 36 | int main(){ 37 | //uint8_t increment[16] = {0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13, 38 | // 0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01}; 39 | uint8_t increment[16] = {0x01, 0x02, 0x03, 0x05, 0x07, 0x0B, 0x0D, 0x11, 40 | 0x13, 0x17, 0x1d, 0x1f, 0x25, 0x29, 0x2b, 0x2f}; 41 | uint8_t array[16] = { 42 | 0, 0, 0, 0, 43 | 0, 0, 0, 0, 44 | 1, 0, 0, 0, 45 | 0, 0, 0, 0, 46 | }; 47 | uint8x16_t simd128 = 
vld1q_u8(array); 48 | printArray(array); 49 | 50 | simd128 = vaesmcq_u8(vaeseq_u8(simd128, vdupq_n_u8(0))); 51 | memcpy(array, &simd128, 16); 52 | printArray(array); 53 | 54 | simd128 ^= vld1q_u8(increment); 55 | memcpy(array, &simd128, 16); 56 | printArray(array); 57 | 58 | /* 59 | simd128 = vec_perm(simd128, simd128, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}); 60 | simd128 = __builtin_crypto_vcipher(simd128, (vector unsigned long long){0,0}); 61 | simd128 = vec_perm(simd128, simd128, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0}); 62 | memcpy(array, &simd128, 16); 63 | printArray(array); 64 | */ 65 | } 66 | -------------------------------------------------------------------------------- /BenchmarkResults.md: -------------------------------------------------------------------------------- 1 | I redid the test while locking my CPU to 3.4 GHz (AMD Zen options: P0 state 3.4 GHz, P1+ was disabled, Zen Core Performance Boost off). I manually verified in Ryzen Master that the CPU remained at 3.4GHz during the test. 2 | 3 | Formula for calculating cycles per iteration: (number of seconds) * 3.4GHz clock speed / (number of iterations). The number of iterations is either 10 billion or 5 billion, depending on how the specific benchmark worked. 4 | 5 | * Single-State AESRand: 3.69 cycles per iteration (32-bytes) 6 | * Double-State AESRand: 2.95 cycles per iteration (32-bytes) 7 | * mt19937: 14.76 cycles per iteration (4-bytes) 8 | * pcg32 (Unrolled): 5.46 cycles per iteration (4-bytes) 9 | * xoshiro256plus (Unrolled): 3.32 cycles per iteration (8-bytes) 10 | * PlusOne XMM-registers: 1.59 cycles per iteration (Dummy Control, like BogoMIPS) 11 | 12 | The per-iteration overheads included in each benchmark are: 13 | 1. The "For" loop: one-add per iteration (i++), and the cmp/jnz instruction (i<=ITERATIONS). 
Unrolling reduced this overhead, but modern CPUs are good at executing the loop-logic in parallel, which mitigates this overhead. 14 | 2. Two SIMD-adds in Single-State AESRand for the "Dummy Print". 15 | 3. Four SIMD-adds for Double-state AESRand 16 | 4. One 32-bit add for mt19937 17 | 5. One 32-bit add for pcg32 18 | 6. One 64-bit add for xoshiro256plus 19 | 20 | Raw Results (AMD Threadripper 1950x locked to 3.4GHz) 21 | =========== 22 | 23 | Beginning Single-state 'serial' test 24 | Total Seconds: 5.4278 25 | GBps: 27.4534 26 | Dummy Benchmark anti-optimizer print: 1706011378085583560 27 | Beginning Parallel (2x) test: instruction-level parallelism 28 | Time: 8.67641 29 | GBps: 34.3487 30 | Dummy Benchmark anti-optimizer print: 1283732354369314394 31 | 32 | Testing mt19937 33 | Time: 21.7154 34 | GBps: 0.857752 35 | Dummy Benchmark anti-optimizer print: 1680273558 36 | 37 | Testing pcg32 Unrolled x4 38 | Time: 8.0232 39 | GBps: 2.32157 40 | Dummy Benchmark anti-optimizer print: 2362602604 41 | 42 | Testing pcg32 43 | Time: 8.22974 44 | GBps: 2.26331 45 | Dummy Benchmark anti-optimizer print: 757965796 46 | 47 | Testing xoshiro256plus Unrolled x4 48 | Time: 4.88474 49 | GBps: 7.62639 50 | Dummy Benchmark anti-optimizer print: 2202972135473059297 51 | 52 | Testing xoshiro256plus 53 | Time: 5.09052 54 | GBps: 7.31809 55 | Dummy Benchmark anti-optimizer print: 5290432412060736627 56 | 57 | Beginning PlusOne XMM-registers Test 58 | Total Seconds: 2.3376 59 | GBps: 63.7456 60 | Dummy Benchmark anti-optimizer print: 6553255931290448384 -------------------------------------------------------------------------------- /AESRand/AESRand.sln: -------------------------------------------------------------------------------- 1 | 2 | Microsoft Visual Studio Solution File, Format Version 12.00 3 | # Visual Studio 15 4 | VisualStudioVersion = 15.0.28010.2003 5 | MinimumVisualStudioVersion = 10.0.40219.1 6 | Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "AESRand", 
"AESRand\AESRand.vcxproj", "{F91B1300-34D7-459B-B40C-3479AF111436}" 7 | EndProject 8 | Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "FloatTest", "FloatTest\FloatTest.vcxproj", "{F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}" 9 | EndProject 10 | Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "IntegerRangeTest", "IntegerRangeTest\IntegerRangeTest.vcxproj", "{14562E75-9BAB-4663-BFE3-51C96298EC81}" 11 | EndProject 12 | Global 13 | GlobalSection(SolutionConfigurationPlatforms) = preSolution 14 | Debug|x64 = Debug|x64 15 | Debug|x86 = Debug|x86 16 | Release|x64 = Release|x64 17 | Release|x86 = Release|x86 18 | EndGlobalSection 19 | GlobalSection(ProjectConfigurationPlatforms) = postSolution 20 | {F91B1300-34D7-459B-B40C-3479AF111436}.Debug|x64.ActiveCfg = Debug|x64 21 | {F91B1300-34D7-459B-B40C-3479AF111436}.Debug|x64.Build.0 = Debug|x64 22 | {F91B1300-34D7-459B-B40C-3479AF111436}.Debug|x86.ActiveCfg = Debug|Win32 23 | {F91B1300-34D7-459B-B40C-3479AF111436}.Debug|x86.Build.0 = Debug|Win32 24 | {F91B1300-34D7-459B-B40C-3479AF111436}.Release|x64.ActiveCfg = Release|x64 25 | {F91B1300-34D7-459B-B40C-3479AF111436}.Release|x64.Build.0 = Release|x64 26 | {F91B1300-34D7-459B-B40C-3479AF111436}.Release|x86.ActiveCfg = Release|Win32 27 | {F91B1300-34D7-459B-B40C-3479AF111436}.Release|x86.Build.0 = Release|Win32 28 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Debug|x64.ActiveCfg = Debug|x64 29 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Debug|x64.Build.0 = Debug|x64 30 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Debug|x86.ActiveCfg = Debug|Win32 31 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Debug|x86.Build.0 = Debug|Win32 32 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Release|x64.ActiveCfg = Release|x64 33 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Release|x64.Build.0 = Release|x64 34 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Release|x86.ActiveCfg = Release|Win32 35 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35}.Release|x86.Build.0 = Release|Win32 36 | 
{14562E75-9BAB-4663-BFE3-51C96298EC81}.Debug|x64.ActiveCfg = Debug|x64 37 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Debug|x64.Build.0 = Debug|x64 38 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Debug|x86.ActiveCfg = Debug|Win32 39 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Debug|x86.Build.0 = Debug|Win32 40 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Release|x64.ActiveCfg = Release|x64 41 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Release|x64.Build.0 = Release|x64 42 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Release|x86.ActiveCfg = Release|Win32 43 | {14562E75-9BAB-4663-BFE3-51C96298EC81}.Release|x86.Build.0 = Release|Win32 44 | EndGlobalSection 45 | GlobalSection(SolutionProperties) = preSolution 46 | HideSolutionNode = FALSE 47 | EndGlobalSection 48 | GlobalSection(ExtensibilityGlobals) = postSolution 49 | SolutionGuid = {B018B1BB-6411-45E9-B712-56A5A37D7ACA} 50 | EndGlobalSection 51 | EndGlobal 52 | -------------------------------------------------------------------------------- /AESRand/AESRand/others.cpp: -------------------------------------------------------------------------------- 1 | #include "pch.h" 2 | 3 | // Sorry for the mess. I copy/pasted http://vigna.di.unimi.it/xorshift/xoshiro256plus.c and 4 | // http://www.pcg-random.org/download.html#minimal-c-implementation into this file. 5 | 6 | 7 | #include 8 | 9 | typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t; 10 | 11 | uint32_t pcg32_random_r(pcg32_random_t* rng) 12 | { 13 | uint64_t oldstate = rng->state; 14 | // Advance internal state 15 | rng->state = oldstate * 6364136223846793005ULL + (rng->inc | 1); 16 | // Calculate output function (XSH RR), uses old state for max ILP 17 | uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u; 18 | uint32_t rot = oldstate >> 59u; 19 | return (xorshifted >> rot) | (xorshifted << ((-static_cast(rot)) & 31)); 20 | } 21 | 22 | #include 23 | 24 | /* This is xoshiro256+ 1.0, our best and fastest generator for floating-point 25 | numbers. 
We suggest to use its upper bits for floating-point 26 | generation, as it is slightly faster than xoshiro256**. It passes all 27 | tests we are aware of except for the lowest three bits, which might 28 | fail linearity tests (and just those), so if low linear complexity is 29 | not considered an issue (as it is usually the case) it can be used to 30 | generate 64-bit outputs, too. 31 | 32 | We suggest to use a sign test to extract a random Boolean value, and 33 | right shifts to extract subsets of bits. 34 | 35 | The state must be seeded so that it is not everywhere zero. If you have 36 | a 64-bit seed, we suggest to seed a splitmix64 generator and use its 37 | output to fill s. */ 38 | 39 | 40 | static inline uint64_t rotl(const uint64_t x, int k) { 41 | return (x << k) | (x >> (64 - k)); 42 | } 43 | 44 | 45 | uint64_t s[4]; 46 | 47 | uint64_t next(void) { 48 | const uint64_t result_plus = s[0] + s[3]; 49 | 50 | const uint64_t t = s[1] << 17; 51 | 52 | s[2] ^= s[0]; 53 | s[3] ^= s[1]; 54 | s[1] ^= s[2]; 55 | s[0] ^= s[3]; 56 | 57 | s[2] ^= t; 58 | 59 | s[3] = rotl(s[3], 45); 60 | 61 | return result_plus; 62 | } 63 | 64 | 65 | /* This is the jump function for the generator. It is equivalent 66 | to 2^128 calls to next(); it can be used to generate 2^128 67 | non-overlapping subsequences for parallel computations. */ 68 | 69 | void jump(void) { 70 | static const uint64_t JUMP[] = { 0x180ec6d33cfd0aba, 0xd5a61266f0c9392c, 0xa9582618e03fc9aa, 0x39abdc4529b1661c }; 71 | 72 | uint64_t s0 = 0; 73 | uint64_t s1 = 0; 74 | uint64_t s2 = 0; 75 | uint64_t s3 = 0; 76 | for (int i = 0; i < sizeof JUMP / sizeof *JUMP; i++) 77 | for (int b = 0; b < 64; b++) { 78 | if (JUMP[i] & UINT64_C(1) << b) { 79 | s0 ^= s[0]; 80 | s1 ^= s[1]; 81 | s2 ^= s[2]; 82 | s3 ^= s[3]; 83 | } 84 | next(); 85 | } 86 | 87 | s[0] = s0; 88 | s[1] = s1; 89 | s[2] = s2; 90 | s[3] = s3; 91 | } 92 | 93 | 94 | /* This is the long-jump function for the generator. 
It is equivalent to 95 | 2^192 calls to next(); it can be used to generate 2^64 starting points, 96 | from each of which jump() will generate 2^64 non-overlapping 97 | subsequences for parallel distributed computations. */ 98 | 99 | void long_jump(void) { 100 | static const uint64_t LONG_JUMP[] = { 0x76e15d3efefdcbbf, 0xc5004e441c522fb3, 0x77710069854ee241, 0x39109bb02acbe635 }; 101 | 102 | uint64_t s0 = 0; 103 | uint64_t s1 = 0; 104 | uint64_t s2 = 0; 105 | uint64_t s3 = 0; 106 | for (int i = 0; i < sizeof LONG_JUMP / sizeof *LONG_JUMP; i++) 107 | for (int b = 0; b < 64; b++) { 108 | if (LONG_JUMP[i] & UINT64_C(1) << b) { 109 | s0 ^= s[0]; 110 | s1 ^= s[1]; 111 | s2 ^= s[2]; 112 | s3 ^= s[3]; 113 | } 114 | next(); 115 | } 116 | 117 | s[0] = s0; 118 | s[1] = s1; 119 | s[2] = s2; 120 | s[3] = s3; 121 | } 122 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | ## Ignore Visual Studio temporary files, build results, and 2 | ## files generated by popular Visual Studio add-ons. 
3 | ## 4 | ## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore 5 | 6 | # User-specific files 7 | *.suo 8 | *.user 9 | *.userosscache 10 | *.sln.docstates 11 | 12 | # User-specific files (MonoDevelop/Xamarin Studio) 13 | *.userprefs 14 | 15 | # Build results 16 | [Dd]ebug/ 17 | [Dd]ebugPublic/ 18 | [Rr]elease/ 19 | [Rr]eleases/ 20 | x64/ 21 | x86/ 22 | bld/ 23 | [Bb]in/ 24 | [Oo]bj/ 25 | [Ll]og/ 26 | 27 | # Visual Studio 2015/2017 cache/options directory 28 | .vs/ 29 | # Uncomment if you have tasks that create the project's static files in wwwroot 30 | #wwwroot/ 31 | 32 | # Visual Studio 2017 auto generated files 33 | Generated\ Files/ 34 | 35 | # MSTest test Results 36 | [Tt]est[Rr]esult*/ 37 | [Bb]uild[Ll]og.* 38 | 39 | # NUNIT 40 | *.VisualState.xml 41 | TestResult.xml 42 | 43 | # Build Results of an ATL Project 44 | [Dd]ebugPS/ 45 | [Rr]eleasePS/ 46 | dlldata.c 47 | 48 | # Benchmark Results 49 | BenchmarkDotNet.Artifacts/ 50 | 51 | # .NET Core 52 | project.lock.json 53 | project.fragment.lock.json 54 | artifacts/ 55 | **/Properties/launchSettings.json 56 | 57 | # StyleCop 58 | StyleCopReport.xml 59 | 60 | # Files built by Visual Studio 61 | *_i.c 62 | *_p.c 63 | *_i.h 64 | *.ilk 65 | *.meta 66 | *.obj 67 | *.iobj 68 | *.pch 69 | *.pdb 70 | *.ipdb 71 | *.pgc 72 | *.pgd 73 | *.rsp 74 | *.sbr 75 | *.tlb 76 | *.tli 77 | *.tlh 78 | *.tmp 79 | *.tmp_proj 80 | *.log 81 | *.vspscc 82 | *.vssscc 83 | .builds 84 | *.pidb 85 | *.svclog 86 | *.scc 87 | 88 | # Chutzpah Test files 89 | _Chutzpah* 90 | 91 | # Visual C++ cache files 92 | ipch/ 93 | *.aps 94 | *.ncb 95 | *.opendb 96 | *.opensdf 97 | *.sdf 98 | *.cachefile 99 | *.VC.db 100 | *.VC.VC.opendb 101 | 102 | # Visual Studio profiler 103 | *.psess 104 | *.vsp 105 | *.vspx 106 | *.sap 107 | 108 | # Visual Studio Trace Files 109 | *.e2e 110 | 111 | # TFS 2012 Local Workspace 112 | $tf/ 113 | 114 | # Guidance Automation Toolkit 115 | *.gpState 116 | 117 | # ReSharper is a .NET coding 
add-in 118 | _ReSharper*/ 119 | *.[Rr]e[Ss]harper 120 | *.DotSettings.user 121 | 122 | # JustCode is a .NET coding add-in 123 | .JustCode 124 | 125 | # TeamCity is a build add-in 126 | _TeamCity* 127 | 128 | # DotCover is a Code Coverage Tool 129 | *.dotCover 130 | 131 | # AxoCover is a Code Coverage Tool 132 | .axoCover/* 133 | !.axoCover/settings.json 134 | 135 | # Visual Studio code coverage results 136 | *.coverage 137 | *.coveragexml 138 | 139 | # NCrunch 140 | _NCrunch_* 141 | .*crunch*.local.xml 142 | nCrunchTemp_* 143 | 144 | # MightyMoose 145 | *.mm.* 146 | AutoTest.Net/ 147 | 148 | # Web workbench (sass) 149 | .sass-cache/ 150 | 151 | # Installshield output folder 152 | [Ee]xpress/ 153 | 154 | # DocProject is a documentation generator add-in 155 | DocProject/buildhelp/ 156 | DocProject/Help/*.HxT 157 | DocProject/Help/*.HxC 158 | DocProject/Help/*.hhc 159 | DocProject/Help/*.hhk 160 | DocProject/Help/*.hhp 161 | DocProject/Help/Html2 162 | DocProject/Help/html 163 | 164 | # Click-Once directory 165 | publish/ 166 | 167 | # Publish Web Output 168 | *.[Pp]ublish.xml 169 | *.azurePubxml 170 | # Note: Comment the next line if you want to checkin your web deploy settings, 171 | # but database connection strings (with potential passwords) will be unencrypted 172 | *.pubxml 173 | *.publishproj 174 | 175 | # Microsoft Azure Web App publish settings. Comment the next line if you want to 176 | # checkin your Azure Web App publish settings, but sensitive information contained 177 | # in these scripts will be unencrypted 178 | PublishScripts/ 179 | 180 | # NuGet Packages 181 | *.nupkg 182 | # The packages folder can be ignored because of Package Restore 183 | **/[Pp]ackages/* 184 | # except build/, which is used as an MSBuild target. 
185 | !**/[Pp]ackages/build/ 186 | # Uncomment if necessary however generally it will be regenerated when needed 187 | #!**/[Pp]ackages/repositories.config 188 | # NuGet v3's project.json files produces more ignorable files 189 | *.nuget.props 190 | *.nuget.targets 191 | 192 | # Microsoft Azure Build Output 193 | csx/ 194 | *.build.csdef 195 | 196 | # Microsoft Azure Emulator 197 | ecf/ 198 | rcf/ 199 | 200 | # Windows Store app package directories and files 201 | AppPackages/ 202 | BundleArtifacts/ 203 | Package.StoreAssociation.xml 204 | _pkginfo.txt 205 | *.appx 206 | 207 | # Visual Studio cache files 208 | # files ending in .cache can be ignored 209 | *.[Cc]ache 210 | # but keep track of directories ending in .cache 211 | !*.[Cc]ache/ 212 | 213 | # Others 214 | ClientBin/ 215 | ~$* 216 | *~ 217 | *.dbmdl 218 | *.dbproj.schemaview 219 | *.jfm 220 | *.pfx 221 | *.publishsettings 222 | orleans.codegen.cs 223 | 224 | # Including strong name files can present a security risk 225 | # (https://github.com/github/gitignore/pull/2483#issue-259490424) 226 | #*.snk 227 | 228 | # Since there are multiple workflows, uncomment next line to ignore bower_components 229 | # (https://github.com/github/gitignore/pull/1529#issuecomment-104372622) 230 | #bower_components/ 231 | 232 | # RIA/Silverlight projects 233 | Generated_Code/ 234 | 235 | # Backup & report files from converting an old project file 236 | # to a newer Visual Studio version. 
Backup files are not needed, 237 | # because we have git ;-) 238 | _UpgradeReport_Files/ 239 | Backup*/ 240 | UpgradeLog*.XML 241 | UpgradeLog*.htm 242 | ServiceFabricBackup/ 243 | *.rptproj.bak 244 | 245 | # SQL Server files 246 | *.mdf 247 | *.ldf 248 | *.ndf 249 | 250 | # Business Intelligence projects 251 | *.rdl.data 252 | *.bim.layout 253 | *.bim_*.settings 254 | *.rptproj.rsuser 255 | 256 | # Microsoft Fakes 257 | FakesAssemblies/ 258 | 259 | # GhostDoc plugin setting file 260 | *.GhostDoc.xml 261 | 262 | # Node.js Tools for Visual Studio 263 | .ntvs_analysis.dat 264 | node_modules/ 265 | 266 | # Visual Studio 6 build log 267 | *.plg 268 | 269 | # Visual Studio 6 workspace options file 270 | *.opt 271 | 272 | # Visual Studio 6 auto-generated workspace file (contains which files were open etc.) 273 | *.vbw 274 | 275 | # Visual Studio LightSwitch build output 276 | **/*.HTMLClient/GeneratedArtifacts 277 | **/*.DesktopClient/GeneratedArtifacts 278 | **/*.DesktopClient/ModelManifest.xml 279 | **/*.Server/GeneratedArtifacts 280 | **/*.Server/ModelManifest.xml 281 | _Pvt_Extensions 282 | 283 | # Paket dependency manager 284 | .paket/paket.exe 285 | paket-files/ 286 | 287 | # FAKE - F# Make 288 | .fake/ 289 | 290 | # JetBrains Rider 291 | .idea/ 292 | *.sln.iml 293 | 294 | # CodeRush 295 | .cr/ 296 | 297 | # Python Tools for Visual Studio (PTVS) 298 | __pycache__/ 299 | *.pyc 300 | 301 | # Cake - Uncomment if you are using it 302 | # tools/** 303 | # !tools/packages.config 304 | 305 | # Tabs Studio 306 | *.tss 307 | 308 | # Telerik's JustMock configuration file 309 | *.jmconfig 310 | 311 | # BizTalk build output 312 | *.btp.cs 313 | *.btm.cs 314 | *.odx.cs 315 | *.xsd.cs 316 | 317 | # OpenCover UI analysis results 318 | OpenCover/ 319 | 320 | # Azure Stream Analytics local run output 321 | ASALocalRun/ 322 | 323 | # MSBuild Binary and Structured Log 324 | *.binlog 325 | 326 | # NVidia Nsight GPU debugger configuration file 327 | *.nvuser 328 | 329 | # MFractors 
(Xamarin productivity tool) working folder
.mfractor/

--------------------------------------------------------------------------------
/AESRand_Paper/AESRand.hpp:
--------------------------------------------------------------------------------
#ifdef __amd64__

simd128 AESRand_init(){
    return _mm_setzero_si128();
}

static simd128 increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13,
    0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01);

void AESRand_increment(simd128& state){
    state += increment;
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 penultimate = _mm_aesenc_si128(state, increment);
    return {_mm_aesenc_si128(penultimate, increment), _mm_aesdec_si128(penultimate, increment)};
}

static __m128 toFloats(__m128i input){
    // Clear the sign and exponent bits, keeping the 23 mantissa bits
    __m128i isolate = _mm_andnot_si128(_mm_set1_epi32(0xff800000), input);

    // 0x3f800000 is the magic number representing floating point 1.0
    __m128i addExponent = _mm_or_si128(_mm_set1_epi32(0x3f800000), isolate);

    // Numbers are now in [1.0, 2.0)
    __m128 one = _mm_set1_ps(1.0);

    // Result is now in [0, 1), but only 23 of the 32 random bits are used.
    // We return here; the disabled block below tries to regain the 9 unused bits.
    __m128 fastResult = _mm_sub_ps(_mm_castsi128_ps(addExponent), one);

    return fastResult;

#if 0
    // This code takes the 9-unused bits and uses them in the bottom sometimes. This
    // may add extra bits of precision in some cases at the cost of possibly returning
    // a denormal.
    __m128i unused9bits = _mm_and_si128(_mm_set1_epi32(0xff800000), input);
    unused9bits = _mm_srli_epi32(unused9bits, 23);

    //Doing an _mm_xor_ps with those 9-bits results in a NAN error. Do xors
    // in the integer domain, then convert back.

    return _mm_xor_ps(fastResult, _mm_castsi128_ps(unused9bits));
#endif
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return toReturn;
}

std::array<float, 8> AESRand_rand_float(const simd128 state){
    auto rands = AESRand_rand(state);
    __m128 simd0 = toFloats(rands[0]);
    __m128 simd1 = toFloats(rands[1]);

    std::array<float, 8> toReturn;
    _mm_storeu_ps(&toReturn[0], simd0);
    _mm_storeu_ps(&toReturn[4], simd1);
    return toReturn;
}

#endif //amd64

#ifdef _ARCH_PPC64

// PPC Intrinsics defined in "64-bit ELF V2 ABI Specification", chapter 6 and Appendix A.
// GCC defines the crypto-extension

// PowerPC operates on big-endian FIPS 197 compatible AES-vectors
// Convert to big-endian and back to retain compatibility with AMD64
static simd128 endianConv(simd128 in){
    return vec_perm(in, in, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0});
}

simd128 AESRand_init(){
    return (simd128) {0, 0};
}

static simd128 increment = {0x110d0b0705030201, 0x2f2b29251f1d1713};

void AESRand_increment(simd128& state){
    state += increment;
    //state = vec_add(state, increment);
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 state_endian = endianConv(state);
    simd128 increment_endian = endianConv(increment);
    simd128 penultimate = __builtin_crypto_vcipher(state_endian, increment_endian);
    simd128 first_ret = __builtin_crypto_vcipher(penultimate, increment_endian);
    simd128 second_ret = __builtin_crypto_vncipher(penultimate, (vector unsigned long long) {0,0});

    // Note: this is suboptimal.
    // A "MixColumns" can be applied at compile-time to the
    // increment_endian value to combine this XOR with the above vncipher command. Depends how much
    // we care about optimization...
    second_ret ^= increment_endian;
    return {endianConv(first_ret), endianConv(second_ret)};
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    toReturn[0] = rands[0][0];
    toReturn[1] = rands[0][0] >> 32;
    toReturn[2] = rands[0][1];
    toReturn[3] = rands[0][1] >> 32;
    toReturn[4] = rands[1][0];
    toReturn[5] = rands[1][0] >> 32;
    toReturn[6] = rands[1][1];
    toReturn[7] = rands[1][1] >> 32;
    // _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    // _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return toReturn;
}

#endif

#if __aarch64__

simd128 AESRand_init(){
    // Return an all-zero state without reading an uninitialized vector
    return vdupq_n_u8(0);
}

// Endian is reversed compared to Intel. Completely backwards...
uint8_t increment[16] = {0x01, 0x02, 0x03, 0x05, 0x07, 0x0B, 0x0D, 0x11,
    0x13, 0x17, 0x1d, 0x1f, 0x25, 0x29, 0x2b, 0x2f};

void AESRand_increment(simd128& state){
    simd128 inc = vld1q_u8(increment);
    state = vaddq_u8(state, inc);
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 inc = vld1q_u8(increment);
    simd128 penultimate_intel = vaesmcq_u8(vaeseq_u8(state, vdupq_n_u8(0)));
    simd128 penultimate_arm_enc = vaesmcq_u8(vaeseq_u8(penultimate_intel, (inc)));
    simd128 penultimate_arm_dec = vaesimcq_u8(vaesdq_u8(penultimate_intel, (inc)));
    return {veorq_u8(penultimate_arm_enc, (inc)), veorq_u8(penultimate_arm_dec, inc)};
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    vst1q_u8((uint8_t*) &toReturn[0], rands[0]);
    vst1q_u8((uint8_t*) &toReturn[4], rands[1]);
    // _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    // _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return toReturn;
}

/*
static __m128 toFloats(__m128i input){
    // Clear the sign and exponent bits, keeping the 23 mantissa bits
    __m128i isolate = _mm_andnot_si128(_mm_set1_epi32(0xff800000), input);

    // 0x3f800000 is the magic number representing floating point 1.0
    __m128i addExponent = _mm_or_si128(_mm_set1_epi32(0x3f800000), isolate);

    // Numbers are now in [1.0, 2.0)
    __m128 one = _mm_set1_ps(1.0);

    // Result is now in [0, 1), but only 23 of the 32 random bits are used.
    // We return here; the disabled block below tries to regain the 9 unused bits.
    __m128 fastResult = _mm_sub_ps(_mm_castsi128_ps(addExponent), one);

    return fastResult;

#if 0
    // This code takes the 9-unused bits and uses them in the bottom sometimes.
    // This may add extra bits of precision in some cases at the cost of possibly returning
    // a denormal.
    __m128i unused9bits = _mm_and_si128(_mm_set1_epi32(0xff800000), input);
    unused9bits = _mm_srli_epi32(unused9bits, 23);

    //Doing an _mm_xor_ps with those 9-bits results in a NAN error. Do xors
    // in the integer domain, then convert back.

    return _mm_xor_ps(fastResult, _mm_castsi128_ps(unused9bits));
#endif
}

std::array<float, 8> AESRand_rand_float(const simd128 state){
    auto rands = AESRand_rand(state);
    __m128 simd0 = toFloats(rands[0]);
    __m128 simd1 = toFloats(rands[1]);

    std::array<float, 8> toReturn;
    _mm_storeu_ps(&toReturn[0], simd0);
    _mm_storeu_ps(&toReturn[4], simd1);
    return toReturn;
}
*/
#endif

--------------------------------------------------------------------------------
/AESRand_Paper/AESRand.cpp:
--------------------------------------------------------------------------------
#include "AESRand.h"

#ifdef __amd64__

simd128 AESRand_init(){
    return _mm_setzero_si128();
}

static simd128 increment = _mm_set_epi8(0x2f, 0x2b, 0x29, 0x25, 0x1f, 0x1d, 0x17, 0x13,
    0x11, 0x0D, 0x0B, 0x07, 0x05, 0x03, 0x02, 0x01);

void AESRand_increment(simd128& state){
    state += increment;
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 penultimate = _mm_aesenc_si128(state, increment);
    return {_mm_aesenc_si128(penultimate, increment), _mm_aesdec_si128(penultimate, increment)};
}

static __m128 toFloats(__m128i input){
    // Clear the sign and exponent bits, keeping the 23 mantissa bits
    __m128i isolate = _mm_andnot_si128(_mm_set1_epi32(0xff800000), input);

    // 0x3f800000 is the magic number representing floating point 1.0
    __m128i addExponent = _mm_or_si128(_mm_set1_epi32(0x3f800000), isolate);

    // Numbers are now in [1.0, 2.0)
    __m128 one =
        _mm_set1_ps(1.0);

    // Result is now in [0, 1), but only 23 of the 32 random bits are used.
    // We return here; the disabled block below tries to regain the 9 unused bits.
    __m128 fastResult = _mm_sub_ps(_mm_castsi128_ps(addExponent), one);

    return fastResult;

#if 0
    // This code takes the 9-unused bits and uses them in the bottom sometimes. This
    // may add extra bits of precision in some cases at the cost of possibly returning
    // a denormal.
    __m128i unused9bits = _mm_and_si128(_mm_set1_epi32(0xff800000), input);
    unused9bits = _mm_srli_epi32(unused9bits, 23);

    //Doing an _mm_xor_ps with those 9-bits results in a NAN error. Do xors
    // in the integer domain, then convert back.

    return _mm_xor_ps(fastResult, _mm_castsi128_ps(unused9bits));
#endif
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return toReturn;
}

std::array<float, 8> AESRand_rand_float(const simd128 state){
    auto rands = AESRand_rand(state);
    __m128 simd0 = toFloats(rands[0]);
    __m128 simd1 = toFloats(rands[1]);

    std::array<float, 8> toReturn;
    _mm_storeu_ps(&toReturn[0], simd0);
    _mm_storeu_ps(&toReturn[4], simd1);
    return toReturn;
}

#endif //amd64

#ifdef _ARCH_PPC64

// PPC Intrinsics defined in "64-bit ELF V2 ABI Specification", chapter 6 and Appendix A.
// GCC defines the crypto-extension

// PowerPC operates on big-endian FIPS 197 compatible AES-vectors
// Convert to big-endian and back to retain compatibility with AMD64
static simd128 endianConv(simd128 in){
    return vec_perm(in, in, (vector unsigned char){15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0});
}

simd128 AESRand_init(){
    return (simd128) {0, 0};
}

static simd128 increment = {0x110d0b0705030201, 0x2f2b29251f1d1713};

void AESRand_increment(simd128& state){
    state += increment;
    //state = vec_add(state, increment);
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 state_endian = endianConv(state);
    simd128 increment_endian = endianConv(increment);
    simd128 penultimate = __builtin_crypto_vcipher(state_endian, increment_endian);
    simd128 first_ret = __builtin_crypto_vcipher(penultimate, increment_endian);
    simd128 second_ret = __builtin_crypto_vncipher(penultimate, (vector unsigned long long) {0,0});

    // Note: this is suboptimal. A "MixColumns" can be applied at compile-time to the
    // increment_endian value to combine this XOR with the above vncipher command. Depends how much
    // we care about optimization...
    second_ret ^= increment_endian;
    return {endianConv(first_ret), endianConv(second_ret)};
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    toReturn[0] = rands[0][0];
    toReturn[1] = rands[0][0] >> 32;
    toReturn[2] = rands[0][1];
    toReturn[3] = rands[0][1] >> 32;
    toReturn[4] = rands[1][0];
    toReturn[5] = rands[1][0] >> 32;
    toReturn[6] = rands[1][1];
    toReturn[7] = rands[1][1] >> 32;
    // _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    // _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return toReturn;
}

#endif

#if __aarch64__

simd128 AESRand_init(){
    // Return an all-zero state without reading an uninitialized vector
    return vdupq_n_u8(0);
}

// Endian is reversed compared to Intel. Completely backwards...
uint8_t increment[16] = {0x01, 0x02, 0x03, 0x05, 0x07, 0x0B, 0x0D, 0x11,
    0x13, 0x17, 0x1d, 0x1f, 0x25, 0x29, 0x2b, 0x2f};

void AESRand_increment(simd128& state){
    simd128 inc = vld1q_u8(increment);
    state = vaddq_u8(state, inc);
}

std::array<simd128, 2> AESRand_rand(const simd128 state){
    simd128 inc = vld1q_u8(increment);
    simd128 penultimate_intel = vaesmcq_u8(vaeseq_u8(state, vdupq_n_u8(0)));
    simd128 penultimate_arm_enc = vaesmcq_u8(vaeseq_u8(penultimate_intel, (inc)));
    simd128 penultimate_arm_dec = vaesimcq_u8(vaesdq_u8(penultimate_intel, (inc)));
    return {veorq_u8(penultimate_arm_enc, (inc)), veorq_u8(penultimate_arm_dec, inc)};
}

std::array<uint32_t, 8> AESRand_rand_uint32(const simd128 state){
    auto rands = AESRand_rand(state);

    std::array<uint32_t, 8> toReturn;
    vst1q_u8((uint8_t*) &toReturn[0], rands[0]);
    vst1q_u8((uint8_t*) &toReturn[4], rands[1]);
    // _mm_storeu_si128((__m128i*)&toReturn[0], rands[0]);
    // _mm_storeu_si128((__m128i*)&toReturn[4], rands[1]);
    return
        toReturn;
}

/*
static __m128 toFloats(__m128i input){
    // Clear the sign and exponent bits, keeping the 23 mantissa bits
    __m128i isolate = _mm_andnot_si128(_mm_set1_epi32(0xff800000), input);

    // 0x3f800000 is the magic number representing floating point 1.0
    __m128i addExponent = _mm_or_si128(_mm_set1_epi32(0x3f800000), isolate);

    // Numbers are now in [1.0, 2.0)
    __m128 one = _mm_set1_ps(1.0);

    // Result is now in [0, 1), but only 23 of the 32 random bits are used.
    // We return here; the disabled block below tries to regain the 9 unused bits.
    __m128 fastResult = _mm_sub_ps(_mm_castsi128_ps(addExponent), one);

    return fastResult;

#if 0
    // This code takes the 9-unused bits and uses them in the bottom sometimes. This
    // may add extra bits of precision in some cases at the cost of possibly returning
    // a denormal.
    __m128i unused9bits = _mm_and_si128(_mm_set1_epi32(0xff800000), input);
    unused9bits = _mm_srli_epi32(unused9bits, 23);

    //Doing an _mm_xor_ps with those 9-bits results in a NAN error. Do xors
    // in the integer domain, then convert back.

    return _mm_xor_ps(fastResult, _mm_castsi128_ps(unused9bits));
#endif
}

std::array<float, 8> AESRand_rand_float(const simd128 state){
    auto rands = AESRand_rand(state);
    __m128 simd0 = toFloats(rands[0]);
    __m128 simd1 = toFloats(rands[1]);

    std::array<float, 8> toReturn;
    _mm_storeu_ps(&toReturn[0], simd0);
    _mm_storeu_ps(&toReturn[4], simd1);
    return toReturn;
}
*/
#endif

--------------------------------------------------------------------------------
/PractRand.md:
--------------------------------------------------------------------------------
Initial Results from PractRand

Preliminary PractRand Results on AESRand_increment
------------

RNG_test using PractRand version 0.94
RNG = RNG_stdin, seed = unknown
test set = core, folding = standard(unknown format)

rng=RNG_stdin, seed=unknown
length= 256 megabytes (2^28 bytes), time= 2.1 seconds
no anomalies in 213 test result(s)

rng=RNG_stdin, seed=unknown
length= 512 megabytes (2^29 bytes), time= 4.1 seconds
no anomalies in 229 test result(s)

rng=RNG_stdin, seed=unknown
length= 1 gigabyte (2^30 bytes), time= 7.5 seconds
no anomalies in 248 test result(s)

rng=RNG_stdin, seed=unknown
length= 2 gigabytes (2^31 bytes), time= 14.2 seconds
no anomalies in 266 test result(s)

rng=RNG_stdin, seed=unknown
length= 4 gigabytes (2^32 bytes), time= 28.0 seconds
no anomalies in 282 test result(s)

rng=RNG_stdin, seed=unknown
length= 8 gigabytes (2^33 bytes), time= 53.5 seconds
no anomalies in 299 test result(s)

rng=RNG_stdin, seed=unknown
length= 16 gigabytes (2^34 bytes), time= 108 seconds
no anomalies in 315 test result(s)

rng=RNG_stdin, seed=unknown
length= 32 gigabytes (2^35 bytes), time= 208 seconds
no anomalies in 328 test result(s)

rng=RNG_stdin,
seed=unknown 43 | length= 64 gigabytes (2^36 bytes), time= 437 seconds 44 | no anomalies in 344 test result(s) 45 | 46 | rng=RNG_stdin, seed=unknown 47 | length= 128 gigabytes (2^37 bytes), time= 844 seconds 48 | no anomalies in 359 test result(s) 49 | 50 | rng=RNG_stdin, seed=unknown 51 | length= 256 gigabytes (2^38 bytes), time= 1620 seconds 52 | no anomalies in 372 test result(s) 53 | 54 | rng=RNG_stdin, seed=unknown 55 | length= 512 gigabytes (2^39 bytes), time= 3484 seconds 56 | no anomalies in 387 test result(s) 57 | 58 | rng=RNG_stdin, seed=unknown 59 | length= 1 terabyte (2^40 bytes), time= 7024 seconds 60 | no anomalies in 401 test result(s) 61 | 62 | rng=RNG_stdin, seed=unknown 63 | length= 2 terabytes (2^41 bytes), time= 13255 seconds 64 | no anomalies in 413 test result(s) 65 | 66 | rng=RNG_stdin, seed=unknown 67 | length= 4 terabytes (2^42 bytes), time= 27845 seconds 68 | no anomalies in 426 test result(s) 69 | 70 | rng=RNG_stdin, seed=unknown 71 | length= 8 terabytes (2^43 bytes), time= 56894 seconds 72 | no anomalies in 438 test result(s) 73 | 74 | Preliminary PractRand Results on AESRand_parallelStream "Plus Pi" 75 | ------------------------------------------------------- 76 | 77 | This version uses: 78 | 79 | __m128i AESRand_parallelStream(__m128i originalStream) { 80 | __m128i copy = originalStream; 81 | copy.m128i_u64[1] += 0x3141592653589793; 82 | return copy; 83 | } 84 | 85 | One "unusual" result at 4GB of test, but not unusual enough 86 | to fail PractRand's default settings. 
87 | 88 | RNG_test using PractRand version 0.94 89 | RNG = RNG_stdin, seed = unknown 90 | test set = core, folding = standard(unknown format) 91 | 92 | rng=RNG_stdin, seed=unknown 93 | length= 256 megabytes (2^28 bytes), time= 2.3 seconds 94 | no anomalies in 213 test result(s) 95 | 96 | rng=RNG_stdin, seed=unknown 97 | length= 512 megabytes (2^29 bytes), time= 4.4 seconds 98 | no anomalies in 229 test result(s) 99 | 100 | rng=RNG_stdin, seed=unknown 101 | length= 1 gigabyte (2^30 bytes), time= 8.0 seconds 102 | no anomalies in 248 test result(s) 103 | 104 | rng=RNG_stdin, seed=unknown 105 | length= 2 gigabytes (2^31 bytes), time= 15.0 seconds 106 | no anomalies in 266 test result(s) 107 | 108 | rng=RNG_stdin, seed=unknown 109 | length= 4 gigabytes (2^32 bytes), time= 28.3 seconds 110 | Test Name Raw Processed Evaluation 111 | BCFN(2+2,13-0,T) R= +8.2 p = 6.6e-4 unusual 112 | ...and 281 test result(s) without anomalies 113 | 114 | rng=RNG_stdin, seed=unknown 115 | length= 8 gigabytes (2^33 bytes), time= 57.6 seconds 116 | no anomalies in 299 test result(s) 117 | 118 | rng=RNG_stdin, seed=unknown 119 | length= 16 gigabytes (2^34 bytes), time= 117 seconds 120 | no anomalies in 315 test result(s) 121 | 122 | rng=RNG_stdin, seed=unknown 123 | length= 32 gigabytes (2^35 bytes), time= 220 seconds 124 | no anomalies in 328 test result(s) 125 | 126 | rng=RNG_stdin, seed=unknown 127 | length= 64 gigabytes (2^36 bytes), time= 461 seconds 128 | no anomalies in 344 test result(s) 129 | 130 | rng=RNG_stdin, seed=unknown 131 | length= 128 gigabytes (2^37 bytes), time= 900 seconds 132 | no anomalies in 359 test result(s) 133 | 134 | rng=RNG_stdin, seed=unknown 135 | length= 256 gigabytes (2^38 bytes), time= 1733 seconds 136 | no anomalies in 372 test result(s) 137 | 138 | rng=RNG_stdin, seed=unknown 139 | length= 512 gigabytes (2^39 bytes), time= 3646 seconds 140 | no anomalies in 387 test result(s) 141 | 142 | rng=RNG_stdin, seed=unknown 143 | length= 1 terabyte (2^40 bytes), 
time= 7479 seconds 144 | no anomalies in 401 test result(s) 145 | 146 | rng=RNG_stdin, seed=unknown 147 | length= 2 terabytes (2^41 bytes), time= 14248 seconds 148 | no anomalies in 413 test result(s) 149 | 150 | rng=RNG_stdin, seed=unknown 151 | length= 4 terabytes (2^42 bytes), time= 29950 seconds 152 | no anomalies in 426 test result(s) 153 | 154 | rng=RNG_stdin, seed=unknown 155 | length= 8 terabytes (2^43 bytes), time= 60801 seconds 156 | Test Name Raw Processed Evaluation 157 | BRank(12):64K(1) R= +1078 p~= 1.1e-325 FAIL !!!!!! 158 | ...and 437 test result(s) without anomalies 159 | 160 | Preliminary PractRand Results on AESRand_parallelStream Knuth LCGRNG 161 | ---------------- 162 | 163 | RNG_test using PractRand version 0.94 164 | RNG = RNG_stdin, seed = unknown 165 | test set = core, folding = standard(unknown format) 166 | 167 | rng=RNG_stdin, seed=unknown 168 | length= 256 megabytes (2^28 bytes), time= 2.1 seconds 169 | no anomalies in 213 test result(s) 170 | 171 | rng=RNG_stdin, seed=unknown 172 | length= 512 megabytes (2^29 bytes), time= 3.9 seconds 173 | no anomalies in 229 test result(s) 174 | 175 | rng=RNG_stdin, seed=unknown 176 | length= 1 gigabyte (2^30 bytes), time= 7.2 seconds 177 | no anomalies in 248 test result(s) 178 | 179 | rng=RNG_stdin, seed=unknown 180 | length= 2 gigabytes (2^31 bytes), time= 13.6 seconds 181 | no anomalies in 266 test result(s) 182 | 183 | rng=RNG_stdin, seed=unknown 184 | length= 4 gigabytes (2^32 bytes), time= 25.4 seconds 185 | no anomalies in 282 test result(s) 186 | 187 | rng=RNG_stdin, seed=unknown 188 | length= 8 gigabytes (2^33 bytes), time= 52.4 seconds 189 | no anomalies in 299 test result(s) 190 | 191 | rng=RNG_stdin, seed=unknown 192 | length= 16 gigabytes (2^34 bytes), time= 109 seconds 193 | no anomalies in 315 test result(s) 194 | 195 | rng=RNG_stdin, seed=unknown 196 | length= 32 gigabytes (2^35 bytes), time= 210 seconds 197 | no anomalies in 328 test result(s) 198 | 199 | rng=RNG_stdin, seed=unknown 
200 | length= 64 gigabytes (2^36 bytes), time= 439 seconds 201 | no anomalies in 344 test result(s) 202 | 203 | rng=RNG_stdin, seed=unknown 204 | length= 128 gigabytes (2^37 bytes), time= 861 seconds 205 | no anomalies in 359 test result(s) 206 | 207 | rng=RNG_stdin, seed=unknown 208 | length= 256 gigabytes (2^38 bytes), time= 1637 seconds 209 | no anomalies in 372 test result(s) 210 | 211 | rng=RNG_stdin, seed=unknown 212 | length= 512 gigabytes (2^39 bytes), time= 3481 seconds 213 | no anomalies in 387 test result(s) 214 | 215 | rng=RNG_stdin, seed=unknown 216 | length= 1 terabyte (2^40 bytes), time= 7016 seconds 217 | no anomalies in 401 test result(s) 218 | 219 | rng=RNG_stdin, seed=unknown 220 | length= 2 terabytes (2^41 bytes), time= 13248 seconds 221 | no anomalies in 413 test result(s) 222 | 223 | rng=RNG_stdin, seed=unknown 224 | length= 4 terabytes (2^42 bytes), time= 27881 seconds 225 | no anomalies in 426 test result(s) 226 | 227 | rng=RNG_stdin, seed=unknown 228 | length= 8 terabytes (2^43 bytes), time= 56968 seconds 229 | Test Name Raw Processed Evaluation 230 | BRank(12):64K(1) R= +1078 p~= 1.1e-325 FAIL !!!!!! 
231 | ...and 437 test result(s) without anomalies 232 | -------------------------------------------------------------------------------- /AESRand/FloatTest/FloatTest.vcxproj: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Debug 6 | Win32 7 | 8 | 9 | Release 10 | Win32 11 | 12 | 13 | Debug 14 | x64 15 | 16 | 17 | Release 18 | x64 19 | 20 | 21 | 22 | 15.0 23 | {F86DAFE3-4A80-4F98-B2BF-63D36B17BA35} 24 | Win32Proj 25 | FloatTest 26 | 10.0.17134.0 27 | 28 | 29 | 30 | Application 31 | true 32 | v141 33 | Unicode 34 | 35 | 36 | Application 37 | false 38 | v141 39 | true 40 | Unicode 41 | 42 | 43 | Application 44 | true 45 | v141 46 | Unicode 47 | 48 | 49 | Application 50 | false 51 | v141 52 | true 53 | Unicode 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | false 75 | 76 | 77 | true 78 | 79 | 80 | true 81 | 82 | 83 | false 84 | 85 | 86 | 87 | Use 88 | Level3 89 | MaxSpeed 90 | true 91 | true 92 | true 93 | NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 94 | true 95 | pch.h 96 | 97 | 98 | Console 99 | true 100 | true 101 | true 102 | 103 | 104 | 105 | 106 | Use 107 | Level3 108 | Disabled 109 | true 110 | WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) 111 | true 112 | pch.h 113 | 114 | 115 | Console 116 | true 117 | 118 | 119 | 120 | 121 | Use 122 | Level3 123 | Disabled 124 | true 125 | _DEBUG;_CONSOLE;%(PreprocessorDefinitions) 126 | true 127 | pch.h 128 | 129 | 130 | Console 131 | true 132 | 133 | 134 | 135 | 136 | Use 137 | Level3 138 | MaxSpeed 139 | true 140 | true 141 | true 142 | WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 143 | true 144 | pch.h 145 | 146 | 147 | Console 148 | true 149 | true 150 | true 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | Create 160 | Create 161 | Create 162 | Create 163 | 164 | 165 | 166 | 167 | 168 | -------------------------------------------------------------------------------- 
/AESRand/IntegerRangeTest/IntegerRangeTest.vcxproj: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Debug 6 | Win32 7 | 8 | 9 | Release 10 | Win32 11 | 12 | 13 | Debug 14 | x64 15 | 16 | 17 | Release 18 | x64 19 | 20 | 21 | 22 | 15.0 23 | {14562E75-9BAB-4663-BFE3-51C96298EC81} 24 | Win32Proj 25 | IntegerRangeTest 26 | 10.0.17134.0 27 | 28 | 29 | 30 | Application 31 | true 32 | v141 33 | Unicode 34 | 35 | 36 | Application 37 | false 38 | v141 39 | true 40 | Unicode 41 | 42 | 43 | Application 44 | true 45 | v141 46 | Unicode 47 | 48 | 49 | Application 50 | false 51 | v141 52 | true 53 | Unicode 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | false 75 | 76 | 77 | true 78 | 79 | 80 | true 81 | 82 | 83 | false 84 | 85 | 86 | 87 | Use 88 | Level3 89 | MaxSpeed 90 | true 91 | true 92 | true 93 | NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 94 | true 95 | pch.h 96 | stdcpp17 97 | 98 | 99 | Console 100 | true 101 | true 102 | true 103 | 104 | 105 | 106 | 107 | Use 108 | Level3 109 | Disabled 110 | true 111 | WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) 112 | true 113 | pch.h 114 | 115 | 116 | Console 117 | true 118 | 119 | 120 | 121 | 122 | Use 123 | Level3 124 | Disabled 125 | true 126 | _DEBUG;_CONSOLE;%(PreprocessorDefinitions) 127 | true 128 | pch.h 129 | stdcpp17 130 | 131 | 132 | Console 133 | true 134 | 135 | 136 | 137 | 138 | Use 139 | Level3 140 | MaxSpeed 141 | true 142 | true 143 | true 144 | WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 145 | true 146 | pch.h 147 | 148 | 149 | Console 150 | true 151 | true 152 | true 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | Create 162 | Create 163 | Create 164 | Create 165 | 166 | 167 | 168 | 169 | 170 | -------------------------------------------------------------------------------- /AESRand/AESRand/AESRand.vcxproj: -------------------------------------------------------------------------------- 1 | 2 
| 3 | 4 | 5 | Debug 6 | Win32 7 | 8 | 9 | Release 10 | Win32 11 | 12 | 13 | Debug 14 | x64 15 | 16 | 17 | Release 18 | x64 19 | 20 | 21 | 22 | 15.0 23 | {F91B1300-34D7-459B-B40C-3479AF111436} 24 | Win32Proj 25 | AESRand 26 | 10.0.17134.0 27 | 28 | 29 | 30 | Application 31 | true 32 | v141 33 | Unicode 34 | 35 | 36 | Application 37 | false 38 | v141 39 | true 40 | Unicode 41 | 42 | 43 | Application 44 | true 45 | v141 46 | Unicode 47 | 48 | 49 | Application 50 | false 51 | v141 52 | true 53 | Unicode 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | true 75 | 76 | 77 | true 78 | 79 | 80 | false 81 | 82 | 83 | false 84 | 85 | 86 | 87 | Use 88 | Level3 89 | Disabled 90 | true 91 | WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions) 92 | true 93 | pch.h 94 | 95 | 96 | Console 97 | true 98 | 99 | 100 | 101 | 102 | Use 103 | Level3 104 | Disabled 105 | true 106 | _DEBUG;_CONSOLE;%(PreprocessorDefinitions) 107 | true 108 | pch.h 109 | AdvancedVectorExtensions2 110 | All 111 | 112 | 113 | Console 114 | true 115 | 116 | 117 | 118 | 119 | Use 120 | Level3 121 | MaxSpeed 122 | true 123 | true 124 | true 125 | WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 126 | true 127 | pch.h 128 | 129 | 130 | Console 131 | true 132 | true 133 | true 134 | 135 | 136 | 137 | 138 | Use 139 | Level3 140 | MaxSpeed 141 | true 142 | true 143 | true 144 | NDEBUG;_CONSOLE;%(PreprocessorDefinitions) 145 | true 146 | pch.h 147 | AdvancedVectorExtensions2 148 | All 149 | 150 | 151 | Console 152 | true 153 | true 154 | true 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | Create 165 | Create 166 | Create 167 | Create 168 | 169 | 170 | 171 | 172 | 173 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # AESRand 2 | A Prototype implementation of Pseudo-RNG based on hardware-accelerated AES instructions and 
128-bit SIMD 3 | 4 | TL;DR 5 | -------- 6 | * State: 128 bits (One XMM register) 7 | * 256-bits / 32-bytes generated per iteration 8 | * Incredible speed: roughly 3.7 CPU cycles per iteration 9 | * Cycle Length: 2^64 10 | * Independent Streams: 2^64 11 | * Core RNG passes 8TB+ tests. Parallel-stream generator fails at 8TB of PractRand (passes 4TB) 12 | * Tested: ~29.2 GBps (Gigabytes per second) single-thread / single-core. 13 | * Dual-stream version achieves 37.1 GBps 14 | * A throughput of ~8.5 Bytes per cycle. Or roughly 3.73 cycles per 256-bit iteration. 15 | * Faster than xoshiro256plus, pcg32, and std::mt19937 16 | 17 | The shortest sample code is in the [Simplified Linux Version](AESRand_Linux/AESRand.cpp). 18 | Commentary is provided in this README, as well as through many comments in the Windows version. 19 | The Windows version contains a self-contained benchmark to compare against the speed of 20 | xoshiro256, pcg32, and std::mt19937. 21 | 22 | Design Principles 23 | ------- 24 | 25 | 1. AESRound-based. Every x86 CPU since Intel Westmere (2010) and AMD Bulldozer (2011) can execute 26 | not only a singular "aesenc" instruction (AES Encode Round)... but they can also 27 | execute them incredibly quickly: at least one per cycle. AMD Ryzen / EPYC CPUs 28 | can even execute them TWICE per cycle if they are independent. With a latency 29 | of roughly 4-cycles on modern Intel Skylake CPUs, the 128-bit AES-encode 30 | instruction is faster than a 64-bit multiply. 31 | 32 | Note: AESRound is implemented on all major CPUs of the modern (Nov 2018) era. 33 | Power9 has the vcipher instruction, which seems to be identical to the x86 aesenc 34 | instruction. ARM unfortunately plays a bit differently, but a sequence of AESE, 35 | AESMC, and XOR would replicate the x86 "aesenc" instruction. 36 | 37 | AESD, AESIMC, and XOR would together be equivalent to an x86 "aesdec" instruction.
38 | It is time to take advantage of the universal AES-hardware instructions 39 | embedded in all of our CPUs, even our Cell Phones can do this in 2018. 40 | 41 | 2. SIMD-acceleration -- Modern computers are 128-bit, 256-bit, or even 512-bit machines. 42 | Because AES is only defined for 128-bits, I stick with 128-bit. Power9 and ARM machines 43 | also support 128-bit SIMD easily. Future CPUs will probably be more SIMD-heavy. If anyone 44 | can think of how to extend this concept out to 256-bit (YMM) registers and beyond, they 45 | probably can beat the results I have here! 46 | 47 | 3. PCG-random.org "Simple counter" + "Mixer" design -- PCG-random.org has a two-step 48 | RNG process. The "counter" (which was a multiply-based LCGRNG in the pcg32_random_r code), and 49 | a "mixer" (which was a simple shift add xor hash-function). AESRand_increment serves as 50 | the "counter", while AESRand_rand serves as the "mixer". 51 | 52 | 4. Minimum latency on the "Counter" -- The latency of the counter-portion of this RNG 53 | (AESRand_increment) is the absolute limit to the speed of any RNG. If it takes 5-cycles to 54 | update the state, your RNG will take 5-cycles (or more) per iteration. I've minimized 55 | the latency of AESRand_increment to 1-cycle, the absolute minimum latency. 56 | 57 | 5. Instruction level parallelism (ILP) -- All instructions of the "mixer" portion of the RNG 58 | (AESRand_rand) have a throughput of 1-per-cycle or more. AMD Zen can execute two AES 59 | instructions per clock (and thus has a throughput of 2-per-cycle!!). Notice the 60 | signature of AESRand_rand(const \__m128i state). The state MUST be a constant to take 61 | advantage of ILP. Aside from the counter-latency, each iteration i can execute in parallel 62 | with future iterations i+1, i+2, i+3, etc. etc. Modern CPUs are incredibly good at capturing 63 | this parallelism and internally pipelining the AES-instructions of the mixer. 
ILP allows you 64 | to beat the latency-characteristics of your instructions. For example, every iteration 65 | has a latency of 4 cycles per AESENC, or 8-cycles of latency total. However, I've measured 66 | 3.7 cycles per iteration. The magic of ILP makes this possible. 67 | 68 | 6. Full invertibility -- http://www.burtleburtle.net/bob/hash/doobs.html The JOAAT hash has a concept 69 | of a "bit funnel", which is a BAD thing for hashes. If you provably have full-invertibility, it means you 70 | never lose information. It's kind of a hard concept to describe, but it is fundamental to the design 71 | of RNGs, Cryptography, and so forth. The entirety of GF(2) fields are all based around 72 | the concept of invertible operations. The XOR, Add, and AES-encode instructions all have inverses 73 | (XOR, Subtract, and AES-decode respectively), and therefore have the greatest chance of passing 74 | statistical tests... as long as the bits are "shuffled" enough. 75 | 76 | Benchmark Results 77 | -------- 78 | 79 | [Click here](BenchmarkResults.md) for the latest benchmark results. 80 | 81 | This is a very simple timer-based benchmark, where I simply run the various RNGs to be tested 82 | (AESRand, mt19937, pcg32, and xoshiro256plus) in a tight loop of 5-billion iterations. To ensure that 83 | the optimizer does NOT remove the RNG code, I have a "total" value that adds up every output 84 | of the RNG, and eventually prints it out to the screen. 85 | 86 | Before and after the 5-billion long loop, I run Windows' "QueryPerformanceCounter" to log the time. 87 | 88 | I checked the generated assembly (After building in VS2017, check the "AESRand.cod" file). 89 | The "mt19937" code was NOT inlined, which may be a disadvantage and may explain why it's so much slower than 90 | the other RNGs. 91 | 92 | PCG32 and xoshiro256plus were both inlined well. I wasn't sure how well they'd adapt to ILP, so I 93 | created a 4x manually unrolled version for both of them.
The unrolled versions don't seem to be 94 | faster or slower. I admit that I haven't used those RNGs before, so I'm not entirely sure if I've 95 | set up their ideal conditions. 96 | 97 | 98 | Weaknesses and Future Work 99 | ---------------- 100 | AESRand is surprisingly BAD at 1-bit changes. If I changed the increment to a single-bit change 101 | like [0x1, 0, 0, 0, ...], it would take 4, maybe 5 aesenc instructions before the code could get 102 | above 8GB of tests in PractRand. 103 | 104 | I experimented with various other reversible functions documented on Lemire's blog 105 | https://lemire.me/blog/2016/08/09/how-many-reversible-integer-operations-do-you-know/. XOR, Adds, 106 | bitshifts, multiplies-with-odd numbers, and more are all interesting, but the AES-instructions 107 | seemed to mix bits better than any of the primitive instructions. 108 | 109 | The one instruction that holds a lot of promise is PCLMULQDQ (Carry-less Multiply). This is a 110 | 64-bit x 64-bit polynomial multiply on 128-bit XMM registers. Roughly 3 or 4 PCLMULQDQ, along 111 | with some bitshifts and XORs, could implement the 128-bit carryless multiply used in GCM 112 | (Galois/Counter Mode). And this seems to be a very good way to "disperse bits" and create 113 | an avalanche-effect. 114 | 115 | Furthermore, 64-bit carryless multiply is implemented on x86 (PCLMULQDQ), ARMv8 (PMULL and PMULL2 116 | on ARM64, VMULL on ARM32), and Power9 (vpmsumh: Vector Polynomial Multiply-Sum). These instructions serve 117 | as the basis for GCM-mode, Elliptic Curve Cryptography, and other important developments in the modern 118 | cipher world. I expect all future CPUs to have carryless-multiply implemented due to its importance 119 | to the cryptography community. 120 | 121 | However, my Threadripper 1950x appears to run the PCLMULQDQ instruction as microcode, and thus it only has 122 | a throughput of one-PCLMULQDQ every TWO cycles (4x less throughput than AESenc).
In effect, running 123 | aesenc 4x in a row has more throughput, on my machine at least. Intel machines are documented to run 124 | one PCLMULQDQ per cycle, and thus PCLMULQDQ may be a faster base to use on Intel machines. Further investigation 125 | into the relative speeds of these cryptography instructions across the different modern CPUs could be important. 126 | 127 | aesenc has 4 steps: SubBytes, ShiftRows, MixColumns, and XOR Round Key. SubBytes is absolutely excellent for 128 | RNG work. ShiftRows is useful, but only with multiple AES-instructions in a row. MixColumns is unfortunately 129 | only a 32-bit operation, albeit parallel across 4-different 32-bit values. Still, a single aesenc or aesdec 130 | disperses bits across 32-bits of the state. After two rounds of AES, any particular input bit only 131 | affects half of the bits: 64-bits per 128-bit XMM register (or a total of ~128-bits of the 256-bit output). 132 | 133 | So 2-rounds of AES is NOT sufficient to have a proper avalanche (defined as a 50% chance to flip any bit of 134 | the output). I get around the severe 1-bit weakness by ensuring that all 128-bits of state change on every 135 | iteration. 136 | 137 | The "Parallel Stream" generator only changes the top 64-bits of the input. This is the "weak direction" of the 138 | random number generator, which fails after 8TB of testing in PractRand. Nonetheless, the ability to support 139 | parallel streams is important in today's world of highly-parallelized simulations. Passing 4TB of PractRand 140 | means that 34-Billion parallel streams were created, and PractRand was unable to detect 141 | any statistical correlation between their start points. So at least 2^35 high-quality parallel streams are 142 | available to use. 143 | 144 | Thanks and Notes 145 | ------------ 146 | The core algorithm is based on pcg32, documented here: http://www.pcg-random.org/.
The idea to 147 | split "counter" from "mixer" is an incredibly effective design on modern machines with large amounts of 148 | instruction-level parallelism. 149 | 150 | The theory of hashing by Bob Jenkins is what most made me "get" cipher design. Bob Jenkins's 151 | page is absolutely excellent, and his "theory of funnels" put me on the right track. 152 | http://www.burtleburtle.net/bob/hash/doobs.html 153 | 154 | Daniel Lemire's blog is filled to the brim with SIMD tips and tricks. His article here also documents 155 | MANY reversible functions. While none of these reversible operations ended up in this implementation, 156 | the page served as a valuable reference in my experiments. 157 | https://lemire.me/blog/2016/08/09/how-many-reversible-integer-operations-do-you-know/ 158 | 159 | Donald Knuth's "The Art of Computer Programming", volume 2, serves as a great introduction to the 160 | overall theory of RNGs. 161 | 162 | PractRand: http://pracrand.sourceforge.net/ for making an incredibly awesome RNG-testing utility that 163 | actually works on Windows (and works easily!). 164 | 165 | Agner Fog's instruction tables: I was constantly referencing Agner Fog's latency and throughput tables 166 | throughout the coding of this RNG: https://www.agner.org/optimize/ 167 | 168 | PractRand Results 169 | ------------ 170 | 171 | [Click here](PractRand.md) for PractRand results. 172 | 173 | 174 | BigCrush Results 175 | ------------ 176 | 177 | AESRand_Linux contains two BigCrush tests, which require TestU01 in order to be run. The "primary" AESRand generator passes BigCrush through multiple means: reversed bits, forward bits and so forth. TestU01 is limited to 32-bit tests, so it is a bit odd to try to adapt a 256-bit generator like AESRand to TestU01's interface. 178 | --------------------------------------------------------------------------------
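The README's "full invertibility" principle (XOR, add, and multiply-by-odd each have an exact inverse, so no information is lost) can be illustrated with a small portable sketch. This is not AESRand code: the constants and the function names `mix`, `unmix`, and `inverse_of_odd` are hypothetical, chosen only for illustration, and the sketch uses plain 64-bit integers rather than AES instructions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants for illustration only; none are taken from AESRand.
const std::uint64_t kXorKey = 0x9E3779B97F4A7C15ULL;
const std::uint64_t kAddend = 0x0123456789ABCDEFULL;
const std::uint64_t kOddMul = 0xD6E8FEB86659FD93ULL;  // must be odd to be invertible

// Multiplicative inverse of an odd number modulo 2^64, via Newton iteration.
std::uint64_t inverse_of_odd(std::uint64_t a) {
    std::uint64_t inv = a;       // a * a == 1 (mod 8), so 3 low bits are already correct
    for (int i = 0; i < 5; ++i)  // each step doubles the correct bits: 6, 12, 24, 48, 96
        inv *= 2 - a * inv;
    return inv;
}

// Forward step: three invertible operations, so distinct inputs give distinct outputs.
std::uint64_t mix(std::uint64_t x) {
    x ^= kXorKey;
    x += kAddend;
    x *= kOddMul;
    return x;
}

// Inverse step: undo each operation in reverse order.
std::uint64_t unmix(std::uint64_t y) {
    y *= inverse_of_odd(kOddMul);
    y -= kAddend;
    y ^= kXorKey;
    return y;
}

// Round-trip check: unmix must recover the original input exactly.
void check_round_trip(std::uint64_t x) {
    assert(unmix(mix(x)) == x);
}
```

Because every step has an inverse, `mix` is a bijection on 64-bit values and cannot create a "bit funnel". Invertibility alone says nothing about how well the bits are shuffled, which is the separate concern the README addresses with AES rounds.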