├── .travis.yml ├── README.md ├── bitutil.c ├── bitutil.h ├── conf.h ├── makefile ├── makefile.vs ├── sse_neon.h ├── time_.h ├── tpbench.c ├── transpose.c ├── transpose.h └── vs ├── getopt.c ├── getopt.h ├── inttypes.h └── stdint.h /.travis.yml: -------------------------------------------------------------------------------- 1 | language: c 2 | 3 | compiler: 4 | - gcc 5 | - clang 6 | 7 | branches: 8 | only: 9 | - master 10 | 11 | script: 12 | - make 13 | 14 | matrix: 15 | include: 16 | - name: Linux arm 17 | os: linux 18 | arch: arm64 19 | compiler: gcc 20 | 21 | - name: Windows-MinGW 22 | os: windows 23 | script: 24 | - mingw32-make 25 | 26 | - name: macOS, xcode 27 | os: osx 28 | 29 | # - name: Linux amd64 30 | # os: linux 31 | # arch: amd64 32 | # - name: Power ppc64le 33 | # os: linux-ppc64le 34 | # compiler: gcc 35 | 36 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Integer + Floating Point Compression Filter[![Build Status](https://travis-ci.org/powturbo/TurboTranspose.svg?branch=master)](https://travis-ci.org/powturbo/TurboTranspose) 2 | ====================================== 3 | * **Fastest transpose/shuffle** 4 | * :new: (2019.11) **ALL** TurboTranspose functions now available under **64 bits ARMv8** including **NEON** SIMD. 5 | * **Byte/Nibble** transpose/shuffle for improving compression of binary data (ex. floating point data) 6 | * :sparkles: **Scalar/SIMD** Transpose/Shuffle 8,16,32,64,... bits 7 | * :+1: Dynamic CPU detection and **JIT scalar/sse/avx2** switching 8 | * 100% C (C++ headers), usage as simple as memcpy 9 | * **Byte Transpose** 10 | * **Fastest** byte transpose 11 | * :new: (2019.11) **2D,3D,4D** transpose 12 | * **Nibble Transpose** 13 | * nearly as fast as byte transpose 14 | * more efficient, up to **10 times!** faster than [Bitshuffle](#bitshuffle) 15 | * :new: better compression (w/ lz77) and
**10 times!** faster than one of the best floating-point compressors [SPDP](#spdp) 16 | * can compress/decompress (w/ lz77) better and faster than other domain-specific floating-point compressors 17 | * Scalar and SIMD **Transform** 18 | * **Delta** encoding for sorted lists 19 | * **Zigzag** encoding for unsorted lists 20 | * **Xor** encoding 21 | * :new: **lossy** floating-point compression with user-defined error 22 | 23 | ### Transpose Benchmark: 24 | - Benchmark Intel CPU: Skylake i7-6700 3.4GHz, gcc 9.2, **single** thread 25 | - Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz 26 | 27 | #### - Speed test 28 | ##### Benchmark w/ 16k buffer 29 | 30 | **BOLD** = Pareto frontier.
31 | E: Encode, D: Decode
32 | 33 | ./tpbench -s# file -B16K (# = 8,4,2) 34 | |E cycles/byte|D cycles/byte|Transpose 64 bits **AVX2**| 35 | |------:|------:|-----------------------------------| 36 | |.199|**.134**|**TurboTranspose Byte**| 37 | |.326|.201|Blosc byteshuffle| 38 | |**.394**|**.260**|**TurboTranspose Nibble**| 39 | |.848|.478|Bitshuffle 8| 40 | 41 | |E cycles/byte|D cycles/byte|Transpose 32 bits **AVX2**| 42 | |------:|------:|-----------------------------------| 43 | |**.121**|**.102**|**TurboTranspose Byte**| 44 | |.451|.139|Blosc byteshuffle| 45 | |**.345**|**.229**|**TurboTranspose Nibble**| 46 | |.773|.476|Bitshuffle| 47 | 48 | |E cycles/byte|D cycles/byte|Transpose 16 bits **AVX2**| 49 | |------:|------:|-----------------------------------| 50 | |**.095**|**.071**|**TurboTranspose Byte**| 51 | |.640|.108|Blosc byteshuffle| 52 | |**.329**|**.198**|**TurboTranspose Nibble**| 53 | |.758|1.177|Bitshuffle 2| 54 | |**.067**|**.067**|memcpy| 55 | ---------------------------------------------------------------- 56 | |E MB/s| D MB/s| 16 bits **ARM** 2019.11| 57 | |--------:|---------:|-----------------------------------| 58 | |**8192**|**16384**|**TurboTranspose Byte**| 59 | | 8192| 8192| blosc byteshuffle | 60 | | **1638**| **2341**|**TurboTranspose Nibble**| 61 | | 356| 287| blosc bitshuffle| 62 | | 16384| 16384| memcpy | 63 | 64 | | E MB/s | D MB/s| 32 bits **ARM** 2019.11| 65 | |--------:|---------:|-----------------------------------| 66 | |**8192**|**8192**|**TurboTranspose Byte**| 67 | | 8192| 8192| blosc byteshuffle| 68 | |**1820**|**2341**|**TurboTranspose Nibble**| 69 | | 372| 252| blosc bitshuffle| 70 | 71 | | E MB/s | D MB/s| 64 bits **ARM** 2019.11| 72 | |--------:|---------:|-----------------------------------| 73 | | 4096| **8192**|**TurboTranspose Byte**| 74 | |**5461**| 5461|**blosc byteshuffle**| 75 | |**1490**|**1490**|**TurboTranspose Nibble**| 76 | | 372| 260| blosc bitshuffle| 77 | 78 | #### Transpose/Shuffle benchmark w/ **large** files (100MB). 79 | 80 | MB/s: 1,000,000 bytes/second
81 | 82 | ./tpbench -s# file (# = 8,4,2) 83 | E MB/s|D MB/s|Transpose 16 bits **AVX2** 2019.11| 84 | |------:|------:|-----------------------------------| 85 | |**9208**|**9795**|**TurboTranspose Byte**| 86 | |8382|7689|Blosc byteshuffle| 87 | |**9377**|**9584**|**TurboTranspose Nibble**| 88 | |2750|2530|Blosc bitshuffle| 89 | |13725|13900|memcpy| 90 | 91 | |E MB/s|D MB/s|Transpose 32 bits **AVX2** 2019.11| 92 | |------:|------:|-----------------------------------| 93 | |**9718**|**9713**|**TurboTranspose Byte**| 94 | |9181|9030|Blosc byteshuffle| 95 | |**8750**|**9472**|**TurboTranspose Nibble**| 96 | |2767|2942|Blosc bitshuffle 4| 97 | 98 | |E MB/s|D MB/s|Transpose 64 bits **AVX2** 2019.11| 99 | |------:|------:|-----------------------------------| 100 | |**8998**|**9573**|**TurboTranspose Byte**| 101 | |8721|8586|Blosc byteshuffle 2| 102 | |**8252**|**9222**|**TurboTranspose Nibble**| 103 | |2711|2053|Blosc bitshuffle 2| 104 | 105 | ---------------------------------------------------------- 106 | | E MB/s | D MB/s| 16 bits ARM 2019.11| 107 | |--------:|---------:|-----------------------------------| 108 | |**872**|**3998**|**TurboTranspose Byte**| 109 | | 678| 3852| blosc byteshuffle| 110 | |**1365**|**2195**|**TurboTranspose Nibble**| 111 | | 357| 280| blosc bitshuffle| 112 | | 3921| 3913| memcpy| 113 | 114 | | E MB/s | D MB/s| 32 bits ARM 2019.11| 115 | |--------:|---------:|-----------------------------------| 116 | |**1828**|**3768**|**TurboTranspose Byte**| 117 | |1769|3713|blosc byteshuffle| 118 | |**1456**|**2299**|**TurboTranspose Nibble**| 119 | | 374 | 243| blosc bitshuffle| 120 | 121 | | E MB/s | D MB/s| 64 bits ARM 2019.11| 122 | |--------:|---------:|-----------------------------------| 123 | |**1793**|**3572**|**TurboTranspose Byte** 124 | |1784| 3544|**blosc byteshuffle** 125 | |**1176**|**1267**|**TurboTranspose Nibble** 126 | | 331 | 203| blosc bitshuffle 127 | 128 | #### - Compression test (transpose/shuffle+lz4) 129 | :new: Download [IcApp](https://sites.google.com/site/powturbo/downloads) a new benchmark for [TurboPFor](https://github.com/powturbo/TurboPFor)+TurboTranspose
130 | for testing almost all integer and floating-point file types.
131 | Note: Lossy compression benchmark with icapp only. 132 | 133 | - [Scientific IEEE 754 32-Bit Single-Precision Floating-Point Datasets](http://cs.txstate.edu/~burtscher/research/datasets/FPsingle/) 134 | 135 | ###### - Speed test (file msg_sweep3d) 136 | 137 | C size |ratio %|C MB/s |D MB/s|Name AVX2| 138 | ---------:|------:|------:|-----:|:--------------| 139 | 11,348,554 |18.1|**2276**|**4425**|**TurboTranspose Nibble+lz**| 140 | 22,489,691 |35.8| 1670|3881|TurboTranspose Byte+lz | 141 | 43,471,376 |69.2| 348| 402|SPDP | 142 | 44,626,407 |71.0| 1065|2101|bitshuffle+lz| 143 | 62,865,612 |100.0|13300|13300|memcpy| 144 | 145 | ./tpbench -s4 -z *.sp 146 | 147 | |File |File size|lz %|Tp8lz|Tp4lz|[BS](#bitshuffle)lz|[spdp1](#spdp)||[spdp9](#spdp)|Tp4lzt|eTp4lzt| 148 | |:---------|--------:|----:|------:|--------:|-------:|-----:|-|-------:|-------:|----:| 149 | msg_bt |133194716| 94.3|70.4|**66.4**|73.9 | 70.0|` `|67.4|**54.7**|*32.4*| 150 | msg_lu | 97059484|100.4|77.1 |**70.4**|75.4 | 76.8|` `|74.0|**61.0**|*42.2*| 151 | msg_sppm |139497932| 11.7|**11.6**|12.6 |15.4 | 14.4|` `|13.7|**9.0**|*5.6*| 152 | msg_sp |145052928|100.3|68.8 |**63.7**|68.1 | 67.9|` `|65.3|**52.6**|*24.9*| 153 | msg_sweep3d| 62865612| 98.7|35.8 |**18.1**|71.0 | 69.6|` `|13.7|**9.8**|*3.8*| 154 | num_brain | 70920000|100.4|76.5 |**71.1**|77.4 | 79.1|` `|73.9|**63.4**|*32.6*| 155 | num_comet | 53673984| 92.4|79.0 |**77.6**|82.1 | 84.5|` `|84.6|**70.1**|*41.7*| 156 | num_control| 79752372| 99.4|89.5 |90.7 |**88.1** | 98.3|` `|98.5|**81.4**|*51.2*| 157 | num_plasma | 17544800|100.4| 0.7 |**0.7** |75.5 | 30.7|` `|2.9|**0.3**|*0.2*| 158 | obs_error | 31080408| 89.2|73.1 |**70.0**|76.9 | 78.3|` `|49.4|**20.5**|*12.2*| 159 | obs_info | 9465264| 93.6|70.2 |**61.9**|72.9 | 62.4|` `|43.8|**27.3**|*15.1*| 160 | obs_spitzer| 99090432| 98.3|**90.4** |95.6 |93.6 |100.1|` `|100.7|**80.2**|*52.3*| 161 | obs_temp | 19967136|100.4|**89.5**|92.4 |91.0 | 99.4|` `|100.1|**84.0**|*55.8*| 162 | 163 | Tp8=Byte transpose, Tp4=Nibble transpose, lz = lz4
164 | eTp4Lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)
165 | *Slow but best compression:* SPDP9 and [lzt = lzturbo,39](https://github.com/powturbo/TurboBench) 166 | 167 | - [Scientific IEEE 754 64-Bit Double-Precision Floating-Point Datasets](http://cs.txstate.edu/~burtscher/research/datasets/FPdouble/) 168 | 169 | ./tpbench -s8 -z *.trace 170 | 171 | |File |File size |lz %|Tp8lz|Tp4lz|[BS](#bitshuffle)lz|[spdp1](#spdp)||[spdp9](#spdp)|Tp4lzt|eTp4lzt| 172 | |:---------|----------:|----:|------:|--------:|-------:|-----:|-|-------:|-------:|----:| 173 | msg_bt |266389432|94.5|77.2|**76.5**|81.6| 77.9|` `|75.4|**69.9**|*16.0*| 174 | msg_lu |194118968|100.4|82.7|**81.0**|83.7|83.3|` `|79.6|**75.5**|*21.0*| 175 | msg_sppm |278995864|18.9|**14.5**|14.9|19.5| 21.5|` `|19.8|**11.2**|*2.8*| 176 | msg_sp |290105856|100.4|79.2|**77.5**|80.2|78.8|` `|77.1|**71.3**|*12.4*| 177 | msg_sweep3d|125731224|98.7|50.7|**36.7**|80.4| 76.2|` `|33.2|**27.3**|*1.9*| 178 | num_brain |141840000|100.4|82.6|**81.1**|84.5|87.8|` `|83.3|**77.0**|*16.3*| 179 | num_comet |107347968|92.8|83.3|78.8|**76.3**| 86.5|` `|86.0|**69.8**|*21.2*| 180 | num_control|159504744|99.6|92.2|90.9|**89.4**| 97.6|` `|98.9|**85.5**|*25.8*| 181 | num_plasma | 35089600|75.2|0.7|**0.7**|84.5| 77.3|` `|3.0|**0.3**|*0.1*| 182 | obs_error | 62160816|78.7|81.0|**77.5**|84.4| 87.9|` `|62.3|**23.4**|*6.3*| 183 | obs_info | 18930528|92.3|75.4|**70.6**|82.4| 81.7|` `|51.2|**33.1**|*7.7*| 184 | obs_spitzer|198180864|95.4|93.2|93.7|**86.4**|100.1|` `|102.4|**78.0**|*26.9*| 185 | obs_temp | 39934272|100.4|93.1|93.8|**91.7**|98.0|` `|97.4|**88.2**|*28.8*| 186 | 187 | eTp4Lzt = lossy compression with allowed error = 0.0001
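The eTp4lzt columns above use the lossy mode: fppad32/fppad64 (see bitutil.c) pad the trailing mantissa bits with zeros while keeping every value within the user-defined relative error, so the padded floats compress much better after the transpose. A minimal sketch of that guarantee, assuming fppad32 is declared in bitutil.h (adjust the include to your build):

    #include <math.h>
    #include <stdio.h>
    #include "bitutil.h"              /* fppad32() - assumed to be declared here */

    int main(void) {
      float in[4] = { 3.14159265f, 2.71828182f, 1.41421356f, 0.57721566f }, out[4];
      float e = 0.0001f;              /* allowed relative error, as in eTp4lzt (1e-4) */
      fppad32(in, 4, out, e);         /* zero trailing mantissa bits within the bound */
      for (int i = 0; i < 4; i++)     /* each value stays within the error bound */
        printf("%g -> %g (rel. err %g)\n", in[i], out[i],
               fabsf(out[i] - in[i]) / fabsf(in[i]));
      return 0;
    }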
188 | 189 | ### Compile: 190 | 191 | git clone git://github.com/powturbo/TurboTranspose.git 192 | cd TurboTranspose 193 | 194 | ##### Linux + Windows MinGW 195 | 196 | make 197 | or 198 | make AVX2=1 199 | 200 | ##### Windows Visual C++ 201 | 202 | nmake /f makefile.vs 203 | or 204 | nmake AVX2=1 /f makefile.vs 205 | 206 | 207 | + benchmark with other libraries
208 | download or clone [bitshuffle](https://github.com/kiyo-masui/bitshuffle) or [blosc](https://github.com/Blosc/c-blosc) and type 209 | 210 | make AVX2=1 BLOSC=1 211 | or 212 | make AVX2=1 BITSHUFFLE=1 213 | 214 | ### Testing: 215 | + benchmark "transpose" functions
216 | 217 | ./tpbench [-s#] [-z] file 218 | -s# = element size in bytes, # = 2,4,8,16,... (default 4) 219 | -z = only lz77 compression benchmark (bitshuffle package mandatory) 220 | 221 | 222 | ### Function usage: 223 | 224 | **Byte transpose:** 225 | >**void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
226 | void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize)**
227 | in : input buffer
228 | n : number of bytes
229 | out : output buffer
230 | esize : element size in bytes (2,4,8,...)
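A minimal round-trip sketch, assuming the prototypes above are exported by transpose.h (as the repository layout suggests) and that n is a multiple of esize:

    #include <assert.h>
    #include <string.h>
    #include "transpose.h"            /* tpenc()/tpdec() - assumed to be declared here */

    int main(void) {
      unsigned char in[1024], tmp[1024], out[1024];
      for (unsigned i = 0; i < sizeof(in); i++) in[i] = (unsigned char)i;
      tpenc(in, sizeof(in), tmp, 4);  /* 32-bit elements: byte 0 of all elements, then byte 1, ... */
      tpdec(tmp, sizeof(in), out, 4); /* inverse transform restores the original order */
      assert(!memcmp(in, out, sizeof(in)));
      return 0;
    }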
231 | 232 | 233 | **Nibble transpose:** 234 | >**void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
235 | void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize)**
236 | in : input buffer
237 | n : number of bytes
238 | out : output buffer
239 | esize : element size in bytes (2,4,8,...)
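Same calling convention as the byte transpose. A sketch of the usual filter pipeline (the lz step is only indicated by a comment; buffer sizes and the transpose.h include are illustrative assumptions):

    #include <assert.h>
    #include <string.h>
    #include "transpose.h"            /* tp4enc()/tp4dec() - assumed to be declared here */

    int main(void) {
      float f[256]; unsigned char tmp[sizeof(f)], out[sizeof(f)];
      for (int i = 0; i < 256; i++) f[i] = 1.0f / (float)(i + 1);
      tp4enc((unsigned char *)f, sizeof(f), tmp, 4); /* tmp is what the lz coder sees */
      /* ... lz-compress tmp, store/transmit, decompress back into tmp ... */
      tp4dec(tmp, sizeof(f), out, 4);                /* undo the nibble transpose */
      assert(!memcmp(f, out, sizeof(f)));
      return 0;
    }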
240 | 241 | ### Environment: 242 | 243 | ###### OS/Compiler (64 bits): 244 | - Linux: GNU GCC (>=4.6) 245 | - Linux: Clang (>=3.2) 246 | - Windows: MinGW-w64 makefile 247 | - Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake) 248 | - Windows: Visual Studio project file - vs/vs2017 - Thanks to [PavelP](https://github.com/pps83) 249 | - Linux ARM: 64 bits aarch64 ARMv8: gcc (>=6.3) 250 | - Linux ARM: 64 bits aarch64 ARMv8: clang 251 | 252 | ###### Multithreading: 253 | - All TurboTranspose functions are thread-safe 254 | 255 | ### References: 256 | - [BS - Bitshuffle: Filter for improving compression of typed binary data.](https://github.com/kiyo-masui/bitshuffle)
257 | :green_book:[ A compression scheme for radio data in high performance computing](https://arxiv.org/abs/1503.00638) 258 | - [Blosc: A blocking, shuffling and lossless compression library](https://github.com/Blosc/c-blosc) 259 | - [SPDP: a compression/decompression algorithm for binary IEEE 754 32/64-bit floating-point data](http://cs.txstate.edu/~burtscher/research/SPDPcompressor/)
260 | :green_book:[ SPDP - An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data](http://cs.txstate.edu/~mb92/papers/dcc18.pdf) + [DCC 2018](http://www.cs.brandeis.edu//~dcc/Programs/Program2018.pdf) 261 | - :green_book:[ FPC: A High-Speed Compressor for Double-Precision Floating-Point Data](http://www.cs.txstate.edu/~burtscher/papers/tc09.pdf) 262 | 263 | Last update: 25 Oct 2019 264 | -------------------------------------------------------------------------------- /bitutil.c: -------------------------------------------------------------------------------- 1 | /** 2 | Copyright (C) powturbo 2013-2019 3 | GPL v2 License 4 | 5 | This program is free software; you can redistribute it and/or modify 6 | it under the terms of the GNU General Public License as published by 7 | the Free Software Foundation; either version 2 of the License, or 8 | (at your option) any later version. 9 | 10 | This program is distributed in the hope that it will be useful, 11 | but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | GNU General Public License for more details. 14 | 15 | You should have received a copy of the GNU General Public License along 16 | with this program; if not, write to the Free Software Foundation, Inc., 17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 18 | 19 | - homepage : https://sites.google.com/site/powturbo/ 20 | - github : https://github.com/powturbo 21 | - twitter : https://twitter.com/powturbo 22 | - email : powturbo [_AT_] gmail [_DOT_] com 23 | **/ 24 | // "Integer Compression" utility - delta, for, zigzag / Floating point compression 25 | #include "conf.h" 26 | #define BITUTIL_IN 27 | #include "bitutil.h" 28 | 29 | //------------ 'or' for bitsize + 'xor' for all duplicate ------------------ 30 | #define BT(_i_) { o |= ip[_i_]; x |= ip[_i_] ^ u0; } 31 | #define BIT(_in_, _n_, _usize_) {\ 32 | u0 = _in_[0]; o = x = 0;\ 33 | for(ip = _in_; ip != _in_+(_n_&~(4-1)); ip += 4) { BT(0); BT(1); BT(2); BT(3); }\ 34 | for(;ip != _in_+_n_; ip++) BT(0);\ 35 | } 36 | 37 | uint8_t bit8( uint8_t *in, unsigned n, uint8_t *px) { uint8_t o,x,u0,*ip; BIT(in, n, 8); if(px) *px = x; return o; } 38 | uint64_t bit64(uint64_t *in, unsigned n, uint64_t *px) { uint64_t o,x,u0,*ip; BIT(in, n, 64); if(px) *px = x; return o; } 39 | 40 | uint16_t bit16(uint16_t *in, unsigned n, uint16_t *px) { 41 | uint16_t o, x, u0 = in[0], *ip; 42 | #if defined(__SSE2__) || defined(__ARM_NEON) 43 | __m128i vb0 = _mm_set1_epi16(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 44 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); 45 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 46 | __m128i v0 = _mm_loadu_si128((__m128i *) ip); 47 | __m128i v1 = _mm_loadu_si128((__m128i *)(ip+8)); 48 | vo0 = _mm_or_si128( vo0, v0); 49 | vo1 = _mm_or_si128( vo1, v1); 50 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 51 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 52 | } 53 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0); 54 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0); 55 | #else 56 | ip = in; o = x = 0; //BIT( in, n, 16); 57 | #endif 58 | for(; ip != in+n; ip++) BT(0); 59 | if(px) *px = x; 60 | return o; 61 | } 62 | 63 | uint32_t bit32(uint32_t *in, unsigned n, uint32_t *px) { 64 | uint32_t o,x,u0 = in[0], *ip; 65 | #if defined(__AVX2__) && defined(USE_AVX2) 66 | __m256i vb0 = _mm256_set1_epi32(*in), vo0 = _mm256_setzero_si256(), vx0 = 
_mm256_setzero_si256(), 67 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); 68 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 69 | __m256i v0 = _mm256_loadu_si256((__m256i *) ip); 70 | __m256i v1 = _mm256_loadu_si256((__m256i *)(ip+8)); 71 | vo0 = _mm256_or_si256(vo0, v0); 72 | vo1 = _mm256_or_si256(vo1, v1); 73 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0)); 74 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0)); 75 | } 76 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0); 77 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0); 78 | #elif defined(__SSE2__) || defined(__ARM_NEON) 79 | __m128i vb0 = _mm_set1_epi32(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 80 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); 81 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0); 82 | __m128i v0 = _mm_loadu_si128((__m128i *) ip); 83 | __m128i v1 = _mm_loadu_si128((__m128i *)(ip+4)); 84 | vo0 = _mm_or_si128(vo0, v0); 85 | vo1 = _mm_or_si128(vo1, v1); 86 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 87 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 88 | } 89 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0); 90 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0); 91 | #else 92 | ip = in; o = x = 0; //BIT( in, n, 32); 93 | #endif 94 | for(; ip != in+n; ip++) BT(0); 95 | if(px) *px = x; 96 | return o; 97 | } 98 | 99 | //----------------------------------------------------------- Delta ---------------------------------------------------------------- 100 | #define DE(_ip_,_i_) u = (_ip_[_i_]-start)-_md; start = _ip_[_i_]; 101 | #define BITDE(_t_, _in_, _n_, _md_, _act_) { _t_ _md = _md_, *_ip; o = x = 0;\ 102 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { DE(_ip,0);_act_; DE(_ip,1);_act_; DE(_ip,2);_act_; DE(_ip,3);_act_; }\ 103 | for(;_ip != _in_+_n_;_ip++) { DE(_ip,0); _act_; }\ 104 | } 105 | //---- (min. 
Delta = 0) 106 | //-- delta encoding 107 | uint8_t bitd8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t u, u0 = in[0]-start, o, x; BITDE(uint8_t, in, n, 0, o |= u; x |= u^u0); if(px) *px = x; return o; } 108 | uint64_t bitd64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t u, u0 = in[0]-start, o, x; BITDE(uint64_t, in, n, 0, o |= u; x |= u^u0); if(px) *px = x; return o; } 109 | 110 | uint16_t bitd16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { 111 | uint16_t o, x, *ip, u0 = in[0]-start; 112 | #if defined(__SSE2__) || defined(__ARM_NEON) 113 | __m128i vb0 = _mm_set1_epi16(u0), 114 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 115 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi16(start); 116 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 117 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip); 118 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+8)); __m128i v0 = mm_delta_epi16(vi0,vs); vs = vi0; 119 | __m128i v1 = mm_delta_epi16(vi1,vs); vs = vi1; 120 | vo0 = _mm_or_si128(vo0, v0); 121 | vo1 = _mm_or_si128(vo1, v1); 122 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 123 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 124 | } start = _mm_cvtsi128_si16(_mm_srli_si128(vs,14)); 125 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0); 126 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0); 127 | #else 128 | ip = in; o = x = 0; 129 | #endif 130 | for(;ip != in+n; ip++) { 131 | uint16_t u = *ip - start; start = *ip; 132 | o |= u; 133 | x |= u ^ u0; 134 | } 135 | if(px) *px = x; 136 | return o; 137 | } 138 | 139 | uint32_t bitd32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { 140 | uint32_t o, x, *ip, u0 = in[0] - start; 141 | #if defined(__AVX2__) && defined(USE_AVX2) 142 | __m256i vb0 = _mm256_set1_epi32(u0), 143 | vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(), 144 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start); 145 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 146 | __m256i vi0 = _mm256_loadu_si256((__m256i *) ip); 147 | __m256i vi1 = _mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = mm256_delta_epi32(vi0,vs); vs = vi0; 148 | __m256i v1 = mm256_delta_epi32(vi1,vs); vs = vi1; 149 | vo0 = _mm256_or_si256(vo0, v0); 150 | vo1 = _mm256_or_si256(vo1, v1); 151 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0)); 152 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0)); 153 | } start = (unsigned)_mm256_extract_epi32(vs, 7); 154 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0); 155 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0); 156 | #elif defined(__SSE2__) || defined(__ARM_NEON) 157 | __m128i vb0 = _mm_set1_epi32(u0), 158 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 159 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start); 160 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0); 161 | __m128i vi0 = _mm_loadu_si128((__m128i *)ip); 162 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = mm_delta_epi32(vi0,vs); vs = vi0; 163 | __m128i v1 = mm_delta_epi32(vi1,vs); vs = vi1; 164 | vo0 = _mm_or_si128(vo0, v0); 165 | vo1 = _mm_or_si128(vo1, v1); 166 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 167 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 168 | } start = _mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 169 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0); 170 | vx0 = 
_mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0); 171 | #else 172 | ip = in; o = x = 0; 173 | #endif 174 | for(;ip != in+n; ip++) { 175 | uint32_t u = *ip - start; start = *ip; 176 | o |= u; 177 | x |= u ^ u0; 178 | } 179 | if(px) *px = x; 180 | return o; 181 | } 182 | 183 | //----- Undelta: In-place prefix sum (min. Delta = 0) ------------------- 184 | #define DD(i) _ip[i] = (start += _ip[i] + _md); 185 | #define BITDD(_t_, _in_, _n_, _md_) { _t_ *_ip; const _md = _md_;\ 186 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { DD(0); DD(1); DD(2); DD(3); }\ 187 | for(;_ip != _in_+_n_; _ip++) DD(0);\ 188 | } 189 | 190 | void bitddec8( uint8_t *p, unsigned n, uint8_t start) { BITDD(uint8_t, p, n, 0); } 191 | void bitddec16(uint16_t *p, unsigned n, uint16_t start) { BITDD(uint16_t, p, n, 0); } 192 | void bitddec64(uint64_t *p, unsigned n, uint64_t start) { BITDD(uint64_t, p, n, 0); } 193 | void bitddec32(uint32_t *p, unsigned n, unsigned start) { 194 | #if defined(__AVX2__) && defined(USE_AVX2) 195 | __m256i vs = _mm256_set1_epi32(start); 196 | unsigned *ip; 197 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) { 198 | __m256i v = _mm256_loadu_si256((__m256i *)ip); 199 | vs = mm256_scan_epi32(v,vs); 200 | _mm256_storeu_si256((__m256i *)ip, vs); 201 | } 202 | start = (unsigned)_mm256_extract_epi32(vs, 7); 203 | while(ip != p+n) { 204 | *ip = (start += (*ip)); 205 | ip++; 206 | } 207 | #elif defined(__SSE2__) || defined(__ARM_NEON) 208 | __m128i vs = _mm_set1_epi32(start); 209 | unsigned *ip; 210 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) { 211 | __m128i v = _mm_loadu_si128((__m128i *)ip); 212 | vs = mm_scan_epi32(v, vs); 213 | _mm_storeu_si128((__m128i *)ip, vs); 214 | } 215 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 216 | while(ip != p+n) { 217 | *ip = (start += (*ip)); 218 | ip++; 219 | } 220 | #else 221 | BITDD(uint32_t, p, n, 0); 222 | #endif 223 | } 224 | 225 | //----------- Zigzag of Delta -------------------------- 226 | #define ZDE(i, _usize_) d = (_ip[i]-start)-_md; u = TEMPLATE2(zigzagenc, _usize_)(d - startd); startd = d; start = _ip[i] 227 | #define BITZDE(_t_, _in_, _n_, _md_, _usize_, _act_) { _t_ *_ip, _md = _md_;\ 228 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZDE(0, _usize_);_act_; ZDE(1, _usize_);_act_; ZDE(2, _usize_);_act_; ZDE(3, _usize_);_act_; }\ 229 | for(;_ip != _in_+_n_;_ip++) { ZDE(0, _usize_); _act_; }\ 230 | } 231 | 232 | uint8_t bitzz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t o=0, x=0,d,startd=0,u; BITZDE(uint8_t, in, n, 1, 8, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 233 | uint16_t bitzz16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t o=0, x=0,d,startd=0,u; BITZDE(uint16_t, in, n, 1, 16, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 234 | uint32_t bitzz32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { uint64_t o=0, x=0,d,startd=0,u; BITZDE(uint32_t, in, n, 1, 32, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 235 | uint64_t bitzz64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o=0, x=0,d,startd=0,u; BITZDE(uint64_t, in, n, 1, 64, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 236 | uint8_t bitzzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o=0,*op = out,u,d,startd=0; BITZDE(uint8_t, in, n, mindelta, 8,o |= u;*op++ = u); return o;} 237 | uint16_t bitzzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o=0,*op = out,u,d,startd=0; BITZDE(uint16_t, in, n, 
mindelta, 16,o |= u;*op++ = u); return o;} 238 | uint32_t bitzzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) { uint32_t o=0,*op = out,u,d,startd=0; BITZDE(uint32_t, in, n, mindelta, 32,o |= u;*op++ = u); return o;} 239 | uint64_t bitzzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o=0,*op = out,u,d,startd=0; BITZDE(uint64_t, in, n, mindelta, 64,o |= u;*op++ = u); return o;} 240 | 241 | #define ZDD(i) u = _ip[i]; d = u - start; _ip[i] = zigzagdec64(u)+(int64_t)startd+_md; startd = d; start = u 242 | #define BITZDD(_t_, _in_, _n_, _md_) { _t_ *_ip, startd=0,d,u; const _md = _md_;\ 243 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZDD(0); ZDD(1); ZDD(2); ZDD(3); }\ 244 | for(;_ip != _in_+_n_; _ip++) ZDD(0);\ 245 | } 246 | void bitzzdec8( uint8_t *p, unsigned n, uint8_t start) { BITZDD(uint8_t, p, n, 1); } 247 | void bitzzdec16(uint16_t *p, unsigned n, uint16_t start) { BITZDD(uint16_t, p, n, 1); } 248 | void bitzzdec64(uint64_t *p, unsigned n, uint64_t start) { BITZDD(uint64_t, p, n, 1); } 249 | void bitzzdec32(uint32_t *p, unsigned n, uint32_t start) { BITZDD(uint32_t, p, n, 1); } 250 | 251 | //-----Undelta: In-place prefix sum (min. Delta = 1) ------------------- 252 | uint8_t bitd18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t o=0,x=0,u,*ip; BITDE(uint8_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 253 | uint16_t bitd116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t o=0,x=0,u,*ip; BITDE(uint16_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 254 | uint64_t bitd164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o=0,x=0,u,*ip; BITDE(uint64_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; } 255 | 256 | uint32_t bitd132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { 257 | uint32_t o, x, *ip, u0 = in[0]-start-1; 258 | #if defined(__AVX2__) && defined(USE_AVX2) 259 | __m256i vb0 = _mm256_set1_epi32(u0), 260 | vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(), 261 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start), cv = _mm256_set1_epi32(1); 262 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 263 | __m256i vi0 = _mm256_loadu_si256((__m256i *)ip); 264 | __m256i vi1 = _mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = _mm256_sub_epi32(mm256_delta_epi32(vi0,vs),cv); vs = vi0; 265 | __m256i v1 = _mm256_sub_epi32(mm256_delta_epi32(vi1,vs),cv); vs = vi1; 266 | vo0 = _mm256_or_si256(vo0, v0); 267 | vo1 = _mm256_or_si256(vo1, v1); 268 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0)); 269 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0)); 270 | } start = (unsigned)_mm256_extract_epi32(vs, 7); 271 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0); 272 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0); 273 | #elif defined(__SSE2__) || defined(__ARM_NEON) 274 | __m128i vb0 = _mm_set1_epi32(u0), 275 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 276 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(1); 277 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0); 278 | __m128i vi0 = _mm_loadu_si128((__m128i *)ip); 279 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = _mm_sub_epi32(mm_delta_epi32(vi0,vs),cv); vs = vi0; 280 | __m128i v1 = _mm_sub_epi32(mm_delta_epi32(vi1,vs),cv); vs = vi1; 281 | 
vo0 = _mm_or_si128(vo0, v0); 282 | vo1 = _mm_or_si128(vo1, v1); 283 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 284 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 285 | } start = _mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 286 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0); 287 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0); 288 | #else 289 | ip = in; o = x = 0; 290 | #endif 291 | for(;ip != in+n; ip++) { 292 | uint32_t u = ip[0] - start-1; start = *ip; 293 | o |= u; 294 | x |= u ^ u0; 295 | } 296 | if(px) *px = x; 297 | return o; 298 | } 299 | 300 | uint16_t bits128v16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { 301 | #if defined(__SSE2__) || defined(__ARM_NEON) 302 | unsigned *ip,b; __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi16(start), cv = _mm_set1_epi16(8); 303 | for(ip = in; ip != in+(n&~(4-1)); ip += 4) { 304 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 305 | bv = _mm_or_si128(bv,_mm_sub_epi16(SUBI16x8(iv,vs),cv)); 306 | vs = iv; 307 | } 308 | start = (unsigned short)_mm_cvtsi128_si32(_mm_srli_si128(vs,14)); 309 | b = mm_hor_epi16(bv); 310 | if(px) *px = 0; 311 | return b; 312 | #endif 313 | } 314 | 315 | unsigned bits128v32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { 316 | #if defined(__SSE2__) || defined(__ARM_NEON) 317 | unsigned *ip,b; __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(4); 318 | for(ip = in; ip != in+(n&~(4-1)); ip += 4) { 319 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 320 | bv = _mm_or_si128(bv,_mm_sub_epi32(SUBI32x4(iv,vs),cv)); 321 | vs = iv; 322 | } 323 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 324 | b = mm_hor_epi32(bv); 325 | if(px) *px = 0; 326 | return b; 327 | #endif 328 | } 329 | 330 | void bitd1dec8( uint8_t *p, unsigned n, uint8_t start) { BITDD(uint8_t, p, n, 1); } 331 | void bitd1dec16(uint16_t *p, unsigned n, uint16_t start) { BITDD(uint16_t, p, n, 1); } 332 | void bitd1dec64(uint64_t *p, unsigned n, uint64_t start) { BITDD(uint64_t, p, n, 1); } 333 | void bitd1dec32(uint32_t *p, unsigned n, uint32_t start) { 334 | #if defined(__AVX2__) && defined(USE_AVX2) 335 | __m256i vs = _mm256_set1_epi32(start),zv = _mm256_setzero_si256(), cv = _mm256_set_epi32(8,7,6,5,4,3,2,1); 336 | unsigned *ip; 337 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) { 338 | __m256i v = _mm256_loadu_si256((__m256i *)ip); vs = mm256_scani_epi32(v, vs, cv); 339 | _mm256_storeu_si256((__m256i *)ip, vs); 340 | } 341 | start = (unsigned)_mm256_extract_epi32(vs, 7); 342 | while(ip != p+n) { 343 | *ip = (start += (*ip) + 1); 344 | ip++; 345 | } 346 | #elif defined(__SSE2__) || defined(__ARM_NEON) 347 | __m128i vs = _mm_set1_epi32(start), cv = _mm_set_epi32(4,3,2,1); 348 | unsigned *ip; 349 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) { 350 | __m128i v = _mm_loadu_si128((__m128i *)ip); 351 | vs = mm_scani_epi32(v, vs, cv); 352 | _mm_storeu_si128((__m128i *)ip, vs); 353 | } 354 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 355 | while(ip != p+n) { 356 | *ip = (start += (*ip) + 1); 357 | ip++; 358 | } 359 | #else 360 | BITDD(uint32_t, p, n, 1); 361 | #endif 362 | } 363 | 364 | //---------Delta encoding/decoding (min. Delta = mindelta) ------------------- 365 | //determine min. 
delta for encoding w/ bitdiencNN function 366 | #define DI(_ip_,_i_) u = _ip_[_i_] - start; start = _ip_[_i_]; if(u < mindelta) mindelta = u 367 | #define BITDIE(_in_, _n_) {\ 368 | for(_ip = _in_,mindelta = _ip[0]; _ip != _in_+(_n_&~(4-1)); _ip+=4) { DI(_ip,0); DI(_ip,1); DI(_ip,2); DI(_ip,3); }\ 369 | for(;_ip != _in_+_n_;_ip++) DI(_ip,0);\ 370 | } 371 | 372 | uint8_t bitdi8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; } 373 | uint16_t bitdi16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; } 374 | uint32_t bitdi32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { uint32_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; } 375 | uint64_t bitdi64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; } 376 | 377 | uint8_t bitdienc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o=0,x=0,*op = out,u,*ip; BITDE(uint8_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; } 378 | uint16_t bitdienc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o=0,x=0,*op = out,u,*ip; BITDE(uint16_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; } 379 | uint64_t bitdienc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o=0,x=0,*op = out,u,*ip; BITDE(uint64_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; } 380 | uint32_t bitdienc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) { 381 | #if defined(__SSE2__) || defined(__ARM_NEON) 382 | unsigned *ip,b,*op = out; 383 | __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(mindelta), dv; 384 | for(ip = in; ip != in+(n&~(4-1)); ip += 4,op += 4) { 385 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 386 | bv = _mm_or_si128(bv, dv = _mm_sub_epi32(mm_delta_epi32(iv,vs),cv)); 387 | vs = iv; 388 | _mm_storeu_si128((__m128i *)op, dv); 389 | } 390 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 391 | b = mm_hor_epi32(bv); 392 | while(ip != in+n) { 393 | unsigned x = *ip-start-mindelta; 394 | start = *ip++; 395 | b |= x; 396 | *op++ = x; 397 | } 398 | #else 399 | uint32_t b = 0,*op = out, x, *_ip; 400 | BITDE(uint32_t, in, n, mindelta, b |= x; *op++ = x); 401 | #endif 402 | return b; 403 | } 404 | 405 | void bitdidec8( uint8_t *p, unsigned n, uint8_t start, uint8_t mindelta) { BITDD(uint8_t, p, n, mindelta); } 406 | void bitdidec16( uint16_t *p, unsigned n, uint16_t start, uint16_t mindelta) { BITDD(uint16_t, p, n, mindelta); } 407 | void bitdidec32( uint32_t *p, unsigned n, uint32_t start, uint32_t mindelta) { BITDD(uint32_t, p, n, mindelta); } 408 | void bitdidec64( uint64_t *p, unsigned n, uint64_t start, uint64_t mindelta) { BITDD(uint64_t, p, n, mindelta); } 409 | 410 | //------------------- For ------------------------------ 411 | uint8_t bitf8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { if(px) *px = 0; return n?in[n-1] - start :0; } 412 | uint8_t bitf18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; } 413 | uint16_t bitf16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { if(px) *px = 0; return n?in[n-1] - start :0; } 414 | uint16_t bitf116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) 
{ if(px) *px = 0; return n?in[n-1] - start - n:0; } 415 | uint32_t bitf32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { if(px) *px = 0; return n?in[n-1] - start :0; } 416 | uint32_t bitf132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; } 417 | uint64_t bitf64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { if(px) *px = 0; return n?in[n-1] - start :0; } 418 | uint64_t bitf164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; } 419 | 420 | //------------------- Zigzag --------------------------- 421 | #define ZE(i,_it_,_usize_) u = TEMPLATE2(zigzagenc, _usize_)((_it_)_ip[i]-(_it_)start); start = _ip[i] 422 | #define BITZENC(_ut_, _it_, _usize_, _in_,_n_, _act_) { _ut_ *_ip; o = 0; x = -1;\ 423 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZE(0,_it_,_usize_);_act_; ZE(1,_it_,_usize_);_act_; ZE(2,_it_,_usize_);_act_; ZE(3,_it_,_usize_);_act_; }\ 424 | for(;_ip != _in_+_n_; _ip++) { ZE(0,_it_,_usize_); _act_; }\ 425 | } 426 | 427 | // 'or' bits for zigzag encoding 428 | uint8_t bitz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t o, u,x; BITZENC(uint8_t, int8_t, 8, in, n, o |= x); if(px) *px = 0; return o; } 429 | uint64_t bitz64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o, u,x; BITZENC(uint64_t, int64_t,64,in, n, o |= x); if(px) *px = 0; return o; } 430 | 431 | uint16_t bitz16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { 432 | uint16_t o, x, *ip; uint32_t u0 = zigzagenc16((int)in[0] - (int)start); 433 | 434 | #if defined(__SSE2__) || defined(__ARM_NEON) 435 | __m128i vb0 = _mm_set1_epi16(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 436 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi16(start); 437 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 438 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip); 439 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+8)); __m128i v0 = mm_delta_epi16(vi0,vs); vs = vi0; v0 = mm_zzage_epi16(v0); 440 | __m128i v1 = mm_delta_epi16(vi1,vs); vs = vi1; v1 = mm_zzage_epi16(v1); 441 | vo0 = _mm_or_si128(vo0, v0); 442 | vo1 = _mm_or_si128(vo1, v1); 443 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 444 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 445 | } start = _mm_cvtsi128_si16(_mm_srli_si128(vs,14)); 446 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0); 447 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0); 448 | #else 449 | ip = in; //uint16_t u; o=x=0; BITDE(uint16_t, in, n, 0, o |= u; x |= u^u0); //BITZENC(uint16_t, int16_t, 16, in, n, o |= u,x &= u^u0); 450 | #endif 451 | for(;ip != in+n; ip++) { 452 | uint16_t u = zigzagenc16((int)ip[0] - (int)start); //int i = ((int)(*ip) - (int)start); i = (i << 1) ^ (i >> 15); 453 | start = *ip; 454 | o |= u; 455 | x |= u ^ u0; 456 | } 457 | if(px) *px = x; 458 | return o; 459 | } 460 | 461 | uint32_t bitz32(unsigned *in, unsigned n, uint32_t *px, unsigned start) { 462 | uint32_t o, x, *ip; uint32_t u0 = zigzagenc32((int)in[0] - (int)start); 463 | #if defined(__AVX2__) && defined(USE_AVX2) 464 | __m256i vb0 = _mm256_set1_epi32(u0), vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(), 465 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start); 466 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0); 467 | __m256i vi0 = _mm256_loadu_si256((__m256i *) ip); 468 | __m256i vi1 = 
_mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = mm256_delta_epi32(vi0,vs); vs = vi0; v0 = mm256_zzage_epi32(v0); 469 | __m256i v1 = mm256_delta_epi32(vi1,vs); vs = vi1; v1 = mm256_zzage_epi32(v1); 470 | vo0 = _mm256_or_si256(vo0, v0); 471 | vo1 = _mm256_or_si256(vo1, v1); 472 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0)); 473 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0)); 474 | } start = (unsigned)_mm256_extract_epi32(vs, 7); 475 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0); 476 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0); 477 | 478 | #elif defined(__SSE2__) || defined(__ARM_NEON) 479 | __m128i vb0 = _mm_set1_epi32(u0), 480 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(), 481 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start); 482 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0); 483 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip); 484 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = mm_delta_epi32(vi0,vs); vs = vi0; v0 = mm_zzage_epi32(v0); 485 | __m128i v1 = mm_delta_epi32(vi1,vs); vs = vi1; v1 = mm_zzage_epi32(v1); 486 | vo0 = _mm_or_si128(vo0, v0); 487 | vo1 = _mm_or_si128(vo1, v1); 488 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0)); 489 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0)); 490 | } start = _mm_cvtsi128_si16(_mm_srli_si128(vs,12)); 491 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0); 492 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0); 493 | #else 494 | ip = in; o = x = 0; //uint32_t u; BITDE(uint32_t, in, n, 0, o |= u; x |= u^u0); 495 | #endif 496 | for(;ip != in+n; ip++) { 497 | uint32_t u = zigzagenc32((int)ip[0] - (int)start); start = *ip; //((int)(*ip) - (int)start); //i = (i << 1) ^ (i >> 31); 498 | o |= u; 499 | x |= u ^ u0; 500 | } 501 | if(px) *px = x; 502 | return o; 503 | } 504 | 505 | uint8_t bitzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o,x,u,*op = out; BITZENC(uint8_t, int8_t, 8,in, n, o |= u; *op++ = u); return o; } 506 | uint16_t bitzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o,x,u,*op = out; BITZENC(uint16_t, int16_t,16,in, n, o |= u; *op++ = u); return o; } 507 | uint64_t bitzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o,x,u,*op = out; BITZENC(uint64_t, int64_t,64,in, n, o |= u; *op++ = u); return o; } 508 | uint32_t bitzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) { 509 | #if defined(__SSE2__) || defined(__ARM_NEON) 510 | unsigned *ip,b,*op = out; 511 | __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), dv; 512 | for(ip = in; ip != in+(n&~(4-1)); ip += 4,op += 4) { 513 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 514 | dv = mm_delta_epi32(iv,vs); vs = iv; 515 | dv = mm_zzage_epi32(dv); 516 | bv = _mm_or_si128(bv, dv); 517 | _mm_storeu_si128((__m128i *)op, dv); 518 | } 519 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 520 | b = mm_hor_epi32(bv); 521 | while(ip != in+n) { 522 | int x = ((int)(*ip)-(int)start); 523 | x = (x << 1) ^ (x >> 31); 524 | start = *ip++; 525 | b |= x; 526 | *op++ = x; 527 | } 528 | #else 529 | uint32_t b = 0, *op = out,x; 530 | BITZENC(uint32_t, int32_t, 32,in, n, b |= x; *op++ = x); 531 | #endif 532 | return bsr32(b); 533 | } 534 | 535 | #define ZD(_t_, _usize_, i) { _t_ _z = _ip[i]; _ip[i] = (start += TEMPLATE2(zigzagdec, _usize_)(_z)); } 536 | #define BITZDEC(_t_, _usize_, 
_in_, _n_) { _t_ *_ip;\ 537 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZD(_t_, _usize_, 0); ZD(_t_, _usize_, 1); ZD(_t_, _usize_, 2); ZD(_t_, _usize_, 3); }\ 538 | for(;_ip != _in_+_n_;_ip++) ZD(_t_, _usize_, 0);\ 539 | } 540 | 541 | void bitzdec8( uint8_t *p, unsigned n, uint8_t start) { BITZDEC(uint8_t, 8, p, n); } 542 | void bitzdec64(uint64_t *p, unsigned n, uint64_t start) { BITZDEC(uint64_t, 64,p, n); } 543 | 544 | void bitzdec16(uint16_t *p, unsigned n, uint16_t start) { 545 | #if defined(__SSSE3__) || defined(__ARM_NEON) 546 | __m128i vs = _mm_set1_epi16(start); //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128(); 547 | uint16_t *ip; 548 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) { 549 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 550 | iv = mm_zzagd_epi16(iv); 551 | vs = mm_scan_epi16(iv, vs); 552 | _mm_storeu_si128((__m128i *)ip, vs); 553 | } 554 | start = (uint16_t)_mm_cvtsi128_si32(_mm_srli_si128(vs,14)); 555 | while(ip != p+n) { 556 | uint16_t z = *ip; 557 | *ip++ = (start += (z >> 1 ^ -(z & 1))); 558 | } 559 | #else 560 | BITZDEC(uint16_t, 16, p, n); 561 | #endif 562 | } 563 | 564 | void bitzdec32(unsigned *p, unsigned n, unsigned start) { 565 | #if defined(__AVX2__) && defined(USE_AVX2) 566 | __m256i vs = _mm256_set1_epi32(start); //, zv = _mm256_setzero_si256()*/; //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128(); 567 | unsigned *ip; 568 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) { 569 | __m256i iv = _mm256_loadu_si256((__m256i *)ip); 570 | iv = mm256_zzagd_epi32(iv); 571 | vs = mm256_scan_epi32(iv,vs); 572 | _mm256_storeu_si256((__m256i *)ip, vs); 573 | } 574 | start = (unsigned)_mm256_extract_epi32(_mm256_srli_si256(vs,12), 4); 575 | while(ip != p+n) { 576 | unsigned z = *ip; 577 | *ip++ = (start += (z >> 1 ^ -(z & 1))); 578 | } 579 | #elif defined(__SSE2__) || defined(__ARM_NEON) 580 | __m128i vs = _mm_set1_epi32(start); //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128(); 581 | unsigned *ip; 582 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) { 583 | __m128i iv = _mm_loadu_si128((__m128i *)ip); 584 | iv = mm_zzagd_epi32(iv); 585 | vs = mm_scan_epi32(iv, vs); 586 | _mm_storeu_si128((__m128i *)ip, vs); 587 | } 588 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12)); 589 | while(ip != p+n) { 590 | unsigned z = *ip; 591 | *ip++ = (start += zigzagdec32(z)); 592 | } 593 | #else 594 | BITZDEC(uint32_t, 32, p, n); 595 | #endif 596 | } 597 | 598 | //----------------------- XOR : return max. 
bits --------------------------------- 599 | #define XE(i) x = _ip[i] ^ start; start = _ip[i] 600 | #define BITXENC(_t_, _in_, _n_, _act_) { _t_ *_ip;\ 601 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { XE(0);_act_; XE(1);_act_; XE(2);_act_; XE(3);_act_; }\ 602 | for( ; _ip != _in_+ _n_; _ip++ ) { XE(0);_act_; }\ 603 | } 604 | uint8_t bitxenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start) { uint8_t b = 0,*op = out,x; BITXENC(uint8_t, in, n, b |= x; *op++ = x); return b; } 605 | uint16_t bitxenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start) { uint16_t b = 0,*op = out,x; BITXENC(uint16_t, in, n, b |= x; *op++ = x); return b; } 606 | uint32_t bitxenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start) { uint32_t b = 0,*op = out,x; BITXENC(uint32_t, in, n, b |= x; *op++ = x); return b; } 607 | uint64_t bitxenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start) { uint64_t b = 0,*op = out,x; BITXENC(uint64_t, in, n, b |= x; *op++ = x); return b; } 608 | 609 | #define XD(i) _ip[i] = (start ^= _ip[i]) 610 | #define BITXDEC(_t_, _in_, _n_) { _t_ *_ip, _x;\ 611 | for(_ip = _in_;_ip != _in_+(_n_&~(4-1)); _ip += 4) { XD(0); XD(1); XD(2); XD(3); }\ 612 | for( ;_ip != _in_+ _n_ ; _ip++ ) XD(0);\ 613 | } 614 | 615 | void bitxdec8( uint8_t *p, unsigned n, uint8_t start) { BITXDEC(uint8_t, p, n); } 616 | void bitxdec16(uint16_t *p, unsigned n, uint16_t start) { BITXDEC(uint16_t, p, n); } 617 | void bitxdec32(uint32_t *p, unsigned n, uint32_t start) { BITXDEC(uint32_t, p, n); } 618 | void bitxdec64(uint64_t *p, unsigned n, uint64_t start) { BITXDEC(uint64_t, p, n); } 619 | 620 | //-------------- For : calc max. bits, min,max value ------------------------ 621 | #define FM(i) mi = _ip[i] < mi?_ip[i]:mi; mx = _ip[i] > mx?_ip[i]:mx 622 | #define BITFM(_t_, _in_,_n_) { _t_ *_ip; \ 623 | for(_ip = _in_, mi = mx = *_ip; _ip != _in_+(_n_&~(4-1)); _ip += 4) { FM(0); FM(1); FM(2); FM(3); }\ 624 | for(;_ip != _in_+_n_; _ip++) FM(0);\ 625 | } 626 | 627 | uint8_t bitfm8( uint8_t *in, unsigned n, uint8_t *px, uint8_t *pmin) { uint8_t mi,mx; BITFM(uint8_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; } 628 | uint16_t bitfm16(uint16_t *in, unsigned n, uint16_t *px, uint16_t *pmin) { uint16_t mi,mx; BITFM(uint16_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; } 629 | uint32_t bitfm32(uint32_t *in, unsigned n, uint32_t *px, uint32_t *pmin) { uint32_t mi,mx; BITFM(uint32_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; } 630 | uint64_t bitfm64(uint64_t *in, unsigned n, uint64_t *px, uint64_t *pmin) { uint64_t mi,mx; BITFM(uint64_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; } 631 | 632 | //----------- Lossy floating point conversion: pad the trailing mantissa bits with zero bits according to the relative error e (ex. 0.00001) ---------- 633 | #ifdef USE_FLOAT16 634 | // https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point 635 | #define ctof16(_cp_) (*(_Float16 *)(_cp_)) 636 | 637 | static inline _Float16 _fppad16(_Float16 d, float e, int lg2e) { 638 | uint16_t u, du = ctou16(&d); 639 | int b = (du>>10 & 0x1f)-15; // mantissa=10 bits, exponent=5bits, bias=15 640 | if ((b = 12 - b - lg2e) <= 0) return d; 641 | b = (b > 10) ? 
10 : b; 642 | do { u = du & (~((1u<<(--b))-1)); } while (fabs((ctof16(&u) - d)/d) > e); 643 | return ctof16(&u); 644 | } 645 | 646 | void fppad16(_Float16 *in, size_t n, _Float16 *out, float e) { int lg2e = -log(e)/log(2.0); _Float16 *ip; for (ip = in; ip < in+n; ip++,out++) *out = _fppad16(*ip, e, lg2e); } 647 | #endif 648 | 649 | //do u = du & (~((1u<<(--b))-1)); while(fabsf((ctof32(&u) - d)/d) > e); 650 | #define OP(t,s) sign = du & ((t)1<<(s-1)); du &= ~((t)1<<(s-1)); d = TEMPLATE2(ctof,s)(&du);\ 651 | do u = du & (~(((t)1<<(--b))-1)); while(d - TEMPLATE2(ctof,s)(&u) > e*d);\ 652 | u |= sign;\ 653 | return TEMPLATE2(ctof,s)(&u); 654 | 655 | static inline float _fppad32(float d, float e, int lg2e) { 656 | uint32_t u, du = ctou32(&d), sign; 657 | int b = (du>>23 & 0xff)-0x7e; 658 | if((b = 25 - b - lg2e) <= 0) 659 | return d; 660 | b = b > 23?23:b; 661 | sign = du & (1<<31); 662 | du &= 0x7fffffffu; 663 | d = ctof32(&du); 664 | do u = du & (~((1u<<(--b))-1)); while(d - ctof32(&u) > e*d); 665 | u |= sign; 666 | return ctof32(&u); 667 | } 668 | 669 | void fppad32(float *in, size_t n, float *out, float e) { int lg2e = -log(e)/log(2.0); float *ip; for(ip = in; ip < in+n; ip++,out++) *out = _fppad32(*ip, e, lg2e); } 670 | 671 | static inline double _fppad64(double d, double e, int lg2e) { 672 | union r { uint64_t u; double d; } u,du; du.d = d; 673 | uint64_t sign; 674 | int b = (du.u>>52 & 0x7ff)-0x3fe; 675 | if((b = 54 - b - lg2e) <= 0) 676 | return d; 677 | b = b > 52?52:b; 678 | sign = du.u & (1ull<<63); du.u &= 0x7fffffffffffffffull; 679 | int _b = b; 680 | for(;;) { if((_b -= 8) <= 0) break; u.u = du.u & (~((1ull<<_b)-1)); if(d - u.d <= e*d) break; b = _b; } 681 | do u.u = du.u & (~((1ull<<(--b))-1)); while(d - u.d > e*d); 682 | u.u |= sign; 683 | return ctof64(&u); 684 | } 685 | 686 | void fppad64(double *in, size_t n, double *out, double e) { int lg2e = -log(e)/log(2.0); double *ip; for(ip = in; ip < in+n; ip++,out++) *out = _fppad64(*ip, e, lg2e); } 687 | -------------------------------------------------------------------------------- /bitutil.h: -------------------------------------------------------------------------------- 1 | /** 2 | Copyright (C) powturbo 2013-2019 3 | GPL v2 License 4 | 5 | This program is free software; you can redistribute it and/or modify 6 | it under the terms of the GNU General Public License as published by 7 | the Free Software Foundation; either version 2 of the License, or 8 | (at your option) any later version. 9 | 10 | This program is distributed in the hope that it will be useful, 11 | but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | GNU General Public License for more details. 14 | 15 | You should have received a copy of the GNU General Public License along 16 | with this program; if not, write to the Free Software Foundation, Inc., 17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
18 | 19 | - homepage : https://sites.google.com/site/powturbo/ 20 | - github : https://github.com/powturbo 21 | - twitter : https://twitter.com/powturbo 22 | - email : powturbo [_AT_] gmail [_DOT_] com 23 | **/ 24 | // "Integer Compression: max.bits, delta, zigzag, xor" 25 | 26 | #ifdef BITUTIL_IN 27 | #ifdef __AVX2__ 28 | #include 29 | #elif defined(__AVX__) 30 | #include 31 | #elif defined(__SSE4_1__) 32 | #include 33 | #elif defined(__SSSE3__) 34 | #include 35 | #elif defined(__SSE2__) 36 | #include 37 | #elif defined(__ARM_NEON) 38 | #include 39 | #endif 40 | #if defined(_MSC_VER) && _MSC_VER < 1600 41 | #include "vs/stdint.h" 42 | #else 43 | #include 44 | #endif 45 | #include "sse_neon.h" 46 | 47 | #ifdef __ARM_NEON 48 | #define PREFETCH(_ip_,_rw_) 49 | #else 50 | #define PREFETCH(_ip_,_rw_) __builtin_prefetch(_ip_,_rw_) 51 | #endif 52 | //------------------------ zigzag encoding ------------------------------------------------------------- 53 | static inline unsigned char zigzagenc8( signed char x) { return x << 1 ^ x >> 7; } 54 | static inline char zigzagdec8( unsigned char x) { return x >> 1 ^ -(x & 1); } 55 | 56 | static inline unsigned short zigzagenc16(short x) { return x << 1 ^ x >> 15; } 57 | static inline short zigzagdec16(unsigned short x) { return x >> 1 ^ -(x & 1); } 58 | 59 | static inline unsigned zigzagenc32(int x) { return x << 1 ^ x >> 31; } 60 | static inline int zigzagdec32(unsigned x) { return x >> 1 ^ -(x & 1); } 61 | 62 | static inline uint64_t zigzagenc64(int64_t x) { return x << 1 ^ x >> 63; } 63 | static inline int64_t zigzagdec64(uint64_t x) { return x >> 1 ^ -(x & 1); } 64 | 65 | #if defined(__SSE2__) || defined(__ARM_NEON) 66 | static ALWAYS_INLINE __m128i mm_zzage_epi16(__m128i v) { return _mm_xor_si128(_mm_slli_epi16(v,1), _mm_srai_epi16(v,15)); } 67 | static ALWAYS_INLINE __m128i mm_zzage_epi32(__m128i v) { return _mm_xor_si128(_mm_slli_epi32(v,1), _mm_srai_epi32(v,31)); } 68 | //static ALWAYS_INLINE __m128i mm_zzage_epi64(__m128i v) { return _mm_xor_si128(_mm_slli_epi64(v,1), _mm_srai_epi64(v,63)); } 69 | 70 | static ALWAYS_INLINE __m128i mm_zzagd_epi16(__m128i v) { return _mm_xor_si128(_mm_srli_epi16(v,1), _mm_srai_epi16(_mm_slli_epi16(v,15),15) ); } 71 | static ALWAYS_INLINE __m128i mm_zzagd_epi32(__m128i v) { return _mm_xor_si128(_mm_srli_epi32(v,1), _mm_srai_epi32(_mm_slli_epi32(v,31),31) ); } 72 | //static ALWAYS_INLINE __m128i mm_zzagd_epi64(__m128i v) { return _mm_xor_si128(_mm_srli_epi64(v,1), _mm_srai_epi64(_mm_slli_epi64(v,63),63) ); } 73 | 74 | #endif 75 | #ifdef __AVX2__ 76 | static ALWAYS_INLINE __m256i mm256_zzage_epi32(__m256i v) { return _mm256_xor_si256(_mm256_slli_epi32(v,1), _mm256_srai_epi32(v,31)); } 77 | static ALWAYS_INLINE __m256i mm256_zzagd_epi32(__m256i v) { return _mm256_xor_si256(_mm256_srli_epi32(v,1), _mm256_srai_epi32(_mm256_slli_epi32(v,31),31) ); } 78 | #endif 79 | 80 | //-------------- AVX2 delta + prefix sum (scan) / xor encode/decode --------------------------------------------------------------------------------------- 81 | #ifdef __AVX2__ 82 | static ALWAYS_INLINE __m256i mm256_delta_epi32(__m256i v, __m256i sv) { return _mm256_sub_epi32(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 12)); } 83 | static ALWAYS_INLINE __m256i mm256_delta_epi64(__m256i v, __m256i sv) { return _mm256_sub_epi64(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 8)); } 84 | static ALWAYS_INLINE __m256i mm256_xore_epi32( __m256i v, __m256i sv) { return 
_mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 12)); } 85 | static ALWAYS_INLINE __m256i mm256_xore_epi64( __m256i v, __m256i sv) { return _mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 8)); } 86 | 87 | static ALWAYS_INLINE __m256i mm256_scan_epi32(__m256i v, __m256i sv) { 88 | v = _mm256_add_epi32(v, _mm256_slli_si256(v, 4)); 89 | v = _mm256_add_epi32(v, _mm256_slli_si256(v, 8)); 90 | return _mm256_add_epi32( _mm256_permute2x128_si256( _mm256_shuffle_epi32(sv,_MM_SHUFFLE(3, 3, 3, 3)), sv, 0x11), 91 | _mm256_add_epi32(v, _mm256_permute2x128_si256(_mm256_setzero_si256(),_mm256_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3)), 0x20))); 92 | } 93 | static ALWAYS_INLINE __m256i mm256_xord_epi32(__m256i v, __m256i sv) { 94 | v = _mm256_xor_si256(v, _mm256_slli_si256(v, 4)); 95 | v = _mm256_xor_si256(v, _mm256_slli_si256(v, 8)); 96 | return _mm256_xor_si256( _mm256_permute2x128_si256( _mm256_shuffle_epi32(sv,_MM_SHUFFLE(3, 3, 3, 3)), sv, 0x11), 97 | _mm256_xor_si256(v, _mm256_permute2x128_si256(_mm256_setzero_si256(),_mm256_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3)), 0x20))); 98 | } 99 | 100 | static ALWAYS_INLINE __m256i mm256_scan_epi64(__m256i v, __m256i sv) { 101 | v = _mm256_add_epi64(v, _mm256_alignr_epi8(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), 8)); 102 | return _mm256_add_epi64(_mm256_permute4x64_epi64(sv, _MM_SHUFFLE(3, 3, 3, 3)), _mm256_add_epi64(_mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), v) ); 103 | } 104 | static ALWAYS_INLINE __m256i mm256_xord_epi64(__m256i v, __m256i sv) { 105 | v = _mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), 8)); 106 | return _mm256_xor_si256(_mm256_permute4x64_epi64(sv, _MM_SHUFFLE(3, 3, 3, 3)), _mm256_xor_si256(_mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), v) ); 107 | } 108 | 109 | static ALWAYS_INLINE __m256i mm256_scani_epi32(__m256i v, __m256i sv, __m256i vi) { return _mm256_add_epi32(mm256_scan_epi32(v, sv), vi); } 110 | #endif 111 | 112 | #if defined(__SSSE3__) || defined(__ARM_NEON) 113 | static ALWAYS_INLINE __m128i mm_delta_epi16(__m128i v, __m128i sv) { return _mm_sub_epi16(v, _mm_alignr_epi8(v, sv, 14)); } 114 | static ALWAYS_INLINE __m128i mm_delta_epi32(__m128i v, __m128i sv) { return _mm_sub_epi32(v, _mm_alignr_epi8(v, sv, 12)); } 115 | static ALWAYS_INLINE __m128i mm_xore_epi16( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_alignr_epi8(v, sv, 14)); } 116 | static ALWAYS_INLINE __m128i mm_xore_epi32( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_alignr_epi8(v, sv, 12)); } 117 | 118 | #define MM_HDEC_EPI16(_v_,_sv_,_hop_) {\ 119 | _v_ = _hop_( _v_, _mm_slli_si128(_v_, 2));\ 120 | _v_ = _hop_( _v_, _mm_slli_si128(_v_, 4));\ 121 | _v_ = _hop_(_hop_(_v_, _mm_slli_si128(_v_, 8)), _mm_shuffle_epi8(_sv_, _mm_set1_epi16(0x0f0e)));\ 122 | } 123 | 124 | static ALWAYS_INLINE __m128i mm_scan_epi16(__m128i v, __m128i sv) { MM_HDEC_EPI16(v,sv,_mm_add_epi16); return v; } 125 | static ALWAYS_INLINE __m128i mm_xord_epi16(__m128i v, __m128i sv) { MM_HDEC_EPI16(v,sv,_mm_xor_si128); return v; } 126 | #elif defined(__SSE2__) 127 | static ALWAYS_INLINE __m128i mm_delta_epi16(__m128i v, __m128i sv) { return _mm_sub_epi16(v, _mm_or_si128(_mm_srli_si128(sv, 14), _mm_slli_si128(v, 2))); } 128 | static ALWAYS_INLINE __m128i mm_xore_epi16( __m128i v, __m128i sv) { return _mm_xor_epi16(v, _mm_or_si128(_mm_srli_si128(sv, 14), _mm_slli_si128(v, 2))); } 129 | static 
ALWAYS_INLINE __m128i mm_delta_epi32(__m128i v, __m128i sv) { return _mm_sub_epi32(v, _mm_or_si128(_mm_srli_si128(sv, 12), _mm_slli_si128(v, 4))); }
130 | static ALWAYS_INLINE __m128i mm_xore_epi32( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_or_si128(_mm_srli_si128(sv, 12), _mm_slli_si128(v, 4))); }
131 | #endif
132 | 
133 | #if defined(__SSE2__) || defined(__ARM_NEON)
134 | #define MM_HDEC_EPI32(_v_,_sv_,_hop_) { _v_ = _hop_(_v_, _mm_slli_si128(_v_, 4)); _v_ = _hop_(mm_shuffle_nnnn_epi32(_sv_, 3), _hop_(_mm_slli_si128(_v_, 8), _v_)); }
135 | static ALWAYS_INLINE __m128i mm_scan_epi32(__m128i v, __m128i sv) { MM_HDEC_EPI32(v,sv,_mm_add_epi32); return v; }
136 | static ALWAYS_INLINE __m128i mm_xord_epi32(__m128i v, __m128i sv) { MM_HDEC_EPI32(v,sv,_mm_xor_si128); return v; }
137 | 
138 | //-------- scan with vi delta > 0 -----------------------------
139 | static ALWAYS_INLINE __m128i mm_scani_epi16(__m128i v, __m128i sv, __m128i vi) { return _mm_add_epi16(mm_scan_epi16(v, sv), vi); }
140 | static ALWAYS_INLINE __m128i mm_scani_epi32(__m128i v, __m128i sv, __m128i vi) { return _mm_add_epi32(mm_scan_epi32(v, sv), vi); }
141 | #endif
142 | 
143 | //------------------ Horizontal OR -----------------------------------------------
144 | #ifdef __AVX2__
145 | static ALWAYS_INLINE unsigned mm256_hor_epi32(__m256i v) {
146 | v = _mm256_or_si256(v, _mm256_srli_si256(v, 8));
147 | v = _mm256_or_si256(v, _mm256_srli_si256(v, 4));
148 | return _mm256_extract_epi32(v,0) | _mm256_extract_epi32(v, 4);
149 | }
150 | 
151 | static ALWAYS_INLINE uint64_t mm256_hor_epi64(__m256i v) {
152 | v = _mm256_or_si256(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(2, 0, 0, 1)));
153 | return _mm256_extract_epi64(v, 1) | _mm256_extract_epi64(v,0);
154 | }
155 | #endif
156 | 
157 | #if defined(__SSE2__) || defined(__ARM_NEON)
158 | #define MM_HOZ_EPI16(v,_hop_) {\
159 | v = _hop_(v, _mm_srli_si128(v, 8));\
160 | v = _hop_(v, _mm_srli_si128(v, 6));\
161 | v = _hop_(v, _mm_srli_si128(v, 4));\
162 | v = _hop_(v, _mm_srli_si128(v, 2));\
163 | }
164 | 
165 | #define MM_HOZ_EPI32(v,_hop_) {\
166 | v = _hop_(v, _mm_srli_si128(v, 8));\
167 | v = _hop_(v, _mm_srli_si128(v, 4));\
168 | }
169 | 
170 | static ALWAYS_INLINE uint16_t mm_hor_epi16( __m128i v) { MM_HOZ_EPI16(v,_mm_or_si128); return (unsigned short)_mm_cvtsi128_si32(v); }
171 | static ALWAYS_INLINE uint32_t mm_hor_epi32( __m128i v) { MM_HOZ_EPI32(v,_mm_or_si128); return (unsigned      )_mm_cvtsi128_si32(v); }
172 | static ALWAYS_INLINE uint64_t mm_hor_epi64( __m128i v) { v = _mm_or_si128( v, _mm_srli_si128(v, 8)); return (uint64_t )_mm_cvtsi128_si64(v); }
173 | #endif
174 | 
175 | //----------------- sub / add ----------------------------------------------------------
176 | #if defined(__SSE2__) || defined(__ARM_NEON)
177 | #define SUBI16x8(_v_, _sv_) _mm_sub_epi16(_v_, _sv_)
178 | #define SUBI32x4(_v_, _sv_) _mm_sub_epi32(_v_, _sv_)
179 | #define ADDI16x8(_v_, _sv_, _vi_) _sv_ = _mm_add_epi16(_mm_add_epi16(_sv_, _vi_),_v_)
180 | #define ADDI32x4(_v_, _sv_, _vi_) _sv_ = _mm_add_epi32(_mm_add_epi32(_sv_, _vi_),_v_)
181 | 
182 | //---------------- Convert _mm_cvtsi128_siXX -------------------------------------------
183 | static ALWAYS_INLINE uint8_t _mm_cvtsi128_si8 (__m128i v) { return (uint8_t )_mm_cvtsi128_si32(v); }
184 | static ALWAYS_INLINE uint16_t _mm_cvtsi128_si16(__m128i v) { return (uint16_t)_mm_cvtsi128_si32(v); }
185 | #endif
186 | 
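// Editor's note (illustrative, not part of the original header): mm_delta_epiNN and
// mm_scan_epiNN form an encode/decode pair: delta leaves per-lane differences, scan
// rebuilds the running (inclusive prefix) sum, with sv carrying the last value of the
// previous vector. A minimal scalar sketch of the assumed semantics:
#if 0
static void delta32(uint32_t *v, unsigned n, uint32_t start) { // encode: differences
  if(!n) return;
  for(unsigned i = n; --i > 0; ) v[i] -= v[i-1];
  v[0] -= start;
}
static void scan32(uint32_t *v, unsigned n, uint32_t start) {  // decode: prefix sum
  for(unsigned i = 0; i < n; i++) start = (v[i] += start);
}
#endif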
187 | //--------- memset -----------------------------------------
188 | #define BITFORSET_(_out_, _n_, _start_, _mindelta_) do { unsigned _i;\
189 | for(_i = 0; _i != (_n_&~3); _i+=4) { \
190 | _out_[_i+0] = _start_+(_i  )*_mindelta_; \
191 | _out_[_i+1] = _start_+(_i+1)*_mindelta_; \
192 | _out_[_i+2] = _start_+(_i+2)*_mindelta_; \
193 | _out_[_i+3] = _start_+(_i+3)*_mindelta_; \
194 | } \
195 | while(_i != _n_) \
196 | _out_[_i] = _start_+_i*_mindelta_, ++_i; \
197 | } while(0)
198 | 
199 | //--------- SIMD zero -----------------------------------------
200 | #ifdef __AVX2__
201 | #define BITZERO32(_out_, _n_, _start_) do {\
202 | __m256i _sv_ = _mm256_set1_epi32(_start_), *_ov = (__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_);\
203 | do _mm256_storeu_si256(_ov++, _sv_); while(_ov < _ove);\
204 | } while(0)
205 | 
206 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) do {\
207 | __m256i _sv = _mm256_set1_epi32(_start_), *_ov=(__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_), _cv = _mm256_set_epi32(7*_mindelta_,6*_mindelta_,5*_mindelta_,4*_mindelta_,3*_mindelta_,2*_mindelta_,1*_mindelta_,0); \
208 | _sv = _mm256_add_epi32(_sv, _cv);\
209 | _cv = _mm256_set1_epi32(8*_mindelta_);\
210 | do { _mm256_storeu_si256(_ov++, _sv); _sv = _mm256_add_epi32(_sv, _cv); } while(_ov < _ove);\
211 | } while(0)
212 | 
213 | #define BITDIZERO32(_out_, _n_, _start_, _mindelta_) do { __m256i _sv = _mm256_set1_epi32(_start_), _cv = _mm256_set_epi32(8*_mindelta_,7*_mindelta_,6*_mindelta_,5*_mindelta_,4*_mindelta_,3*_mindelta_,2*_mindelta_,_mindelta_), *_ov=(__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_);\
214 | _sv = _mm256_add_epi32(_sv, _cv); _cv = _mm256_set1_epi32(8*_mindelta_); do { _mm256_storeu_si256(_ov++, _sv), _sv = _mm256_add_epi32(_sv, _cv); } while(_ov < _ove);\
215 | } while(0)
216 | 
217 | #elif defined(__SSE2__) || defined(__ARM_NEON) // -------------
218 | // SIMD set value (memset)
219 | #define BITZERO32(_out_, _n_, _v_) do {\
220 | __m128i _sv_ = _mm_set1_epi32(_v_), *_ov = (__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_);\
221 | do _mm_storeu_si128(_ov++, _sv_); while(_ov < _ove); \
222 | } while(0)
223 | 
224 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) do {\
225 | __m128i _sv = _mm_set1_epi32(_start_), *_ov=(__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_), _cv = _mm_set_epi32(3*_mindelta_,2*_mindelta_,1*_mindelta_,0); \
226 | _sv = _mm_add_epi32(_sv, _cv);\
227 | _cv = _mm_set1_epi32(4*_mindelta_);\
228 | do { _mm_storeu_si128(_ov++, _sv); _sv = _mm_add_epi32(_sv, _cv); } while(_ov < _ove);\
229 | } while(0)
230 | 
231 | #define BITDIZERO32(_out_, _n_, _start_, _mindelta_) do { __m128i _sv = _mm_set1_epi32(_start_), _cv = _mm_set_epi32(4*_mindelta_,3*_mindelta_,2*_mindelta_,_mindelta_), *_ov=(__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_);\
232 | _sv = _mm_add_epi32(_sv, _cv); _cv = _mm_set1_epi32(4*_mindelta_); do { _mm_storeu_si128(_ov++, _sv), _sv = _mm_add_epi32(_sv, _cv); } while(_ov < _ove);\
233 | } while(0)
234 | #else
235 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) BITFORSET_(_out_, _n_, _start_, _mindelta_)
236 | #define BITZERO32( _out_, _n_, _start_) BITFORSET_(_out_, _n_, _start_, 0)
237 | #endif
238 | 
239 | #define DELTR( _in_, _n_, _start_, _mindelta_, _out_) { unsigned _v; for( _v = 0; _v < _n_; _v++) _out_[_v] = _in_[_v] - (_start_) - _v*(_mindelta_) - (_mindelta_); }
240 | #define DELTRB(_in_, _n_, _start_, _mindelta_, _b_, _out_) { unsigned _v; for(_b_=0,_v = 0; _v < _n_; _v++) _out_[_v] = _in_[_v] - (_start_) - _v*(_mindelta_) - (_mindelta_), _b_ |= _out_[_v]; _b_ = bsr32(_b_); }
241 | 
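// Editor's note (illustrative): DELTR rewrites a sorted array as residues above the
// arithmetic sequence start + (i+1)*mindelta; DELTRB additionally ORs all residues
// and leaves their maximum bit length in _b_ via bsr32. Worked example (semantics
// read off the macro bodies above):
//   in = {13,16,20}, start = 10, mindelta = 2  ->  out = {1,2,4}, _b_ = bsr32(1|2|4) = 3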
242 | //----------------------------------------- bitreverse scalar + SIMD -------------------------------------------
243 | #if __clang__ //__has_builtin(__builtin_bitreverse64)
244 | #define rbit8(x) __builtin_bitreverse8( x)
245 | #define rbit16(x) __builtin_bitreverse16(x)
246 | #define rbit32(x) __builtin_bitreverse32(x)
247 | #define rbit64(x) __builtin_bitreverse64(x)
248 | #else
249 | 
250 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
251 | static ALWAYS_INLINE uint32_t _rbit_(uint32_t x) { uint32_t rc; __asm volatile ("rbit %0, %1" : "=r" (rc) : "r" (x) ); return rc; }
252 | #endif
253 | static ALWAYS_INLINE uint8_t rbit8(uint8_t x) {
254 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
255 | return _rbit_(x) >> 24;
256 | #elif 0
257 | x = (x & 0xaa) >> 1 | (x & 0x55) << 1;
258 | x = (x & 0xcc) >> 2 | (x & 0x33) << 2;
259 | return x << 4 | x >> 4;
260 | #else
261 | return (x * 0x0202020202ull & 0x010884422010ull) % 1023;
262 | #endif
263 | }
264 | 
265 | static ALWAYS_INLINE uint16_t rbit16(uint16_t x) {
266 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
267 | return _rbit_(x) >> 16;
268 | #else
269 | x = (x & 0xaaaa) >> 1 | (x & 0x5555) << 1;
270 | x = (x & 0xcccc) >> 2 | (x & 0x3333) << 2;
271 | x = (x & 0xf0f0) >> 4 | (x & 0x0f0f) << 4;
272 | return x << 8 | x >> 8;
273 | #endif
274 | }
275 | 
276 | static ALWAYS_INLINE uint32_t rbit32(uint32_t x) {
277 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
278 | return _rbit_(x);
279 | #else
280 | x = ((x & 0xaaaaaaaa) >> 1 | (x & 0x55555555) << 1);
281 | x = ((x & 0xcccccccc) >> 2 | (x & 0x33333333) << 2);
282 | x = ((x & 0xf0f0f0f0) >> 4 | (x & 0x0f0f0f0f) << 4);
283 | x = ((x & 0xff00ff00) >> 8 | (x & 0x00ff00ff) << 8);
284 | return x << 16 | x >> 16;
285 | #endif
286 | }
287 | static ALWAYS_INLINE uint64_t rbit64(uint64_t x) {
288 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
289 | return (uint64_t)_rbit_(x) << 32 | _rbit_(x >> 32);
290 | #else
291 | x = (x & 0xaaaaaaaaaaaaaaaa) >> 1 | (x & 0x5555555555555555) << 1;
292 | x = (x & 0xcccccccccccccccc) >> 2 | (x & 0x3333333333333333) << 2;
293 | x = (x & 0xf0f0f0f0f0f0f0f0) >> 4 | (x & 0x0f0f0f0f0f0f0f0f) << 4;
294 | x = (x & 0xff00ff00ff00ff00) >> 8 | (x & 0x00ff00ff00ff00ff) << 8;
295 | x = (x & 0xffff0000ffff0000) >> 16 | (x & 0x0000ffff0000ffff) << 16;
296 | return x << 32 | x >> 32;
297 | #endif
298 | }
299 | #endif
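// Editor's illustration (not part of the original header): the fallback rbit8 above
// uses the classic multiply/mask/modulo-1023 byte-reversal trick; a naive cross-check,
// assuming plain C99:
#if 0
static int rbit8_check(void) {
  for(unsigned x = 0; x < 256; x++) {
    unsigned r = 0, i;
    for(i = 0; i < 8; i++) r |= ((x >> i) & 1) << (7 - i); // bit-by-bit reversal
    if(r != (unsigned)((x * 0x0202020202ull & 0x010884422010ull) % 1023)) return 0;
  }
  return 1; // all 256 byte values match
}
#endif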
300 | 
301 | #if defined(__SSSE3__) || defined(__ARM_NEON)
302 | static ALWAYS_INLINE __m128i mm_rbit_epi16(__m128i v) { return mm_rbit_epi8(mm_rev_epi16(v)); }
303 | static ALWAYS_INLINE __m128i mm_rbit_epi32(__m128i v) { return mm_rbit_epi8(mm_rev_epi32(v)); }
304 | static ALWAYS_INLINE __m128i mm_rbit_epi64(__m128i v) { return mm_rbit_epi8(mm_rev_epi64(v)); }
305 | //static ALWAYS_INLINE __m128i mm_rbit_si128(__m128i v) { return mm_rbit_epi8(mm_rev_si128(v)); }
306 | #endif
307 | 
308 | #ifdef __AVX2__
309 | static ALWAYS_INLINE __m256i mm256_rbit_epi8(__m256i v) {
310 | __m256i fv = _mm256_setr_epi8(0, 8, 4,12, 2,10, 6,14, 1, 9, 5,13, 3,11, 7,15, 0, 8, 4,12, 2,10, 6,14, 1, 9, 5,13, 3,11, 7,15), cv0f_8 = _mm256_set1_epi8(0xf);
311 | __m256i lv = _mm256_shuffle_epi8(fv,_mm256_and_si256( v, cv0f_8));
312 | __m256i hv = _mm256_shuffle_epi8(fv,_mm256_and_si256(_mm256_srli_epi64(v, 4), cv0f_8));
313 | return _mm256_or_si256(_mm256_slli_epi64(lv,4), hv);
314 | }
315 | 
316 | static ALWAYS_INLINE __m256i mm256_rev_epi16(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,11,10,13,12,15,14, 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,11,10,13,12,15,14)); }
317 | static ALWAYS_INLINE __m256i mm256_rev_epi32(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 3, 2, 1, 0, 7, 6, 5, 4, 11,10, 9, 8,15,14,13,12, 3, 2, 1, 0, 7, 6, 5, 4, 11,10, 9, 8,15,14,13,12)); }
318 | static ALWAYS_INLINE __m256i mm256_rev_epi64(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8)); }
319 | static ALWAYS_INLINE __m256i mm256_rev_si128(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8(15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)); }
320 | 
321 | static ALWAYS_INLINE __m256i mm256_rbit_epi16(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi16(v)); }
322 | static ALWAYS_INLINE __m256i mm256_rbit_epi32(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi32(v)); }
323 | static ALWAYS_INLINE __m256i mm256_rbit_epi64(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi64(v)); }
324 | static ALWAYS_INLINE __m256i mm256_rbit_si128(__m256i v) { return mm256_rbit_epi8(mm256_rev_si128(v)); }
325 | #endif
326 | #endif
327 | 
328 | //---------- max. bit length + transform for sorted/unsorted arrays: delta, delta 1, delta > 1, zigzag, zigzag of delta, xor, FOR ----------------
329 | #ifdef __cplusplus
330 | extern "C" {
331 | #endif
332 | //------ ORed array, for maximum bit length of the elements in the unsorted integer array ---------------------
333 | uint8_t bit8( uint8_t *in, unsigned n, uint8_t *px);
334 | uint16_t bit16(uint16_t *in, unsigned n, uint16_t *px);
335 | uint32_t bit32(uint32_t *in, unsigned n, uint32_t *px);
336 | uint64_t bit64(uint64_t *in, unsigned n, uint64_t *px);
337 | 
338 | //-------------- delta = 0: Sorted integer array w/ mindelta = 0 ----------------------------------------------
339 | //-- ORed array, maximum bit length of the non-decreasing integer array. out[i] = in[i] - in[i-1]
340 | uint8_t bitd8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
341 | uint16_t bitd16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
342 | uint32_t bitd32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
343 | uint64_t bitd64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
344 | 
345 | //-- in-place reverse delta 0
346 | void bitddec8( uint8_t *p, unsigned n, uint8_t start); // non-decreasing (out[i] = in[i] - in[i-1])
347 | void bitddec16( uint16_t *p, unsigned n, uint16_t start);
348 | void bitddec32( uint32_t *p, unsigned n, uint32_t start);
349 | void bitddec64( uint64_t *p, unsigned n, uint64_t start);
350 | 
351 | //-- vectorized fast stride-4 delta: out[0] = in[4]-in[0], out[1]=in[5]-in[1], out[2]=in[6]-in[2], out[3]=in[7]-in[3],...
352 | uint16_t bits128v16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
353 | uint32_t bits128v32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
354 | 
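// Editor's usage sketch (not in the original header; assumes, per the comment above,
// that bitdNN returns the OR of all successive deltas, so bsr32() of the result is
// the bit width needed to code any delta):
//   uint32_t a[4] = {10, 12, 15, 21}, x;
//   unsigned b = bsr32(bitd32(a, 4, &x, 10)); // deltas 0,2,3,6 -> OR = 7 -> b = 3 bits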
355 | //------------- delta = 1: Sorted integer array w/ mindelta = 1 ---------------------------------------------
356 | //-- get the maximum delta bit length of the strictly increasing integer array. out[i] = in[i] - in[i-1] - 1
357 | uint8_t bitd18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
358 | uint16_t bitd116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
359 | uint32_t bitd132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
360 | uint64_t bitd164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
361 | 
362 | //-- in-place reverse delta one
363 | void bitd1dec8( uint8_t *p, unsigned n, uint8_t start); // strictly increasing (out[i] = in[i] - in[i-1] - 1)
364 | void bitd1dec16( uint16_t *p, unsigned n, uint16_t start);
365 | void bitd1dec32( uint32_t *p, unsigned n, uint32_t start);
366 | void bitd1dec64( uint64_t *p, unsigned n, uint64_t start);
367 | 
368 | //------------- delta > 1: Sorted integer array w/ mindelta > 1 ---------------------------------------------
369 | //-- ORed array, for max. bit length; also determines the minimum delta (mindelta)
370 | uint8_t bitdi8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
371 | uint16_t bitdi16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
372 | uint32_t bitdi32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
373 | uint64_t bitdi64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
374 | //-- transform sorted integer array to delta array: out[i] = in[i] - in[i-1] - mindelta
375 | uint8_t bitdienc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
376 | uint16_t bitdienc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
377 | uint32_t bitdienc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
378 | uint64_t bitdienc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
379 | //-- in-place reverse delta
380 | void bitdidec8( uint8_t *in, unsigned n, uint8_t start, uint8_t mindelta);
381 | void bitdidec16(uint16_t *in, unsigned n, uint16_t start, uint16_t mindelta);
382 | void bitdidec32(uint32_t *in, unsigned n, uint32_t start, uint32_t mindelta);
383 | void bitdidec64(uint64_t *in, unsigned n, uint64_t start, uint64_t mindelta);
384 | 
385 | //------------- FOR : array bit length: ---------------------------------------------------------------------
386 | //------ ORed array, for max. bit length of the non-decreasing integer array. out[i] = in[i] - start
387 | uint8_t bitf8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
388 | uint16_t bitf16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
389 | uint32_t bitf32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
390 | uint64_t bitf64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
391 | 
392 | //------ ORed array, for max. bit length of the strictly increasing integer array. out[i] = in[i] - 1 - start
393 | uint8_t bitf18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
394 | uint16_t bitf116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
395 | uint32_t bitf132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
396 | uint64_t bitf164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
397 | 
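// Editor's usage sketch for FOR (frame of reference; assumed semantics per the
// comment above, residual out[i] = in[i] - start):
//   in = {100, 103, 101, 107}, start = 100 -> residues {0, 3, 1, 7},
//   OR = 7, so bsr32(bitf32(in, 4, &x, 100)) = 3 bits per element.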
398 | //------ ORed array, for max. bit length of the unsorted array (the minimum value is returned in *pmin)
399 | uint8_t bitfm8( uint8_t *in, unsigned n, uint8_t *px, uint8_t *pmin); // unsorted
400 | uint16_t bitfm16(uint16_t *in, unsigned n, uint16_t *px, uint16_t *pmin);
401 | uint32_t bitfm32(uint32_t *in, unsigned n, uint32_t *px, uint32_t *pmin);
402 | uint64_t bitfm64(uint64_t *in, unsigned n, uint64_t *px, uint64_t *pmin);
403 | 
404 | //------------- Zigzag encoding for unsorted integer lists: out[i] = zigzag(in[i] - in[i-1]) ------------------------
405 | //-- ORed array, to get the maximum zigzag bit length of the integer array
406 | uint8_t bitz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
407 | uint16_t bitz16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
408 | uint32_t bitz32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
409 | uint64_t bitz64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
410 | //-- Zigzag transform
411 | uint8_t bitzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
412 | uint16_t bitzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
413 | uint32_t bitzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
414 | uint64_t bitzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
415 | //-- in-place zigzag reverse transform
416 | void bitzdec8( uint8_t *in, unsigned n, uint8_t start);
417 | void bitzdec16( uint16_t *in, unsigned n, uint16_t start);
418 | void bitzdec32( uint32_t *in, unsigned n, uint32_t start);
419 | void bitzdec64( uint64_t *in, unsigned n, uint64_t start);
420 | 
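// Editor's illustration (not part of the original header): zigzag folds signed deltas
// into small unsigned codes (0,-1,1,-2,... -> 0,1,2,3,...), so a sign bit does not
// force a full-width value. A minimal scalar sketch of the per-element mapping the
// bitzenc/bitzdec pair is built on (assumption: standard zigzag):
#if 0
static inline uint32_t zigzag32 (int32_t d)  { return ((uint32_t)d << 1) ^ (uint32_t)(d >> 31); }
static inline int32_t unzigzag32(uint32_t z) { return (int32_t)(z >> 1) ^ -(int32_t)(z & 1); }
#endif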
421 | //------------- Zigzag of zigzag/delta : unsorted/sorted integer array ----------------------------------------------------
422 | //-- ORed array, to get the maximum bit length of the zigzag-of-delta transform
423 | uint8_t bitzz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
424 | uint16_t bitzz16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
425 | uint32_t bitzz32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
426 | uint64_t bitzz64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
427 | 
428 | uint8_t bitzzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
429 | uint16_t bitzzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
430 | uint32_t bitzzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
431 | uint64_t bitzzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
432 | 
433 | //-- in-place reverse zigzag of delta (encoded w/ bitzzencNN and parameter mindelta = 1)
434 | void bitzzdec8( uint8_t *in, unsigned n, uint8_t start);
435 | void bitzzdec16( uint16_t *in, unsigned n, uint16_t start);
436 | void bitzzdec32( uint32_t *in, unsigned n, uint32_t start);
437 | void bitzzdec64( uint64_t *in, unsigned n, uint64_t start);
438 | 
439 | //------------- XOR encoding for unsorted integer lists: out[i] = in[i] ^ in[i-1] -------------
440 | //-- ORed array, to get the maximum xor bit length of the integer array
441 | uint8_t bitx8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
442 | uint16_t bitx16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
443 | uint32_t bitx32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
444 | uint64_t bitx64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
445 | 
446 | //-- XOR transform
447 | uint8_t bitxenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start);
448 | uint16_t bitxenc16( uint16_t *in, unsigned n, uint16_t *out, uint16_t start);
449 | uint32_t bitxenc32( uint32_t *in, unsigned n, uint32_t *out, uint32_t start);
450 | uint64_t bitxenc64( uint64_t *in, unsigned n, uint64_t *out, uint64_t start);
451 | 
452 | //-- XOR in-place reverse transform
453 | void bitxdec8( uint8_t *p, unsigned n, uint8_t start);
454 | void bitxdec16( uint16_t *p, unsigned n, uint16_t start);
455 | void bitxdec32( uint32_t *p, unsigned n, uint32_t start);
456 | void bitxdec64( uint64_t *p, unsigned n, uint64_t start);
457 | 
458 | //------- Lossy floating point transform: pad the trailing mantissa bits with zeros according to the error e (ex. 
e=0.00001) 459 | #ifdef USE_FLOAT16 460 | void fppad16(_Float16 *in, size_t n, _Float16 *out, float e); 461 | #endif 462 | void fppad32(float *in, size_t n, float *out, float e); 463 | void fppad64(double *in, size_t n, double *out, double e); 464 | 465 | #ifdef __cplusplus 466 | } 467 | #endif 468 | 469 | //---- Floating point to Integer decomposition --------------------------------- 470 | // seeeeeeee21098765432109876543210 (s:sign, e:exponent, 0-9:mantissa) 471 | #ifdef BITUTIL_IN 472 | #define MANTF32 23 473 | #define MANTF64 52 474 | 475 | #define BITFENC(_u_, _sgn_, _expo_, _mant_, _mantbits_, _one_) _sgn_ = _u_ >> (sizeof(_u_)*8-1); _expo_ = ((_u_ >> (_mantbits_)) & ( (_one_<<(sizeof(_u_)*8 - 1 - _mantbits_)) -1)); _mant_ = _u_ & ((_one_<<_mantbits_)-1); 476 | #define BITFDEC( _sgn_, _expo_, _mant_, _u_, _mantbits_) _u_ = (_sgn_) << (sizeof(_u_)*8-1) | (_expo_) << _mantbits_ | (_mant_) 477 | #endif 478 | -------------------------------------------------------------------------------- /conf.h: -------------------------------------------------------------------------------- 1 | /** 2 | Copyright (C) powturbo 2013-2019 3 | GPL v2 License 4 | 5 | This program is free software; you can redistribute it and/or modify 6 | it under the terms of the GNU General Public License as published by 7 | the Free Software Foundation; either version 2 of the License, or 8 | (at your option) any later version. 9 | 10 | This program is distributed in the hope that it will be useful, 11 | but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | GNU General Public License for more details. 14 | 15 | You should have received a copy of the GNU General Public License along 16 | with this program; if not, write to the Free Software Foundation, Inc., 17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
18 | 
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | 
25 | // conf.h - config & common
26 | #ifndef CONF_H
27 | #define CONF_H
28 | //------------------------- Compiler ------------------------------------------
29 | #if defined(__GNUC__)
30 | #include <stdint.h>
31 | #define ALIGNED(t,v,n) t v __attribute__ ((aligned (n)))
32 | #define ALWAYS_INLINE inline __attribute__((always_inline))
33 | #define NOINLINE __attribute__((noinline))
34 | #define _PACKED __attribute__ ((packed))
35 | #define likely(x) __builtin_expect((x),1)
36 | #define unlikely(x) __builtin_expect((x),0)
37 | 
38 | #define popcnt32(_x_) __builtin_popcount(_x_)
39 | #define popcnt64(_x_) __builtin_popcountll(_x_)
40 | 
41 | #if defined(__i386__) || defined(__x86_64__)
42 | //__bsr32: 1:0,2:1,3:1,4:2,5:2,6:2,7:2,8:3,9:3,10:3,11:3,12:3,13:3,14:3,15:3,16:4,17:4,18:4,19:4,20:4,21:4,22:4,23:4,24:4,25:4,26:4,27:4,28:4,29:4,30:4,31:4,32:5
43 | // bsr32: 0:0,1:1,2:2,3:2,4:3,5:3,6:3,7:3,8:4,9:4,10:4,11:4,12:4,13:4,14:4,15:4,16:5,17:5,18:5,19:5,20:5,21:5,22:5,23:5,24:5,25:5,26:5,27:5,28:5,29:5,30:5,31:5,32:6,
44 | static inline int __bsr32( int x) { asm("bsr %1,%0" : "=r" (x) : "rm" (x) ); return x; }
45 | static inline int bsr32( int x) { int b = -1; asm("bsrl %1,%0" : "+r" (b) : "rm" (x) ); return b + 1; }
46 | static inline int bsr64(uint64_t x) { return x?64 - __builtin_clzll(x):0; }
47 | 
48 | static inline unsigned rol32(unsigned x, int s) { asm ("roll %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
49 | static inline unsigned ror32(unsigned x, int s) { asm ("rorl %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
50 | static inline uint64_t rol64(uint64_t x, int s) { asm ("rolq %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
51 | static inline uint64_t ror64(uint64_t x, int s) { asm ("rorq %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
52 | #else
53 | static inline int __bsr32(unsigned x ) { return 31 - __builtin_clz( x); }
54 | static inline int bsr32(int x ) { return x?32 - __builtin_clz( x):0; }
55 | static inline int bsr64(uint64_t x) { return x?64 - __builtin_clzll(x):0; }
56 | 
57 | static inline unsigned rol32(unsigned x, int s) { return x << s | x >> (32 - s); }
58 | static inline unsigned ror32(unsigned x, int s) { return x >> s | x << (32 - s); }
59 | static inline uint64_t rol64(uint64_t x, int s) { return x << s | x >> (64 - s); }
60 | static inline uint64_t ror64(uint64_t x, int s) { return x >> s | x << (64 - s); }
61 | #endif
62 | 
63 | #define ctz64(_x_) __builtin_ctzll(_x_)
64 | #define ctz32(_x_) __builtin_ctz(_x_) // 0:32, ctz32(1<<_n_) = _n_
65 | #define clz64(_x_) __builtin_clzll(_x_)
66 | #define clz32(_x_) __builtin_clz(_x_)
67 | 
68 | 
69 | #if __GNUC__ > 4 || __GNUC__ == 4 && __GNUC_MINOR__ >= 8
70 | #define bswap16(x) __builtin_bswap16(x)
71 | #else
72 | static inline unsigned short bswap16(unsigned short x) { return __builtin_bswap32(x << 16); }
73 | #endif
74 | #define bswap32(x) __builtin_bswap32(x)
75 | #define bswap64(x) __builtin_bswap64(x)
76 | 
77 | #elif _MSC_VER //----------------------------------------------------
78 | #include <windows.h>
79 | #include <intrin.h>
80 | #if _MSC_VER < 1600
81 | #include "vs/stdint.h"
82 | #define __builtin_prefetch(x,a)
83 | #define inline __inline
84 | #else
85 | #include <stdint.h>
86 | #define __builtin_prefetch(x,a) _mm_prefetch(x, _MM_HINT_NTA)
87 | #endif
88 | 
89 | #define ALIGNED(t,v,n) __declspec(align(n)) t v
90 | #define ALWAYS_INLINE __forceinline
91 | #define NOINLINE __declspec(noinline)
92 | #define THREADLOCAL __declspec(thread)
93 | #define likely(x) 
(x)
94 | #define unlikely(x) (x)
95 | 
96 | static inline int __bsr32(unsigned x) { unsigned long z=0; _BitScanReverse(&z, x); return z; }
97 | static inline int bsr32( unsigned x) { unsigned long z; _BitScanReverse(&z, x); return x?z+1:0; }
98 | static inline int ctz32( unsigned x) { unsigned long z; _BitScanForward(&z, x); return x?z:32; }
99 | static inline int clz32( unsigned x) { unsigned long z; _BitScanReverse(&z, x); return x?31-z:32; }
100 | #if !defined(_M_ARM64) && !defined(_M_X64)
101 | static inline unsigned char _BitScanForward64(unsigned long* ret, uint64_t x) {
102 | unsigned long x0 = (unsigned long)x, top, bottom; _BitScanForward(&top, (unsigned long)(x >> 32)); _BitScanForward(&bottom, x0);
103 | *ret = x0 ? bottom : 32 + top; return x != 0;
104 | }
105 | static inline unsigned char _BitScanReverse64(unsigned long* ret, uint64_t x) {
106 | unsigned long x1 = (unsigned long)(x >> 32), top, bottom; _BitScanReverse(&top, x1); _BitScanReverse(&bottom, (unsigned long)x);
107 | *ret = x1 ? top + 32 : bottom; return x != 0;
108 | }
109 | #endif
110 | static inline int bsr64(uint64_t x) { unsigned long z=0; _BitScanReverse64(&z, x); return x?z+1:0; }
111 | static inline int ctz64(uint64_t x) { unsigned long z; _BitScanForward64(&z, x); return x?z:64; }
112 | static inline int clz64(uint64_t x) { unsigned long z; _BitScanReverse64(&z, x); return x?63-z:64; }
113 | 
114 | #define rol32(x,s) _lrotl(x, s)
115 | #define ror32(x,s) _lrotr(x, s)
116 | 
117 | #define bswap16(x) _byteswap_ushort(x)
118 | #define bswap32(x) _byteswap_ulong(x)
119 | #define bswap64(x) _byteswap_uint64(x)
120 | 
121 | #define popcnt32(x) __popcnt(x)
122 | #ifdef _WIN64
123 | #define popcnt64(x) __popcnt64(x)
124 | #else
125 | #define popcnt64(x) (popcnt32(x) + popcnt32(x>>32))
126 | #endif
127 | 
128 | #define sleep(x) Sleep((x)*1000)
129 | #define fseeko _fseeki64
130 | #define ftello _ftelli64
131 | #define strcasecmp _stricmp
132 | #define strncasecmp _strnicmp
133 | #define strtoull _strtoui64
134 | static inline double round(double num) { return (num > 0.0) ? 
floor(num + 0.5) : ceil(num - 0.5); }
135 | #endif
136 | 
137 | #define bsr8(_x_) bsr32(_x_)
138 | #define bsr16(_x_) bsr32(_x_)
139 | #define ctz8(_x_) ctz32(_x_)
140 | #define ctz16(_x_) ctz32(_x_)
141 | #define clz8(_x_) (clz32(_x_)-24)
142 | #define clz16(_x_) (clz32(_x_)-16)
143 | 
144 | #define popcnt8(x) popcnt32(x)
145 | #define popcnt16(x) popcnt32(x)
146 | 
147 | //--------------- Unaligned memory access -------------------------------------
148 | /*# || defined(i386) || defined(_X86_) || defined(__THW_INTEL)*/
149 | #if defined(__i386__) || defined(__x86_64__) || \
150 | defined(_M_IX86) || defined(_M_AMD64) || _MSC_VER ||\
151 | defined(__powerpc__) ||\
152 | defined(__ARM_FEATURE_UNALIGNED) || defined(__aarch64__) || defined(__arm__) ||\
153 | defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__) || \
154 | defined(__ARM_ARCH_5__) || defined(__ARM_ARCH_5T__) || defined(__ARM_ARCH_5TE__) || defined(__ARM_ARCH_5TEJ__) || \
155 | defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6T2__) || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__)
156 | #define ctou16(_cp_) (*(unsigned short *)(_cp_))
157 | #define ctou32(_cp_) (*(unsigned *)(_cp_))
158 | #define ctof32(_cp_) (*(float *)(_cp_))
159 | 
160 | #if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || defined(_MSC_VER)
161 | #define ctou64(_cp_) (*(uint64_t *)(_cp_))
162 | #define ctof64(_cp_) (*(double *)(_cp_))
163 | #elif defined(__ARM_FEATURE_UNALIGNED)
164 | struct _PACKED longu { uint64_t l; };
165 | struct _PACKED doubleu { double d; };
166 | #define ctou64(_cp_) ((struct longu *)(_cp_))->l
167 | #define ctof64(_cp_) ((struct doubleu *)(_cp_))->d
168 | #endif
169 | 
170 | #elif defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7S__)
171 | struct _PACKED shortu { unsigned short s; };
172 | struct _PACKED unsignedu { unsigned u; };
173 | struct _PACKED longu { uint64_t l; };
174 | struct _PACKED floatu { float f; };
175 | struct _PACKED doubleu { double d; };
176 | 
177 | #define ctou16(_cp_) ((struct shortu *)(_cp_))->s
178 | #define ctou32(_cp_) ((struct unsignedu *)(_cp_))->u
179 | #define ctou64(_cp_) ((struct longu *)(_cp_))->l
180 | #define ctof32(_cp_) ((struct floatu *)(_cp_))->f
181 | #define ctof64(_cp_) ((struct doubleu *)(_cp_))->d
182 | #else
183 | #error "unknown cpu"
184 | #endif
185 | 
186 | #ifdef ctou16
187 | //#define utoc16(_x_,_cp_) ctou16(_cp_) = _x_
188 | #else
189 | static inline unsigned short ctou16(void *cp) { unsigned short x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
190 | //static inline void utoc16(unsigned short x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
191 | #endif
192 | 
193 | #ifdef ctou32
194 | //#define utoc32(_x_,_cp_) ctou32(_cp_) = _x_
195 | #else
196 | static inline unsigned ctou32(void *cp) { unsigned x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
197 | //static inline void utoc32(unsigned x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
198 | #endif
199 | 
200 | #ifdef ctou64
201 | //#define utoc64(_x_,_cp_) ctou64(_cp_) = _x_
202 | #else
203 | static inline uint64_t ctou64(void *cp) { uint64_t x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
204 | //static inline void utoc64(uint64_t x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
205 | #endif
206 | 
207 | #define ctou24(_cp_) (ctou32(_cp_) & 0xffffff)
208 | #define ctou48(_cp_) (ctou64(_cp_) & 0xffffffffffffull)
209 | #define ctou8(_cp_) (*(_cp_)) 
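// Editor's note (illustrative): ctouNN/ctofNN are the unaligned-load primitives used
// throughout the library; on targets without direct unaligned access they fall back
// to the memcpy forms above, which modern compilers reduce to a single load. Typical use:
//   unsigned v = ctou32(buf + 3); // safe unaligned 32-bit read
//   // equivalent to: unsigned v; memcpy(&v, buf + 3, 4);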
210 | //--------------------- wordsize ---------------------------------------------- 211 | #if defined(__64BIT__) || defined(_LP64) || defined(__LP64__) || defined(_WIN64) ||\ 212 | defined(__x86_64__) || defined(_M_X64) ||\ 213 | defined(__ia64) || defined(_M_IA64) ||\ 214 | defined(__aarch64__) ||\ 215 | defined(__mips64) ||\ 216 | defined(__powerpc64__) || defined(__ppc64__) || defined(__PPC64__) ||\ 217 | defined(__s390x__) 218 | #define __WORDSIZE 64 219 | #else 220 | #define __WORDSIZE 32 221 | #endif 222 | #endif 223 | 224 | //---------------------misc --------------------------------------------------- 225 | #define BZHI64(_u_, _b_) ((_u_) & ((1ull<<(_b_))-1)) 226 | #define BZHI32(_u_, _b_) ((_u_) & ((1u <<(_b_))-1)) 227 | #define BZHI16(_u_, _b_) BZHI32(_u_, _b_) 228 | #define BZHI8(_u_, _b_) BZHI32(_u_, _b_) 229 | 230 | #define SIZE_ROUNDUP(_n_, _a_) (((size_t)(_n_) + (size_t)((_a_) - 1)) & ~(size_t)((_a_) - 1)) 231 | #define ALIGN_DOWN(__ptr, __a) ((void *)((uintptr_t)(__ptr) & ~(uintptr_t)((__a) - 1))) 232 | 233 | #define TEMPLATE2_(_x_, _y_) _x_##_y_ 234 | #define TEMPLATE2(_x_, _y_) TEMPLATE2_(_x_,_y_) 235 | 236 | #define TEMPLATE3_(_x_,_y_,_z_) _x_##_y_##_z_ 237 | #define TEMPLATE3(_x_,_y_,_z_) TEMPLATE3_(_x_, _y_, _z_) 238 | 239 | #define CACHE_LINE_SIZE 64 240 | #define PREFETCH_DISTANCE (CACHE_LINE_SIZE*4) 241 | //--- NDEBUG ------- 242 | #include 243 | #ifdef _MSC_VER 244 | #ifdef NDEBUG 245 | #define AS(expr, fmt, ...) 246 | #define AC(expr, fmt, ...) do { if(!(expr)) { fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0) 247 | #define die(fmt, ...) do { fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); exit(-1); } while(0) 248 | #else 249 | #define AS(expr, fmt, ...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0) 250 | #define AC(expr, fmt, ...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0) 251 | #define die(fmt, ...) do { fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); exit(-1); } while(0) 252 | #endif 253 | #else 254 | #ifdef NDEBUG 255 | #define AS(expr, fmt,args...) 256 | #define AC(expr, fmt,args...) do { if(!(expr)) { fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0) 257 | #define die(fmt,args...) do { fprintf(stderr, fmt, ## args ); fflush(stderr); exit(-1); } while(0) 258 | #else 259 | #define AS(expr, fmt,args...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0) 260 | #define AC(expr, fmt,args...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0) 261 | #define die(fmt,args...) 
do { fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); exit(-1); } while(0)
262 | #endif
263 | #endif
264 | 
265 | 
-------------------------------------------------------------------------------- /makefile: --------------------------------------------------------------------------------
1 | # powturbo (c) Copyright 2013-2019
2 | # ----------- Downloading + Compiling ----------------------
3 | # Download or clone TurboTranspose:
4 | # git clone git://github.com/powturbo/TurboTranspose.git
5 | # make
6 | 
7 | # Linux: "export CC=clang" "export CXX=clang++". windows mingw: "set CC=gcc" "set CXX=g++" or uncomment the CC,CXX lines
8 | CC ?= gcc
9 | CXX ?= g++
10 | #CC=clang-8
11 | #CXX=clang++-8
12 | 
13 | #CC = gcc-8
14 | #CXX = g++-8
15 | 
16 | #CC=powerpc64le-linux-gnu-gcc
17 | #CXX=powerpc64le-linux-gnu-g++
18 | 
19 | DDEBUG=-DNDEBUG -s
20 | #DDEBUG=-g
21 | 
22 | ifneq (,$(filter Windows%,$(OS)))
23 | OS := Windows
24 | CFLAGS+=-D__int64_t=int64_t
25 | else
26 | OS := $(shell uname -s)
27 | ARCH := $(shell uname -m)
28 | ifneq (,$(findstring powerpc64le,$(CC)))
29 | ARCH = ppc64le
30 | endif
31 | ifneq (,$(findstring aarch64,$(CC)))
32 | ARCH = aarch64
33 | endif
34 | endif
35 | 
36 | #------ ARMv8
37 | ifeq ($(ARCH),aarch64)
38 | CFLAGS+=-march=armv8-a
39 | ifneq (,$(findstring clang, $(CC)))
40 | MSSE=-O3 -mcpu=cortex-a72 -falign-loops -fomit-frame-pointer
41 | else
42 | MSSE=-O3 -mcpu=cortex-a72 -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer
43 | endif
44 | 
45 | else
46 | # ----- Power9
47 | ifeq ($(ARCH),ppc64le)
48 | MSSE=-D__SSE__ -D__SSE2__ -D__SSE3__ -D__SSSE3__
49 | MARCH=-march=power9 -mtune=power9
50 | CFLAGS+=-DNO_WARN_X86_INTRINSICS
51 | CXXFLAGS+=-DNO_WARN_X86_INTRINSICS
52 | #------ x86_64 : minimum SSE = Sandy Bridge, AVX2 = haswell
53 | else
54 | MSSE=-march=corei7-avx -mtune=corei7-avx
55 | # -mno-avx -mno-aes (add for Pentium based Sandy bridge)
56 | CFLAGS+=-mssse3
57 | MAVX2=-march=haswell
58 | endif
59 | endif
60 | 
61 | ifeq ($(OS),$(filter $(OS),Linux Darwin GNU/kFreeBSD GNU OpenBSD FreeBSD DragonFly NetBSD MSYS_NT Haiku))
62 | #LDFLAGS+=-lpthread -lm
63 | ifneq ($(OS),Darwin)
64 | LDFLAGS+=-lrt
65 | endif
66 | endif
67 | 
68 | # Minimum CPU architecture
69 | #MARCH=-march=native
70 | MARCH=$(MSSE)
71 | 
72 | ifeq ($(AVX2),1)
73 | MARCH+=-mbmi2 -mavx2
74 | CFLAGS+=-DUSE_AVX2
75 | CXXFLAGS+=-DUSE_AVX2
76 | else
77 | AVX2=0
78 | endif
79 | 
80 | #----------------------------------------------
81 | ifeq ($(STATIC),1)
82 | LDFLAGS+=-static
83 | endif
84 | 
85 | #---------------------- make args --------------------------
86 | ifeq ($(BLOSC),1)
87 | DEFS+=-DBLOSC
88 | endif
89 | 
90 | ifeq ($(LZ4),1)
91 | CFLAGS+=-DLZ4 -Ilz4/lib
92 | endif
93 | 
94 | ifeq ($(BITSHUFFLE),1)
95 | CFLAGS+=-DBITSHUFFLE -Iext/bitshuffle/lz4
96 | endif
97 | 
98 | OB=transpose.o tpbench.o
99 | 
100 | ifneq ($(NSIMD),1)
101 | OB+=transpose_sse.o
102 | CFLAGS+=-DUSE_SSE
103 | 
104 | ifeq ($(AVX2),1)
105 | MARCH+=-mavx2 -mbmi2
106 | CFLAGS+=-DUSE_AVX2
107 | OB+=transpose_avx2.o
108 | endif
109 | endif
110 | 
111 | CFLAGS+=$(DDEBUG) -w -Wall -std=gnu99 -DUSE_THREADS -fstrict-aliasing -Iext $(DEFS)
112 | CXXFLAGS+=$(DDEBUG) -w -fpermissive -Wall -fno-rtti -Iext/FastPFor/headers $(DEFS)
113 | 
114 | 
115 | all: tpbench
116 | 
117 | transpose.o: transpose.c
118 | $(CC) -O3 $(CFLAGS) $(COPT) -c -DUSE_SSE -falign-loops transpose.c -o transpose.o
119 | 
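# Example invocation (editor's illustration; AVX2/BLOSC are the flags defined in the
# "make args" section above, and BLOSC=1 assumes the c-blosc2 sources are checked out
# next to this makefile):
#   make AVX2=1 BLOSC=1 tpbench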
120 | transpose_sse.o: transpose.c
121 | $(CC) -O3 $(CFLAGS) $(COPT) -DSSE2_ON $(MSSE) -falign-loops -c transpose.c -o transpose_sse.o
122 | 
123 | transpose_avx2.o: transpose.c
124 | $(CC) -O3 $(CFLAGS) $(COPT) -DAVX2_ON $(MAVX2) -falign-loops -c transpose.c -o transpose_avx2.o
125 | 
126 | 
127 | #-------- BLOSC + BitShuffle -----------------------
128 | ifeq ($(BLOSC),1)
129 | LDFLAGS+=-lpthread
130 | 
131 | CFLAGS+=-DBLOSC
132 | #-DPREFER_EXTERNAL_LZ4=ON -DHAVE_LZ4 -DHAVE_LZ4HC -Ibitshuffle/lz4
133 | 
134 | c-blosc2/blosc/shuffle-sse2.o: c-blosc2/blosc/shuffle-sse2.c
135 | $(CC) -O3 $(CFLAGS) -msse2 -c c-blosc2/blosc/shuffle-sse2.c -o c-blosc2/blosc/shuffle-sse2.o
136 | 
137 | c-blosc2/blosc/shuffle-generic.o: c-blosc2/blosc/shuffle-generic.c
138 | $(CC) -O3 $(CFLAGS) -c c-blosc2/blosc/shuffle-generic.c -o c-blosc2/blosc/shuffle-generic.o
139 | 
140 | c-blosc2/blosc/shuffle-avx2.o: c-blosc2/blosc/shuffle-avx2.c
141 | $(CC) -O3 $(CFLAGS) -mavx2 -c c-blosc2/blosc/shuffle-avx2.c -o c-blosc2/blosc/shuffle-avx2.o
142 | 
143 | c-blosc2/blosc/shuffle-neon.o: c-blosc2/blosc/shuffle-neon.c
144 | $(CC) -O3 $(CFLAGS) -flax-vector-conversions -c c-blosc2/blosc/shuffle-neon.c -o c-blosc2/blosc/shuffle-neon.o
145 | 
146 | c-blosc2/blosc/bitshuffle-neon.o: c-blosc2/blosc/bitshuffle-neon.c
147 | $(CC) -O3 $(CFLAGS) -flax-vector-conversions -c c-blosc2/blosc/bitshuffle-neon.c -o c-blosc2/blosc/bitshuffle-neon.o
148 | 
149 | OB+=c-blosc2/blosc/blosc2.o c-blosc2/blosc/blosclz.o c-blosc2/blosc/shuffle.o c-blosc2/blosc/shuffle-generic.o \
150 | c-blosc2/blosc/bitshuffle-generic.o c-blosc2/blosc/btune.o c-blosc2/blosc/fastcopy.o c-blosc2/blosc/delta.o c-blosc2/blosc/timestamp.o c-blosc2/blosc/trunc-prec.o
151 | 
152 | ifeq ($(AVX2),1)
153 | CFLAGS+=-DSHUFFLE_AVX2_ENABLED
154 | OB+=c-blosc2/blosc/shuffle-avx2.o c-blosc2/blosc/bitshuffle-avx2.o
155 | endif
156 | ifeq ($(ARCH),aarch64)
157 | CFLAGS+=-DSHUFFLE_NEON_ENABLED
158 | OB+=c-blosc2/blosc/shuffle-neon.o c-blosc2/blosc/bitshuffle-neon.o
159 | else
160 | CFLAGS+=-DSHUFFLE_SSE2_ENABLED
161 | OB+=c-blosc2/blosc/bitshuffle-sse2.o c-blosc2/blosc/shuffle-sse2.o
162 | endif
163 | 
164 | else
165 | 
166 | ifeq ($(BITSHUFFLE),1)
167 | CFLAGS+=-DBITSHUFFLE -Ibitshuffle/lz4 -DLZ4_ON
168 | 
169 | ifeq ($(ARCH),aarch64)
170 | CFLAGS+=-DUSEARMNEON
171 | else
172 | ifeq ($(AVX2),1)
173 | CFLAGS+=-DUSEAVX2
174 | endif
175 | endif
176 | 
177 | OB+=bitshuffle/src/bitshuffle.o bitshuffle/src/iochain.o bitshuffle/src/bitshuffle_core.o
178 | OB+=bitshuffle/lz4/lz4.o
179 | endif
180 | 
181 | endif
182 | #---------------
183 | 
184 | tpbench: $(OB) tpbench.o transpose.o
185 | $(CC) $^ $(LDFLAGS) -o tpbench
186 | 
187 | .c.o:
188 | $(CC) -O3 $(MARCH) $(CFLAGS) $< -c -o $@
189 | 
190 | ifeq ($(OS),Windows_NT)
191 | clean:
192 | del /S *.o
193 | del /S *.exe
194 | else
195 | clean:
196 | @find . -type f -name "*\.o" -delete -or -name "*\~" -delete -or -name "core" -delete
197 | endif
198 | 
199 | 
-------------------------------------------------------------------------------- /makefile.vs: --------------------------------------------------------------------------------
1 | # powturbo (c) Copyright 2015-2018
2 | # nmake /f makefile.vs
3 | # or
4 | # nmake "AVX2=1" /f makefile.vs
5 | 
6 | .SUFFIXES: .c .obj .sobj
7 | 
8 | CC = cl
9 | LD = link
10 | AR = lib
11 | CFLAGS = /MD /O2 -I. 
12 | 13 | LIB_LIB = libtp.lib 14 | LIB_DLL = tp.dll 15 | LIB_IMP = tp.lib 16 | 17 | OBJS = transpose.obj 18 | 19 | !if "$(NSIMD)" == "1" 20 | !else 21 | OBJS = $(OBJS) transpose_sse.obj 22 | CFLAGS = $(CFLAGS) /DUSE_SSE /D__SSE2__ 23 | 24 | !IF "$(AVX2)" == "1" 25 | CFLAGS = $(CFLAGS) /DUSE_AVX2 26 | OBJS = $(OBJS) transpose_avx2.obj 27 | !endif 28 | 29 | !endif 30 | 31 | DLL_OBJS = $(OBJS:.obj=.sobj) 32 | 33 | all: $(LIB_LIB) $(LIB_DLL) tpbench.exe tpbenchdll.exe 34 | 35 | #$(LIB_DLL): $(LIB_IMP) 36 | 37 | transpose.obj: transpose.c 38 | $(CC) /O2 $(CFLAGS) /DUSE_SSE -c transpose.c /Fotranspose.obj 39 | 40 | transpose_sse.obj: transpose.c 41 | $(CC) /O2 $(CFLAGS) /DSSE2_ON /D__SSE2__ /arch:SSE2 /c transpose.c /Fotranspose_sse.obj 42 | 43 | transpose_avx2.obj: transpose.c 44 | $(CC) /O2 $(CFLAGS) /DAVX2_ON /D__AVX2__ /arch:avx2 /c transpose.c /Fotranspose_avx2.obj 45 | 46 | transpose.sobj: transpose.c 47 | $(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DUSE_SSE -c transpose.c /Fotranspose.sobj 48 | 49 | transpose_sse.sobj: transpose.c 50 | $(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DSSE2_ON /D__SSE2__ /arch:SSE2 /c transpose.c /Fotranspose_sse.sobj 51 | 52 | transpose_avx2.sobj: transpose.c 53 | $(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DAVX2_ON /D__AVX2__ /arch:avx2 /c transpose.c /Fotranspose_avx2.sobj 54 | 55 | tpbench.sobj: tpbench.c 56 | $(CC) /O2 $(CFLAGS) /DLIB_DLL -c tpbench.c /Fotpbench.sobj 57 | 58 | .c.obj: 59 | $(CC) -c /Fo$@ /O2 $(CFLAGS) $** 60 | 61 | .c.sobj: 62 | $(CC) -c /Fo$@ /O2 $(CFLAGS) /DLIB_DLL $** 63 | 64 | $(LIB_LIB): $(OBJS) 65 | $(AR) $(ARFLAGS) -out:$@ $(OBJS) 66 | 67 | $(LIB_DLL): $(DLL_OBJS) 68 | $(LD) $(LDFLAGS) -out:$@ -dll -implib:$(LIB_IMP) $(DLL_OBJS) 69 | 70 | $(LIB_IMP): $(LIB_DLL) 71 | 72 | tpbench.exe: tpbench.obj vs/getopt.obj $(LIB_LIB) 73 | $(LD) $(LDFLAGS) -out:$@ $** 74 | 75 | tpbenchdll.exe: tpbench.sobj vs/getopt.obj 76 | $(LD) $(LDFLAGS) -out:$@ $** tp.lib 77 | 78 | clean: 79 | -del *.dll *.exe *.exp *.lib *.obj *.sobj 2>nul 80 | -------------------------------------------------------------------------------- /sse_neon.h: -------------------------------------------------------------------------------- 1 | /** 2 | Copyright (C) powturbo 2013-2019 3 | GPL v2 License 4 | 5 | This program is free software; you can redistribute it and/or modify 6 | it under the terms of the GNU General Public License as published by 7 | the Free Software Foundation; either version 2 of the License, or 8 | (at your option) any later version. 9 | 10 | This program is distributed in the hope that it will be useful, 11 | but WITHOUT ANY WARRANTY; without even the implied warranty of 12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 13 | GNU General Public License for more details. 14 | 15 | You should have received a copy of the GNU General Public License along 16 | with this program; if not, write to the Free Software Foundation, Inc., 17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
18 | 
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // intel sse to arm neon
25 | 
26 | #ifndef _SSE_NEON_H_
27 | #define _SSE_NEON_H_
28 | #include "conf.h"
29 | 
30 | #ifdef __ARM_NEON //--------------------------------------------------------------------------------------------------
31 | #include <arm_neon.h>
32 | #define __m128i uint32x4_t
33 | 
34 | //#define USE_MACROS
35 | #define uint8x16_to_8x8x2(_a_) ((uint8x8x2_t) { vget_low_u8(_a_), vget_high_u8(_a_) })
36 | 
37 | #ifdef USE_MACROS //---------------------------- Set : _mm_set_epi/_mm_set1_epi ----------------------------------------------------------
38 | #define _mm_set_epi8(u15,u14,u13,u12,\
39 | u11,u10, u9, u8,\
40 | u7,u6,u5,u4,\
41 | u3,u2,u1,u0) ({ uint8_t __attribute__((aligned(16))) _u[16] = { u0,u1,u2,u3,u4,u5,u6,u7,u8,u9,u10,u11,u12,u13,u14,u15 }; (uint32x4_t)vld1q_u8( _u);})
42 | #define _mm_set_epi16( u7,u6,u5,u4,\
43 | u3,u2,u1,u0) ({ uint16_t __attribute__((aligned(16))) _u[ 8] = { u0,u1,u2,u3,u4,u5,u6,u7 }; (uint32x4_t)vld1q_u16(_u);})
44 | //#define _mm_set_epi32( u3,u2,u1,u0) ({ uint32_t __attribute__((aligned(16))) _u[ 4] = { u0,u1,u2,u3 }; vld1q_u32(_u);}) // superseded by the vcreate version below (was a duplicate #define)
45 | //#define _mm_set_epi64x( u1,u0) ({ uint64_t __attribute__((aligned(16))) _u[ 2] = { u0,u1 }; (uint32x4_t)vld1q_u64(_u);}) // superseded by the vcreate version below (was a duplicate #define)
46 | #define _mm_set_epi32(u3, u2, u1, u0) vcombine_u32(vcreate_u32((uint64_t)u1 << 32 | u0), vcreate_u32((uint64_t)u3 << 32 | u2))
47 | #define _mm_set_epi64x(u1, u0) (__m128i)vcombine_u64(vcreate_u64(u0), vcreate_u64(u1))
48 | #else
49 | static ALWAYS_INLINE __m128i _mm_set_epi8( uint8_t u15, uint8_t u14, uint8_t u13, uint8_t u12, uint8_t u11, uint8_t u10, uint8_t u9, uint8_t u8,
50 | uint8_t u7, uint8_t u6, uint8_t u5, uint8_t u4,
51 | uint8_t u3, uint8_t u2, uint8_t u1, uint8_t u0) {
52 | uint8_t __attribute__((aligned(16))) u[16] = { u0,u1,u2,u3,u4,u5,u6,u7,u8,u9,u10,u11,u12,u13,u14,u15 }; return (uint32x4_t)vld1q_u8( u); }
53 | static ALWAYS_INLINE __m128i _mm_set_epi16( uint16_t u7, uint16_t u6, uint16_t u5, uint16_t u4,
54 | uint16_t u3, uint16_t u2, uint16_t u1, uint16_t u0) { uint16_t __attribute__((aligned(16))) u[ 8] = { u0,u1,u2,u3,u4,u5,u6,u7 }; return (uint32x4_t)vld1q_u16(u); }
55 | static ALWAYS_INLINE __m128i _mm_set_epi32( uint32_t u3, uint32_t u2, uint32_t u1, uint32_t u0) { uint32_t __attribute__((aligned(16))) u[ 4] = { u0,u1,u2,u3 }; return vld1q_u32(u); }
56 | static ALWAYS_INLINE __m128i _mm_set_epi64x( uint64_t u1, uint64_t u0) { uint64_t __attribute__((aligned(16))) u[ 2] = { u0,u1 }; return (uint32x4_t)vld1q_u64(u); }
57 | #endif
58 | 
59 | #define _mm_set1_epi8( _u8_ ) (__m128i)vdupq_n_u8( _u8_ )
60 | #define _mm_set1_epi16( _u16_) (__m128i)vdupq_n_u16(_u16_)
61 | #define _mm_set1_epi32( _u32_) vdupq_n_u32(_u32_)
62 | #define _mm_set1_epi64x(_u64_) (__m128i)vdupq_n_u64(_u64_)
63 | #define _mm_setzero_si128() vdupq_n_u32( 0 )
64 | //---------------------------------------------- Arithmetic -----------------------------------------------------------------------
65 | #define _mm_add_epi8( _a_,_b_) (__m128i)vaddq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
66 | #define _mm_add_epi16( _a_,_b_) (__m128i)vaddq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
67 | #define _mm_add_epi32( _a_,_b_) vaddq_u32( _a_, _b_ )
68 | #define _mm_sub_epi16( _a_,_b_) (__m128i)vsubq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
69 | #define _mm_sub_epi32( _a_,_b_) 
(__m128i)vsubq_u32((uint32x4_t)(_a_), (uint32x4_t)(_b_))
70 | #define _mm_subs_epu8( _a_,_b_) (__m128i)vqsubq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
71 | 
72 | #define _mm_mullo_epi32(_a_,_b_) (__m128i)vmulq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
73 | #define mm_mullo_epu32(_a_,_b_) vmulq_u32(_a_,_b_)
74 | #define _mm_mul_epu32( _a_,_b_) (__m128i)vmull_u32(vget_low_u32(_a_),vget_low_u32(_b_))
75 | #define _mm_adds_epu16( _a_,_b_) (__m128i)vqaddq_u16((uint16x8_t)(_a_),(uint16x8_t)(_b_))
76 | static ALWAYS_INLINE __m128i _mm_madd_epi16(__m128i a, __m128i b) {
77 | int32x4_t mlo = vmull_s16(vget_low_s16( (int16x8_t)a), vget_low_s16( (int16x8_t)b));
78 | int32x4_t mhi = vmull_s16(vget_high_s16((int16x8_t)a), vget_high_s16((int16x8_t)b));
79 | int32x2_t alo = vpadd_s32(vget_low_s32(mlo), vget_high_s32(mlo));
80 | int32x2_t ahi = vpadd_s32(vget_low_s32(mhi), vget_high_s32(mhi));
81 | return (__m128i)vcombine_s32(alo, ahi);
82 | }
83 | //---------------------------------------------- Special math functions -----------------------------------------------------------
84 | #define _mm_min_epu8( _a_,_b_) (__m128i)vminq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
85 | #define _mm_min_epu16( _a_,_b_) (__m128i)vminq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
86 | #define _mm_min_epi16( _a_,_b_) (__m128i)vminq_s16((int16x8_t)(_a_), (int16x8_t)(_b_))
87 | //---------------------------------------------- Logical --------------------------------------------------------------------------
88 | #define mm_testnz_epu32(_a_) vmaxvq_u32(_a_) //vaddvq_u32(_a_)
89 | #define mm_testnz_epu8(_a_) vmaxv_u8(_a_)
90 | #define _mm_or_si128( _a_,_b_) (__m128i)vorrq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
91 | #define _mm_and_si128( _a_,_b_) (__m128i)vandq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
92 | #define _mm_xor_si128( _a_,_b_) (__m128i)veorq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
93 | //---------------------------------------------- Shift ----------------------------------------------------------------------------
94 | #define _mm_slli_epi16( _a_,_m_) (__m128i)vshlq_n_u16((uint16x8_t)(_a_), _m_)
95 | #define _mm_slli_epi32( _a_,_m_) (__m128i)vshlq_n_u32((uint32x4_t)(_a_), _m_)
96 | #define _mm_slli_epi64( _a_,_m_) (__m128i)vshlq_n_u64((uint64x2_t)(_a_), _m_)
97 | #define _mm_slli_si128( _a_,_m_) (__m128i)vextq_u8(vdupq_n_u8(0), (uint8x16_t)(_a_), 16 - (_m_) ) // _m_: 1 - 15
98 | 
99 | #define _mm_srli_epi16( _a_,_m_) (__m128i)vshrq_n_u16((uint16x8_t)(_a_), _m_)
100 | #define _mm_srli_epi32( _a_,_m_) (__m128i)vshrq_n_u32((uint32x4_t)(_a_), _m_)
101 | #define _mm_srli_epi64( _a_,_m_) (__m128i)vshrq_n_u64((uint64x2_t)(_a_), _m_)
102 | #define _mm_srli_si128( _a_,_m_) (__m128i)vextq_s8((int8x16_t)(_a_), vdupq_n_s8(0), (_m_))
103 | 
104 | #define _mm_srai_epi16( _a_,_m_) (__m128i)vshrq_n_s16((int16x8_t)(_a_), _m_)
105 | #define _mm_srai_epi32( _a_,_m_) (__m128i)vshrq_n_s32((int32x4_t)(_a_), _m_)
106 | #define _mm_srai_epi64( _a_,_m_) (__m128i)vshrq_n_s64((int64x2_t)(_a_), _m_)
107 | 
108 | #define _mm_sllv_epi32( _a_,_b_) (__m128i)vshlq_u32((uint32x4_t)(_a_), (uint32x4_t)(_b_))
109 | #define _mm_srlv_epi32( _a_,_b_) (__m128i)vshlq_u32((uint32x4_t)(_a_), vnegq_s32((int32x4_t)(_b_)))
110 | //---------------------------------------------- Compare --------- true/false->1/0 (all bits set) ---------------------------------
111 | #define _mm_cmpeq_epi8( _a_,_b_) (__m128i)vceqq_s8( ( int8x16_t)(_a_), ( int8x16_t)(_b_))
112 | #define _mm_cmpeq_epi16(_a_,_b_) (__m128i)vceqq_s16(( int16x8_t)(_a_), ( int16x8_t)(_b_))
113 | 
#define _mm_cmpeq_epi32(_a_,_b_) (__m128i)vceqq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
114 | 
115 | #define _mm_cmpgt_epi16(_a_,_b_) (__m128i)vcgtq_s16(( int16x8_t)(_a_), ( int16x8_t)(_b_))
116 | #define _mm_cmpgt_epi32(_a_,_b_) (__m128i)vcgtq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
117 | 
118 | #define _mm_cmpgt_epu16(_a_,_b_) (__m128i)vcgtq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
119 | #define mm_cmpgt_epu32(_a_,_b_) (__m128i)vcgtq_u32( _a_, _b_)
120 | //---------------------------------------------- Load -----------------------------------------------------------------------------
121 | #define _mm_loadl_epi64( _u64p_) (__m128i)vcombine_s32(vld1_s32((int32_t const *)(_u64p_)), vcreate_s32(0))
122 | #define mm_loadu_epi64p( _u64p_,_a_) (__m128i)vld1q_lane_u64((uint64_t *)(_u64p_), (uint64x2_t)(_a_), 0)
123 | #define _mm_loadu_si128( _ip_) vld1q_u32(_ip_)
124 | #define _mm_load_si128( _ip_) vld1q_u32(_ip_)
125 | //---------------------------------------------- Store ----------------------------------------------------------------------------
126 | #define _mm_storel_epi64(_ip_,_a_) vst1q_lane_u64((uint64_t *)(_ip_), (uint64x2_t)(_a_), 0)
127 | #define _mm_storeu_si128(_ip_,_a_) vst1q_u32((uint32_t *)(_ip_),_a_)
128 | //---------------------------------------------- Convert --------------------------------------------------------------------------
129 | #define mm_cvtsi64_si128p(_u64p_,_a_) mm_loadu_epi64p(_u64p_,_a_)
130 | #define _mm_cvtsi64_si128(_a_) (__m128i)vdupq_n_u64(_a_) //vld1q_s64(_a_)
131 | //---------------------------------------------- Reverse bits/bytes ---------------------------------------------------------------
132 | #define mm_rbit_epi8(a) (__m128i)vrbitq_u8( (uint8x16_t)(a)) // reverse bits
133 | #define mm_rev_epi16(a) vrev16q_u8((uint8x16_t)(a)) // reverse bytes
134 | #define mm_rev_epi32(a) vrev32q_u8((uint8x16_t)(a))
135 | #define mm_rev_epi64(a) vrev64q_u8((uint8x16_t)(a))
136 | //--------------------------------------------- Insert/extract --------------------------------------------------------------------
137 | #define mm_extract_epi32x(_a_,_u32_,_id_) vst1q_lane_u32((uint32_t *)&(_u32_), _a_, _id_)
138 | #define mm_extract_epi64x(_a_,_u64_,_id_) vst1q_lane_u64((uint64_t *)&(_u64_), (uint64x2_t)(_a_), _id_)
139 | 
140 | #define _mm_extract_epi8(_a_, _id_) vgetq_lane_u8( (uint8x16_t)(_a_), _id_)
141 | #define _mm_extract_epi16(_a_, _id_) vgetq_lane_u16((uint16x8_t)(_a_), _id_)
142 | #define _mm_extract_epi32(_a_, _id_) vgetq_lane_u32(_a_, _id_)
143 | #define mm_extract_epu32(_a_, _id_) vgetq_lane_u32(_a_, _id_)
144 | #define _mm_cvtsi128_si32(_a_) vgetq_lane_u32((uint32x4_t)(_a_),0)
145 | #define _mm_cvtsi128_si64(_a_) vgetq_lane_u64((uint64x2_t)(_a_),0)
146 | 
147 | #define _mm_insert_epu32p(_a_,_u32p_,_id_) vsetq_lane_u32(*(_u32p_), _a_, _id_)
148 | #define mm_insert_epi32p(_a_,_u32p_,_id_) vld1q_lane_u32(_u32p_, (uint32x4_t)(_a_), _id_)
149 | #define _mm_cvtsi32_si128(_a_) (__m128i)vsetq_lane_s32(_a_, vdupq_n_s32(0), 0)
150 | 
151 | #define _mm_blendv_epi8(_a_,_b_,_m_) vbslq_u32(_m_,_b_,_a_)
152 | //---------------------------------------------- Miscellaneous --------------------------------------------------------------------
153 | #define _mm_alignr_epi8(_a_,_b_,_m_) (__m128i)vextq_u8( (uint8x16_t)(_b_), (uint8x16_t)(_a_), _m_)
154 | #define _mm_packs_epi16( _a_,_b_) (__m128i)vcombine_s8( vqmovn_s16((int16x8_t)(_a_)), vqmovn_s16((int16x8_t)(_b_)))
155 | #define _mm_packs_epi32( _a_,_b_) (__m128i)vcombine_s16(vqmovn_s32((int32x4_t)(_a_)), vqmovn_s32((int32x4_t)(_b_)))
156 | 
157 | #define _mm_packs_epu16( _a_,_b_) (__m128i)vcombine_u8(vqmovn_u16((uint16x8_t)(_a_)), vqmovn_u16((uint16x8_t)(_b_)))
158 | #define _mm_packus_epi16( _a_,_b_) (__m128i)vcombine_u8(vqmovun_s16((int16x8_t)(_a_)), vqmovun_s16((int16x8_t)(_b_)))
159 | 
160 | static ALWAYS_INLINE uint16_t _mm_movemask_epi8(__m128i v) {
161 | const uint8x16_t __attribute__ ((aligned (16))) m = {1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7, 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7};
162 | uint8x16_t mv = (uint8x16_t)vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(vcltq_s8((int8x16_t)v, vdupq_n_s8(0)), m))));
163 | return vgetq_lane_u8(mv, 8) << 8 | vgetq_lane_u8(mv, 0);
164 | }
165 | //-------- Neon movemask ------ All lanes must be 0 or -1 (=0xff, 0xffff or 0xffffffff)
166 | #ifdef __aarch64__
167 | static ALWAYS_INLINE uint8_t mm_movemask_epi8s(uint8x8_t sv) { const uint8x8_t m = { 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<< 5, 1<< 6, 1<<7 }; return vaddv_u8( vand_u8( sv, m)); } // short only ARM
168 | //static ALWAYS_INLINE uint16_t mm_movemask_epu16(uint32x4_t v) { const uint16x8_t m = { 1, 1<<2, 1<<4, 1<<6, 1<<8, 1<<10, 1<<12, 1<<14}; return vaddvq_u16(vandq_u16((uint16x8_t)v, m)); }
169 | static ALWAYS_INLINE uint16_t mm_movemask_epu16(__m128i v) { const uint16x8_t m = { 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<< 5, 1<< 6, 1<<7 }; return vaddvq_u16(vandq_u16((uint16x8_t)v, m)); }
170 | static ALWAYS_INLINE uint32_t mm_movemask_epu32(__m128i v) { const uint32x4_t m = { 1, 1<<1, 1<<2, 1<<3 }; return vaddvq_u32(vandq_u32((uint32x4_t)v, m)); }
171 | static ALWAYS_INLINE uint64_t mm_movemask_epu64(__m128i v) { const uint64x2_t m = { 1, 1<<1 }; return vaddvq_u64(vandq_u64((uint64x2_t)v, m)); }
172 | #else
173 | static ALWAYS_INLINE uint32_t mm_movemask_epu32(uint32x4_t v) { const uint32x4_t mask = {1,2,4,8}, av = vandq_u32(v, mask), xv = vextq_u32(av, av, 2), ov = vorrq_u32(av, xv); return vgetq_lane_u32(vorrq_u32(ov, vextq_u32(ov, ov, 3)), 0); }
174 | #endif
175 | // --------------------------------------------- Swizzle : _mm_shuffle_epi8 / _mm_shuffle_epi32 / Pack/Unpack -----------------------------------------
176 | #define _MM_SHUFFLE(u3,u2,u1,u0) ((u3) << 6 | (u2) << 4 | (u1) << 2 | (u0))
177 | 
178 | #define _mm_shuffle_epi8(_a_, _b_) (__m128i)vqtbl1q_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
179 | #if defined(__aarch64__)
180 | #define mm_shuffle_nnnn_epi32(_a_,_m_) (__m128i)vdupq_laneq_u32(_a_, _m_)
181 | #else
182 | #define mm_shuffle_nnnn_epi32(_a_,_m_) (__m128i)vdupq_n_u32(vgetq_lane_u32(_a_, _m_))
183 | #endif
184 | 
185 | #ifdef USE_MACROS
186 | #define mm_shuffle_2031_epi32(_a_) ({ uint32x4_t _zv = (uint32x4_t)vrev64q_u32(_a_); uint32x2x2_t _tv = vtrn_u32(vget_low_u32(_zv), vget_high_u32(_zv)); vcombine_u32(_tv.val[0], _tv.val[1]);})
187 | #define mm_shuffle_3120_epi32(_a_) ({ uint32x2x2_t _tv = vtrn_u32(vget_low_u32(_a_), vget_high_u32(_a_)); vcombine_u32(_tv.val[0], _tv.val[1]);})
188 | #else
189 | static ALWAYS_INLINE __m128i mm_shuffle_2031_epi32(__m128i a) { uint32x4_t v = (uint32x4_t)vrev64q_u32(a); uint32x2x2_t z = vtrn_u32(vget_low_u32(v), vget_high_u32(v)); return vcombine_u32(z.val[0], z.val[1]);}
190 | static ALWAYS_INLINE __m128i mm_shuffle_3120_epi32(__m128i a) { uint32x2x2_t z = vtrn_u32(vget_low_u32(a), vget_high_u32(a)); return vcombine_u32(z.val[0], z.val[1]);}
191 | #endif
192 | 
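// Editor's note (illustrative): _MM_SHUFFLE packs four 2-bit source-lane indices,
// highest destination lane first, so result lane i = source lane ((m >> 2*i) & 3).
// E.g. _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,3,3)) broadcasts lane 3 to all lanes,
// which is what mm_shuffle_nnnn_epi32(v, 3) above expands to.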
>> 2) & 0x3), _v, 1);\ 197 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 4) & 0x3), _v, 2);\ 198 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 6) & 0x3), _v, 3); _v;\ 199 | }) 200 | #define _mm_shuffle_epi32s(_a_, _m_) _mm_set_epi32(vgetq_lane_u32(_a_, ((_m_) ) & 0x3),\ 201 | vgetq_lane_u32(_a_, ((_m_) >> 2) & 0x3),\ 202 | vgetq_lane_u32(_a_, ((_m_) >> 4) & 0x3),\ 203 | vgetq_lane_u32(_a_, ((_m_) >> 6) & 0x3)) 204 | #else 205 | static ALWAYS_INLINE __m128i _mm_shuffle_epi32(__m128i _a_, const unsigned _m_) { const uint32x4_t _av =_a_; 206 | uint32x4_t _v = vmovq_n_u32(vgetq_lane_u32(_av, (_m_) & 0x3)); 207 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 2) & 0x3), _v, 1); 208 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 4) & 0x3), _v, 2); 209 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 6) & 0x3), _v, 3); 210 | return _v; 211 | } 212 | static ALWAYS_INLINE __m128i _mm_shuffle_epi32s(__m128i _a_, const unsigned _m_) { 213 | return _mm_set_epi32(vgetq_lane_u32(_a_, ((_m_) ) & 0x3), 214 | vgetq_lane_u32(_a_, ((_m_) >> 2) & 0x3), 215 | vgetq_lane_u32(_a_, ((_m_) >> 4) & 0x3), 216 | vgetq_lane_u32(_a_, ((_m_) >> 6) & 0x3)); 217 | } 218 | #endif 219 | #ifdef USE_MACROS 220 | #define _mm_unpacklo_epi8( _a_,_b_) ({ uint8x8x2_t _zv = vzip_u8 ( vget_low_u8( (uint8x16_t)(_a_)), vget_low_u8 ((uint8x16_t)(_b_))); (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);}) 221 | #define _mm_unpacklo_epi16(_a_,_b_) ({ uint16x4x2_t _zv = vzip_u16( vget_low_u16((uint16x8_t)(_a_)), vget_low_u16((uint16x8_t)(_b_))); (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);}) 222 | #define _mm_unpacklo_epi32(_a_,_b_) ({ uint32x2x2_t _zv = vzip_u32( vget_low_u32( _a_ ), vget_low_u32( _b_ )); vcombine_u32(_zv.val[0], _zv.val[1]);}) 223 | #define _mm_unpacklo_epi64(_a_,_b_) (uint32x4_t)vcombine_u64(vget_low_u64((uint64x2_t)(_a_)), vget_low_u64((uint64x2_t)(_b_))) 224 | 225 | #define _mm_unpackhi_epi8( _a_,_b_) ({ uint8x8x2_t _zv = vzip_u8 (vget_high_u8( (uint8x16_t)(_a_)), vget_high_u8( (uint8x16_t)(_b_))); (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);}) 226 | #define _mm_unpackhi_epi16(_a_,_b_) ({ uint16x4x2_t _zv = vzip_u16(vget_high_u16((uint16x8_t)(_a_)), vget_high_u16((uint16x8_t)(_b_))); (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);}) 227 | #define _mm_unpackhi_epi32(_a_,_b_) ({ uint32x2x2_t _zv = vzip_u32(vget_high_u32( _a_ ), vget_high_u32( _b_ )); vcombine_u32(_zv.val[0], _zv.val[1]);}) 228 | #define _mm_unpackhi_epi64(_a_,_b_) (uint32x4_t)vcombine_u64(vget_high_u64((uint64x2_t)(_a_)), vget_high_u64((uint64x2_t)(_b_))) 229 | #else 230 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi8( __m128i _a_, __m128i _b_) { uint8x8x2_t _zv = vzip_u8 ( vget_low_u8( (uint8x16_t)(_a_)), vget_low_u8 ((uint8x16_t)(_b_))); return (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);} 231 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi16(__m128i _a_, __m128i _b_) { uint16x4x2_t _zv = vzip_u16( vget_low_u16((uint16x8_t)(_a_)), vget_low_u16((uint16x8_t)(_b_))); return (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);} 232 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi32(__m128i _a_, __m128i _b_) { uint32x2x2_t _zv = vzip_u32( vget_low_u32( _a_ ), vget_low_u32( _b_ )); return vcombine_u32(_zv.val[0], _zv.val[1]);} 233 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi64(__m128i _a_, __m128i _b_) { return (uint32x4_t)vcombine_u64(vget_low_u64((uint64x2_t)(_a_)), vget_low_u64((uint64x2_t)(_b_))); } 234 | 235 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi8( __m128i _a_, __m128i _b_) { uint8x8x2_t _zv = vzip_u8 
(vget_high_u8( (uint8x16_t)(_a_)), vget_high_u8( (uint8x16_t)(_b_))); return (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]); } 236 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi16(__m128i _a_, __m128i _b_) { uint16x4x2_t _zv = vzip_u16(vget_high_u16((uint16x8_t)(_a_)), vget_high_u16((uint16x8_t)(_b_))); return (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]); } 237 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi32(__m128i _a_, __m128i _b_) { uint32x2x2_t _zv = vzip_u32(vget_high_u32( _a_ ), vget_high_u32( _b_ )); return vcombine_u32(_zv.val[0], _zv.val[1]); } 238 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi64(__m128i _a_, __m128i _b_) { return (uint32x4_t)vcombine_u64(vget_high_u64((uint64x2_t)(_a_)), vget_high_u64((uint64x2_t)(_b_))); } 239 | #endif 240 | 241 | #else //------------------------------------- intel SSE2/SSSE3 -------------------------------------------------------------- 242 | #define mm_movemask_epu32(_a_) _mm_movemask_ps(_mm_castsi128_ps(_a_)) 243 | #define mm_movemask_epu16(_a_) _mm_movemask_epi8(_a_) 244 | #define mm_loadu_epi64p( _u64p_,_a_) _a_ = _mm_cvtsi64_si128(ctou64(_u64p_)) 245 | 246 | #define mm_extract_epu32( _a_, _id_) _mm_extract_epi32(_a_, _id_) 247 | #define mm_extract_epi32x(_a_,_u32_, _id_) _u32_ = _mm_extract_epi32(_a_, _id_) 248 | #define mm_extract_epi64x(_a_,_u64_, _id_) _u64_ = _mm_extract_epi64(_a_, _id_) 249 | #define mm_insert_epi32p( _a_,_u32p_,_c_) _mm_insert_epi32( _a_,ctou32(_u32p_),_c_) 250 | 251 | #define mm_mullo_epu32( _a_,_b_) _mm_mullo_epi32(_a_,_b_) 252 | #define mm_cvtsi64_si128p(_u64p_,_a_) _a_ = _mm_cvtsi64_si128(ctou64(_u64p_)) 253 | 254 | #define mm_cmpgt_epu32( _a_, _b_) _mm_cmpgt_epi32(_mm_xor_si128(_a_, cv80000000), _mm_xor_si128(_b_, cv80000000)) 255 | 256 | #define mm_shuffle_nnnn_epi32(_a_, _n_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(_n_,_n_,_n_,_n_)) 257 | #define mm_shuffle_2031_epi32(_a_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(2,0,3,1)) 258 | #define mm_shuffle_3120_epi32(_a_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(3,1,2,0)) 259 | 260 | static ALWAYS_INLINE __m128i mm_rbit_epi8(__m128i v) { // reverse bits in bytes 261 | __m128i fv = _mm_set_epi8(15, 7,11, 3,13, 5, 9, 1,14, 6,10, 2,12, 4, 8, 0), cv0f_8 = _mm_set1_epi8(0xf); 262 | __m128i lv = _mm_shuffle_epi8(fv,_mm_and_si128( v, cv0f_8)); 263 | __m128i hv = _mm_shuffle_epi8(fv,_mm_and_si128(_mm_srli_epi64(v, 4), cv0f_8)); 264 | return _mm_or_si128(_mm_slli_epi64(lv,4), hv); 265 | } 266 | 267 | static ALWAYS_INLINE __m128i mm_rev_epi16(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8(14,15,12,13,10,11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1)); } // reverse vector bytes in uint??_t 268 | static ALWAYS_INLINE __m128i mm_rev_epi32(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8(12,13,14,15, 8, 9,10,11, 4, 5, 6, 7, 0, 1, 2, 3)); } 269 | static ALWAYS_INLINE __m128i mm_rev_epi64(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8( 8, 9,10,11,12,13,14,15, 0, 1, 2, 3, 4, 5, 6, 7)); } 270 | static ALWAYS_INLINE __m128i mm_rev_si128(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15)); } 271 | #endif 272 | #endif 273 | -------------------------------------------------------------------------------- /time_.h: -------------------------------------------------------------------------------- 1 | /** 2 | Copyright (C) powturbo 2013-2019 3 | GPL v2 License 4 | 5 | This program is free software; you can redistribute it and/or modify 6 | it under the terms of the GNU General Public License as published by 7 | the Free Software Foundation; either 
version 2 of the License, or
8 | (at your option) any later version.
9 | 
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 | 
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 | 
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // time_.h : time functions
25 | #include <time.h>
26 | 
27 | #ifdef _WIN32
28 | #include <windows.h>
29 | #ifndef sleep
30 | #define sleep(n) Sleep((n) * 1000)
31 | #endif
32 | typedef unsigned __int64 uint64_t;
33 | typedef unsigned __int64 tm_t;
34 | #else
35 | #include <unistd.h>
36 | #include <stdint.h>
37 | typedef uint64_t tm_t;
38 | #define Sleep(ms) usleep((ms) * 1000)
39 | #endif
40 | 
41 | #if defined (__i386__) || defined( __x86_64__ )
42 | #ifdef _MSC_VER
43 | #include <intrin.h> // __rdtsc
44 | #else
45 | #include <x86intrin.h>
46 | #endif
47 | 
48 | #ifdef __corei7__
49 | #define RDTSC_INI(_c_) do { unsigned _cl, _ch; \
50 | __asm volatile ("cpuid\n\t" \
51 | "rdtsc\n\t" \
52 | "mov %%edx, %0\n" \
53 | "mov %%eax, %1\n": "=r" (_ch), "=r" (_cl):: \
54 | "%rax", "%rbx", "%rcx", "%rdx"); \
55 | _c_ = (uint64_t)_ch << 32 | _cl; \
56 | } while(0)
57 | 
58 | #define RDTSC(_c_) do { unsigned _cl, _ch; \
59 | __asm volatile("rdtscp\n" \
60 | "mov %%edx, %0\n" \
61 | "mov %%eax, %1\n" \
62 | "cpuid\n\t": "=r" (_ch), "=r" (_cl):: "%rax",\
63 | "%rbx", "%rcx", "%rdx");\
64 | _c_ = (uint64_t)_ch << 32 | _cl;\
65 | } while(0)
66 | #else
67 | #define RDTSC(_c_) do { unsigned _cl, _ch;\
68 | __asm volatile ("cpuid \n"\
69 | "rdtsc"\
70 | : "=a"(_cl), "=d"(_ch)\
71 | : "a"(0)\
72 | : "%ebx", "%ecx");\
73 | _c_ = (uint64_t)_ch << 32 | _cl;\
74 | } while(0)
75 | #define RDTSC_INI(_c_) RDTSC(_c_)
76 | #endif
77 | #else
78 | #define RDTSC_INI(_c_)
79 | #define RDTSC(_c_)
80 | #endif
81 | 
82 | #define tmrdtscini() ({ tm_t _c; __asm volatile("" ::: "memory"); RDTSC_INI(_c); _c; })
83 | #define tmrdtsc() ({ tm_t _c; RDTSC(_c); _c; })
84 | 
85 | #ifndef TM_F
86 | #define TM_F 1.0 // TM_F=4 -> MI/s
87 | #endif
88 | 
89 | #ifdef RDTSC_ON
90 | #define tminit() tmrdtscini()
91 | #define tmtime() tmrdtsc()
92 | #define TM_T CLOCKS_PER_SEC
93 | static double TMBS(unsigned l, tm_t t) { double dt=t,dl=l; return dt/dl; } // cycles per byte (double division; the integer t/l would truncate)
94 | #define TM_C 1000
95 | #else
96 | #define TM_T 1000000.0
97 | #define TM_C 1
98 | static double TMBS(unsigned l, tm_t tm) { double dl=l,dt=tm; return dt>=0.000001?(dl/(1000000.0*TM_F))/(dt/TM_T):0.0; }
99 | #ifdef _WIN32
100 | static LARGE_INTEGER tps;
101 | static tm_t tmtime(void) {
102 | LARGE_INTEGER tm;
103 | tm_t t;
104 | double d;
105 | QueryPerformanceCounter(&tm);
106 | d = tm.QuadPart;
107 | t = d*1000000.0/tps.QuadPart;
108 | return t;
109 | }
110 | 
111 | static tm_t tminit() { tm_t t0,ts; QueryPerformanceFrequency(&tps); t0 = tmtime(); while((ts = tmtime())==t0); return ts; }
112 | #else
113 | #ifdef __APPLE__
114 | #include <AvailabilityMacros.h>
115 | #ifndef MAC_OS_X_VERSION_10_12
116 | #define MAC_OS_X_VERSION_10_12 101200
117 | #endif
118 | #define CIVETWEB_APPLE_HAVE_CLOCK_GETTIME defined(__APPLE__) && MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_X_VERSION_10_12
119 | #if !(CIVETWEB_APPLE_HAVE_CLOCK_GETTIME)
120 | 
#include <sys/time.h>
121 | #define CLOCK_REALTIME 0
122 | #define CLOCK_MONOTONIC 0
123 | int clock_gettime(int /*clk_id*/, struct timespec* t) {
124 | struct timeval now;
125 | int rv = gettimeofday(&now, NULL);
126 | if (rv) return rv;
127 | t->tv_sec = now.tv_sec;
128 | t->tv_nsec = now.tv_usec * 1000;
129 | return 0;
130 | }
131 | #endif
132 | #endif
133 | static tm_t tmtime(void) { struct timespec tm; clock_gettime(CLOCK_MONOTONIC, &tm); return (tm_t)tm.tv_sec*1000000 + tm.tv_nsec/1000; }
134 | static tm_t tminit() { tm_t t0=tmtime(),ts; while((ts = tmtime())==t0); return ts; }
135 | #endif
136 | static double tmsec( tm_t tm) { double d = tm; return d/1000000.0; }
137 | static double tmmsec(tm_t tm) { double d = tm; return d/1000.0; }
138 | #endif
139 | //---------------------------------------- bench ----------------------------------------------------------------------
140 | #define TM_TX TM_T
141 | 
142 | #define TMSLEEP do { tm_T = tmtime(); if(!tm_0) tm_0 = tm_T; else if(tm_T - tm_0 > tm_TX) { if(tm_verbose) { printf("S \b\b");fflush(stdout);} sleep(tm_slp); tm_0=tmtime();} } while(0)
143 | 
144 | #define TMBEG(_tm_reps_, _tm_Reps_) { unsigned _tm_r,_tm_c=0,_tm_R; tm_t _tm_t0,_tm_t,_tm_ts;\
145 | for(tm_rm = _tm_reps_, tm_tm = (tm_t)1<<63,_tm_R = 0,_tm_ts=tmtime(); _tm_R < _tm_Reps_; _tm_R++) { tm_t _tm_t0 = tminit();\
146 | for(_tm_r=0;_tm_r < tm_rm;) {
147 | 
148 | #define TMEND(_len_) _tm_r++; if((_tm_t = tmtime() - _tm_t0) > tm_tx) break; } \
149 | if(_tm_t < tm_tm) { if(tm_tm == (tm_t)1<<63) tm_rm = _tm_r; tm_tm = _tm_t; _tm_c++; } \
150 | else if(_tm_t>tm_tm*1.2) TMSLEEP; if(tm_verbose) { double d = tm_tm*TM_C,dr=tm_rm; printf("%8.2f %2d_%.2d\b\b\b\b\b\b\b\b\b\b\b\b\b\b",TMBS(_len_, d/dr),_tm_R+1,_tm_c),fflush(stdout); }\
151 | if(tmtime()-_tm_ts > tm_TX && _tm_R < tm_RepMin) break;\
152 | if((_tm_R & 7)==7) sleep(tm_slp),_tm_ts=tmtime(); } }
153 | 
154 | static unsigned tm_rep = 1<<20, tm_Rep = 3, tm_rep2 = 1<<20, tm_Rep2 = 4, tm_slp = 20, tm_rm;
155 | static tm_t tm_tx = TM_T, tm_TX = 120*TM_T, tm_RepMin=1, tm_0, tm_T, tm_verbose=2, tm_tm;
156 | static void tm_init(int _tm_Rep, int _tm_verbose) { tm_verbose = _tm_verbose; if(_tm_Rep) tm_Rep = _tm_Rep; tm_tx = tminit(); Sleep(500); tm_tx = tmtime() - tm_tx; tm_TX = 10*tm_tx; }
157 | 
158 | #define TMBENCH(_name_, _func_, _len_) do { if(tm_verbose>1) printf("%s ", _name_?_name_:#_func_); TMBEG(tm_rep, tm_Rep) _func_; TMEND(_len_); { double dm = tm_tm,dr=tm_rm; if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_, dm*TM_C/dr) );} } while(0)
159 | #define TMBENCH2(_name_, _func_, _len_) do { TMBEG(tm_rep2, tm_Rep2) _func_; TMEND(_len_); { double dm = tm_tm,dr=tm_rm; if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_, dm*TM_C/dr) );} if(tm_verbose>1) printf("%s ", _name_?_name_:#_func_); } while(0)
160 | #define TMBENCHT(_name_,_func_, _len_, _res_) do { TMBEG(tm_rep, tm_Rep) if(_func_ != _res_) { printf("ERROR: %lld != %lld", (long long)_func_, (long long)_res_ ); exit(0); }; TMEND(_len_); if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_,(double)tm_tm*TM_C/(double)tm_rm) ); if(tm_verbose) printf("%s ", _name_?_name_:#_func_ ); } while(0)
161 | 
162 | #define Kb (1u<<10)
163 | #define Mb (1u<<20)
164 | #define Gb (1u<<30)
165 | #define KB 1000
166 | #define MB 1000000
167 | #define GB 1000000000
168 | 
169 | static unsigned argtoi(char *s, unsigned def) {
170 | char *p;
171 | unsigned n = strtol(s, &p, 10),f = 1;
172 | switch(*p) {
173 | case 'K': f = KB; break;
174 | case 'M': f = MB; break;
175 | case 'G': f = GB; break;
176 | case 'k': 
f = Kb; break;
177 | case 'm': f = Mb; break;
178 | case 'g': f = Gb; break;
179 | case 'b': def = 0;
180 | default: if(!def) return n>=32?0xffffffffu:(1u << n); f = def;
181 | }
182 | return n*f;
183 | }
184 | static uint64_t argtol(char *s) {
185 | char *p;
186 | uint64_t n = strtol(s, &p, 10),f=1;
187 | switch(*p) {
188 | case 'K': f = KB; break;
189 | case 'M': f = MB; break;
190 | case 'G': f = GB; break;
191 | case 'k': f = Kb; break;
192 | case 'm': f = Mb; break;
193 | case 'g': f = Gb; break;
194 | case 'b': return (uint64_t)1 << n; // 64-bit shift: a plain 1u << n overflows for n >= 32
195 | default: f = MB;
196 | }
197 | return n*f;
198 | }
199 | 
200 | static uint64_t argtot(char *s) {
201 | char *p;
202 | uint64_t n = strtol(s, &p, 10),f=1;
203 | switch(*p) {
204 | case 'h': f = 3600000; break;
205 | case 'm': f = 60000; break;
206 | case 's': f = 1000; break;
207 | case 'M': f = 1; break;
208 | default: f = 1000;
209 | }
210 | return n*f;
211 | }
212 | 
213 | static void memrcpy(unsigned char *out, unsigned char *in, unsigned n) { int i; for(i = 0; i < n; i++) out[i] = ~in[i]; } // complemented copy: poisons the destination so a stale buffer can never compare equal after decoding
-------------------------------------------------------------------------------- /tpbench.c: --------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2018
3 | GPL v2 License
4 | 
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 | 
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 | 
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
18 | 
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | #include <stdio.h>
25 | #include <stdlib.h>
26 | 
27 | #ifdef __APPLE__
28 | #include <malloc/malloc.h>
29 | #else
30 | #include <malloc.h>
31 | #endif
32 | #ifdef _MSC_VER
33 | #include "vs/getopt.h"
34 | #else
35 | #include <getopt.h>
36 | #endif
37 | 
38 | #include "conf.h"
39 | //#define RDTSC_ON
40 | #include "time_.h"
41 | 
42 | #include "transpose.h"
43 | 
44 | #ifdef BITSHUFFLE
45 | #include "bitshuffle/src/bitshuffle.h"
46 | #include "bitshuffle/lz4/lz4.h"
47 | #endif
48 | 
49 | #ifdef BLOSC
50 | #include "c-blosc2/blosc/shuffle.h"
51 | #include "c-blosc2/blosc/blosc2.h"
52 | #endif
53 | 
54 | int memcheck(unsigned char *in, unsigned n, unsigned char *cpy) {
55 | int i;
56 | for(i = 0; i < n; i++)
57 | if(in[i] != cpy[i]) {
58 | printf("ERROR in[%d]=%x, dec[%d]=%x\n", i, in[i], i, cpy[i]);
59 | return i+1;
60 | }
61 | return 0;
62 | }
63 | 
64 | #ifdef LZ4_ON
65 | #ifdef USE_SSE
66 | unsigned tp4lz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
67 | tp4enc(in, n, tmp, esize);
68 | return LZ4_compress((char *)tmp, (char *)out, n);
69 | }
70 | 
71 | unsigned tp4lz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
72 | unsigned rc;
73 | rc = LZ4_decompress_fast((char *)in, (char *)tmp, n); // assign the result; rc was previously returned uninitialized
74 | tp4dec(tmp, n, (unsigned char *)out, esize);
75 | return rc;
76 | }
77 | #endif
78 | 
79 | unsigned tplz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
80 | tpenc(in, n, tmp, esize);
81 | return LZ4_compress((char *)tmp, (char *)out, n);
82 | }
83 | 
84 | unsigned tplz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
85 | unsigned rc;
86 | rc = LZ4_decompress_fast((char *)in, (char *)tmp, n);
87 | tpdec(tmp, n, (unsigned char *)out, esize);
88 | return rc;
89 | }
90 | #endif
91 | 
92 | #ifdef BITSHUFFLE
93 | #define BITSHUFFLE(in,n,out,esize) bshuf_bitshuffle( in, out, (n)/esize, esize, 0); memcpy((char *)out+((n)&(~(8*esize-1))),(char *)in+((n)&(~(8*esize-1))),(n)&(8*esize-1))
94 | #define BITUNSHUFFLE(in,n,out,esize) bshuf_bitunshuffle(in, out, (n)/esize, esize, 0); memcpy((char *)out+((n)&(~(8*esize-1))),(char *)in+((n)&(~(8*esize-1))),(n)&(8*esize-1))
95 | 
96 | unsigned bslz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
97 | BITSHUFFLE(in, n, tmp, esize);
98 | return LZ4_compress((char *)tmp, (char *)out, n);
99 | }
100 | 
101 | unsigned bslz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
102 | unsigned rc;
103 | rc = LZ4_decompress_fast((char *)in, (char *)tmp, n);
104 | BITUNSHUFFLE(tmp, n, (unsigned char *)out, esize);
105 | return rc;
106 | }
107 | #endif
108 | 
109 | #define ID_MEMCPY 7
110 | void bench(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *cpy, int id) {
111 | memrcpy(cpy,in,n);
112 | 
113 | switch(id) {
114 | case 1: { TMBENCH("", tpenc(in, n,out,esize) ,n); TMBENCH2("tp_byte ",tpdec(out,n,cpy,esize) ,n); } break;
115 | #ifdef USE_SSE
116 | case 2: { TMBENCH("", tp4enc(in,n,out,esize) ,n); TMBENCH2("tp_nibble ",tp4dec(out,n,cpy,esize) ,n); } break;
117 | #endif
118 | #ifdef BLOSC
119 | case 3: { TMBENCH("",shuffle(esize,n,in,out), n); TMBENCH2("blosc shuffle ",unshuffle(esize,n,out,cpy), n); } break;
120 | case 4: { unsigned char *tmp = malloc(n); TMBENCH("",bitshuffle(esize,n,in,out,tmp), n); 
TMBENCH2("blosc bitshuffle ",bitunshuffle(esize,n,out,cpy,tmp), n); free(tmp); } break; 121 | #endif 122 | #ifdef BITSHUFFLE 123 | case 5: { TMBENCH("",bshuf_bitshuffle(in,out,(n)/esize,esize,0), n); TMBENCH2("bitshuffle ",bshuf_bitunshuffle(out,cpy,(n)/esize,esize,0), n); } break; 124 | #endif 125 | case 6: TMBENCH("",memcpy(in,out,n) ,n); TMBENCH2("memcpy ",memcpy(cpy,out,n) ,n); break; 126 | case 7: 127 | switch(esize) { 128 | case 2: { TMBENCH("", tpenc2( in, n,out) ,n); TMBENCH2("tp_byte2 scalar", tpdec2( out,n,cpy) ,n); } break; 129 | case 4: { TMBENCH("", tpenc4( in, n,out) ,n); TMBENCH2("tp_byte4 scalar", tpdec4( out,n,cpy) ,n); } break; 130 | case 8: { TMBENCH("", tpenc8( in, n,out) ,n); TMBENCH2("tp_byte8 scalar", tpdec8( out,n,cpy) ,n); } break; 131 | case 16: { TMBENCH("", tpenc16(in, n,out) ,n); TMBENCH2("tp_byte16 scalar",tpdec16(out,n,cpy) ,n); } break; 132 | } 133 | break; 134 | default: return; 135 | } 136 | printf("\n"); 137 | memcheck(in,n,cpy); 138 | } 139 | 140 | void usage(char *pgm) { 141 | fprintf(stderr, "\nTPBench Copyright (c) 2013-2019 Powturbo %s\n", __DATE__); 142 | fprintf(stderr, "Usage: %s [options] [file]\n", pgm); 143 | fprintf(stderr, " -e# # = function ids separated by ',' or ranges '#-#' (default='1-%d')\n", ID_MEMCPY); 144 | fprintf(stderr, " -B#s # = max. benchmark filesize (default 1GB) ex. -B4G\n"); 145 | fprintf(stderr, " s = modifier s:K,M,G=(1000, 1.000.000, 1.000.000.000) s:k,m,h=(1024,1Mb,1Gb). (default m) ex. 64k or 64K\n"); 146 | fprintf(stderr, "Benchmark:\n"); 147 | fprintf(stderr, " -i#/-j# # = Minimum de/compression iterations per run (default=auto)\n"); 148 | fprintf(stderr, " -I#/-J# # = Number of de/compression runs (default=3)\n"); 149 | fprintf(stderr, " -e# # = function id\n"); 150 | exit(0); 151 | } 152 | 153 | int main(int argc, char* argv[]) { 154 | unsigned cmp=1, b = 1 << 30, esize=4, lz=0, fno,id=0; 155 | unsigned char *scmd = NULL; 156 | int c, digit_optind = 0, this_option_optind = optind ? 
optind : 1, option_index = 0;
157 | static struct option long_options[] = { {"blocksize", 0, 0, 'b'}, {0, 0, 0} };
158 | for(;;) {
159 | if((c = getopt_long(argc, argv, "B:ce:i:I:j:J:q:s:z", long_options, &option_index)) == -1) break;
160 | switch(c) {
161 | case 0 : printf("Option %s", long_options[option_index].name); if(optarg) printf (" with arg %s", optarg); printf ("\n"); break;
162 | case 'e': scmd = optarg; break;
163 | case 's': esize = atoi(optarg); break;
164 | case 'i': if((tm_rep = atoi(optarg))<=0) tm_rep =tm_Rep=1; break;
165 | case 'I': if((tm_Rep = atoi(optarg))<=0) tm_rep =tm_Rep=1; break;
166 | case 'j': if((tm_rep2 = atoi(optarg))<=0) tm_rep2=tm_Rep2=1; break;
167 | case 'J': if((tm_Rep2 = atoi(optarg))<=0) tm_rep2=tm_Rep2=1; break;
168 | case 'B': b = argtoi(optarg,1); break;
169 | case 'z': lz++; break;
170 | case 'c': cmp++; break;
171 | case 'q': cpuini(atoi(optarg)); break;
172 | default:
173 | usage(argv[0]);
174 | exit(0);
175 | }
176 | }
177 | 
178 | printf("tm_verbose=%d ", (int)tm_verbose);
179 | if(argc - optind < 1) { fprintf(stderr, "File not specified\n"); exit(-1); }
180 | {
181 | unsigned char *in,*out,*cpy;
182 | uint64_t totlen=0,tot[3]={0};
183 | for(fno = optind; fno < argc; fno++) {
184 | uint64_t flen;
185 | int n,i;
186 | char *inname = argv[fno];
187 | FILE *fi = fopen(inname, "rb"); if(!fi ) { perror(inname); continue; }
188 | fseek(fi, 0, SEEK_END);
189 | flen = ftell(fi);
190 | fseek(fi, 0, SEEK_SET);
191 | 
192 | if(flen > b) flen = b;
193 | n = flen;
194 | if(!(in = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); } cpy = in;
195 | if(!(out = (unsigned char*)malloc(flen*4/3+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
196 | if(cmp && !(cpy = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
197 | n = fread(in, 1, n, fi); printf("File='%s' Length=%u\n", inname, n);
198 | fclose(fi);
199 | if(n <= 0) exit(0);
200 | if(fno == optind) {
201 | tm_init(tm_Rep, 2);
202 | tpini(0);
203 | printf("size=%u, element size=%d. 
detected simd=%s\n\n", n, esize, cpustr(cpuini(0)));
204 | }
205 | printf(" E MB/s D MB/s function (size=%d )\n", esize);
206 | char *p = scmd?scmd:"1-10";
207 | do {
208 | unsigned id = strtoul(p, &p, 10),idx = id, i;
209 | while(isspace(*p)) p++; if(*p == '-') { if((idx = strtoul(p+1, &p, 10)) < id) idx = id; if(idx > ID_MEMCPY) idx = ID_MEMCPY; }
210 | for(i = id; i <= idx; i++) {
211 | bench(in,n,out,esize,cpy,i);
212 | 
213 | if(lz) {
214 | unsigned char *tmp; int rc;
215 | totlen += n;
216 | // Test Transpose + lz
217 | if(!(tmp = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
218 | #ifdef LZ4_ON
219 | memrcpy(cpy,in,n); TMBENCH("lz4",rc = LZ4_compress((char *)in, (char *)out, n) ,n); tot[0]+=rc; TMBENCH("",LZ4_decompress_fast((char *)out,(char *)cpy,n) ,n); memcheck(in,n,cpy);
220 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
221 | 
222 | memrcpy(cpy,in,n); TMBENCH("tpbyte+lz4",rc = tplz4enc(in, n,out,esize,tmp) ,n); tot[0]+=rc; TMBENCH("",tplz4dec(out,n,cpy,esize,tmp) ,n); memcheck(in,n,cpy);
223 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
224 | #ifdef USE_SSE
225 | memrcpy(cpy,in,n); TMBENCH("tpnibble+lz4",rc = tp4lz4enc(in, n,out,esize,tmp) ,n); tot[1]+=rc; TMBENCH("",tp4lz4dec(out,n,cpy,esize,tmp) ,n); memcheck(in,n,cpy);
226 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
227 | #endif
228 | #endif
229 | 
230 | #ifdef BITSHUFFLE
231 | memrcpy(cpy,in,n); TMBENCH("bitshuffle+lz4",rc=bslz4enc(in,n,out,esize,tmp), n); tot[2] += rc; TMBENCH("",bslz4dec(out,n,cpy,esize,tmp), n); memcheck(in,n,cpy);
232 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
233 | #endif
234 | printf("\n");
235 | free(tmp);
236 | }
237 | }
238 | } while(*p++);
239 | if(lz) {
240 | #ifdef LZ4_ON
241 | printf("tplz4enc : compressed len=%llu ratio=%.2f %%\n", (unsigned long long)tot[0], (double)(tot[0]*100.0)/(double)totlen);
242 | #ifdef USE_SSE
243 | printf("tp4lz4enc : compressed len=%llu ratio=%.2f %%\n", (unsigned long long)tot[1], (double)(tot[1]*100.0)/(double)totlen);
244 | #endif
245 | #endif
246 | #ifdef BITSHUFFLE
247 | printf("bshuf_compress_lz4: compressed len=%llu ratio=%.2f %%\n", (unsigned long long)tot[2], (double)(tot[2]*100.0)/(double)totlen);
248 | #endif
249 | }
250 | }
251 | }
252 | }
-------------------------------------------------------------------------------- /transpose.h: --------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 | 
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 | 
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 | 
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. 
18 | 
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // transpose.h - Byte/Nibble transpose for further compressing with lz77 or other compressors
25 | #ifdef __cplusplus
26 | extern "C" {
27 | #endif
28 | // Syntax
29 | // in : Input buffer
30 | // n : Total number of bytes in input buffer
31 | // out : output buffer
32 | // esize : element size in bytes (ex. 2, 4, 8,... )
33 | 
34 | //---------- High level functions with dynamic cpu detection and JIT scalar/sse/avx2 switching
35 | void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize); // transpose
36 | void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize); // reverse transpose
37 | 
38 | void tp2denc(unsigned char *in, unsigned x, unsigned y, unsigned char *out, unsigned esize); //2D transpose
39 | void tp2ddec(unsigned char *in, unsigned x, unsigned y, unsigned char *out, unsigned esize);
40 | void tp3denc(unsigned char *in, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize); //3D transpose
41 | void tp3ddec(unsigned char *in, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize);
42 | void tp4denc(unsigned char *in, unsigned w, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize); //4D transpose
43 | void tp4ddec(unsigned char *in, unsigned w, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize);
44 | 
45 | // Nibble transpose SIMD (SSE2,AVX2, ARM Neon)
46 | void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
47 | void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
48 | 
49 | // bit transpose
50 | //void tp1enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
51 | //void tp1dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
52 | 
53 | //---------- Low level functions ------------------------------------
54 | void tpenc2( unsigned char *in, unsigned n, unsigned char *out); // scalar
55 | void tpenc3( unsigned char *in, unsigned n, unsigned char *out);
56 | void tpenc4( unsigned char *in, unsigned n, unsigned char *out);
57 | void tpenc8( unsigned char *in, unsigned n, unsigned char *out);
58 | void tpenc16( unsigned char *in, unsigned n, unsigned char *out);
59 | 
60 | void tpdec2( unsigned char *in, unsigned n, unsigned char *out);
61 | void tpdec3( unsigned char *in, unsigned n, unsigned char *out);
62 | void tpdec4( unsigned char *in, unsigned n, unsigned char *out);
63 | void tpdec8( unsigned char *in, unsigned n, unsigned char *out);
64 | void tpdec16( unsigned char *in, unsigned n, unsigned char *out);
65 | 
66 | void tpenc128v2( unsigned char *in, unsigned n, unsigned char *out); // sse2
67 | void tpdec128v2( unsigned char *in, unsigned n, unsigned char *out);
68 | void tpenc128v4( unsigned char *in, unsigned n, unsigned char *out);
69 | void tpdec128v4( unsigned char *in, unsigned n, unsigned char *out);
70 | void tpenc128v8( unsigned char *in, unsigned n, unsigned char *out);
71 | void tpdec128v8( unsigned char *in, unsigned n, unsigned char *out);
72 | 
73 | void tp4enc128v2( unsigned char *in, unsigned n, unsigned char *out);
74 | void tp4dec128v2( unsigned char *in, unsigned n, unsigned char *out);
75 | void tp4enc128v4( unsigned char *in, unsigned n, unsigned char *out);
76 | void tp4dec128v4( unsigned char *in, unsigned n, unsigned char *out);
77 | 
void tp4enc128v8( unsigned char *in, unsigned n, unsigned char *out);
78 | void tp4dec128v8( unsigned char *in, unsigned n, unsigned char *out);
79 | 
80 | void tp1enc128v2( unsigned char *in, unsigned n, unsigned char *out);
81 | void tp1dec128v2( unsigned char *in, unsigned n, unsigned char *out);
82 | void tp1enc128v4( unsigned char *in, unsigned n, unsigned char *out);
83 | void tp1dec128v4( unsigned char *in, unsigned n, unsigned char *out);
84 | void tp1enc128v8( unsigned char *in, unsigned n, unsigned char *out);
85 | void tp1dec128v8( unsigned char *in, unsigned n, unsigned char *out);
86 | 
87 | void tpenc256v2( unsigned char *in, unsigned n, unsigned char *out); // avx2
88 | void tpdec256v2( unsigned char *in, unsigned n, unsigned char *out);
89 | void tpenc256v4( unsigned char *in, unsigned n, unsigned char *out);
90 | void tpdec256v4( unsigned char *in, unsigned n, unsigned char *out);
91 | void tpenc256v8( unsigned char *in, unsigned n, unsigned char *out);
92 | void tpdec256v8( unsigned char *in, unsigned n, unsigned char *out);
93 | 
94 | void tp4enc256v2( unsigned char *in, unsigned n, unsigned char *out);
95 | void tp4dec256v2( unsigned char *in, unsigned n, unsigned char *out);
96 | void tp4enc256v4( unsigned char *in, unsigned n, unsigned char *out);
97 | void tp4dec256v4( unsigned char *in, unsigned n, unsigned char *out);
98 | void tp4enc256v8( unsigned char *in, unsigned n, unsigned char *out);
99 | void tp4dec256v8( unsigned char *in, unsigned n, unsigned char *out);
100 | 
101 | //------- CPU instruction set
102 | // cpuiset = 0: return current simd set,
103 | // cpuiset != 0: set simd set 0:scalar, 20:sse2, 52:avx2
104 | int cpuini(int cpuiset);
105 | 
106 | // convert simd set to string "sse2", "sse3", "sse4.1" or "avx2"
107 | // Ex.: printf("current cpu set=%s\n", cpustr(cpuini(0)) );
108 | char *cpustr(int cpuiset);
109 | 
110 | #ifdef __cplusplus
111 | }
112 | #endif
-------------------------------------------------------------------------------- /vs/getopt.c: --------------------------------------------------------------------------------
1 | /* $OpenBSD: getopt_long.c,v 1.23 2007/10/31 12:34:57 chl Exp $ */
2 | /* $NetBSD: getopt_long.c,v 1.15 2002/01/31 22:43:40 tv Exp $ */
3 | 
4 | /*
5 | * Copyright (c) 2002 Todd C. Miller <Todd.Miller@courtesan.com>
6 | *
7 | * Permission to use, copy, modify, and distribute this software for any
8 | * purpose with or without fee is hereby granted, provided that the above
9 | * copyright notice and this permission notice appear in all copies.
10 | *
11 | * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
12 | * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
13 | * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
14 | * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
15 | * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
16 | * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
17 | * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
18 | *
19 | * Sponsored in part by the Defense Advanced Research Projects
20 | * Agency (DARPA) and Air Force Research Laboratory, Air Force
21 | * Materiel Command, USAF, under agreement number F39502-99-1-0512.
22 | */
23 | /*-
24 | * Copyright (c) 2000 The NetBSD Foundation, Inc.
25 | * All rights reserved.
26 | *
27 | * This code is derived from software contributed to The NetBSD Foundation
28 | * by Dieter Baron and Thomas Klausner. 
29 | *
30 | * Redistribution and use in source and binary forms, with or without
31 | * modification, are permitted provided that the following conditions
32 | * are met:
33 | * 1. Redistributions of source code must retain the above copyright
34 | * notice, this list of conditions and the following disclaimer.
35 | * 2. Redistributions in binary form must reproduce the above copyright
36 | * notice, this list of conditions and the following disclaimer in the
37 | * documentation and/or other materials provided with the distribution.
38 | *
39 | * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
40 | * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
41 | * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
42 | * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
43 | * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
44 | * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
45 | * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
46 | * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
47 | * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
48 | * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
49 | * POSSIBILITY OF SUCH DAMAGE.
50 | */
51 | 
52 | #include <errno.h>
53 | #include <stdlib.h>
54 | #include <string.h>
55 | #include "getopt.h"
56 | #include <stdarg.h>
57 | #include <stdio.h>
58 | #include <windows.h>
59 | 
60 | #define REPLACE_GETOPT /* use this getopt as the system getopt(3) */
61 | 
62 | #ifdef REPLACE_GETOPT
63 | int opterr = 1; /* if error message should be printed */
64 | int optind = 1; /* index into parent argv vector */
65 | int optopt = '?'; /* character checked for validity */
66 | #undef optreset /* see getopt.h */
67 | #define optreset __mingw_optreset
68 | int optreset; /* reset getopt */
69 | char *optarg; /* argument associated with option */
70 | #endif
71 | 
72 | #define PRINT_ERROR ((opterr) && (*options != ':'))
73 | 
74 | #define FLAG_PERMUTE 0x01 /* permute non-options to the end of argv */
75 | #define FLAG_ALLARGS 0x02 /* treat non-options as args to option "-1" */
76 | #define FLAG_LONGONLY 0x04 /* operate as getopt_long_only */
77 | 
78 | /* return values */
79 | #define BADCH (int)'?'
80 | #define BADARG ((*options == ':') ? 
(int)':' : (int)'?') 81 | #define INORDER (int)1 82 | 83 | #ifndef __CYGWIN__ 84 | #define __progname __argv[0] 85 | #else 86 | extern char __declspec(dllimport) *__progname; 87 | #endif 88 | 89 | #ifdef __CYGWIN__ 90 | static char EMSG[] = ""; 91 | #else 92 | #define EMSG "" 93 | #endif 94 | 95 | static int getopt_internal(int, char * const *, const char *, 96 | const struct option *, int *, int); 97 | static int parse_long_options(char * const *, const char *, 98 | const struct option *, int *, int); 99 | static int gcd(int, int); 100 | static void permute_args(int, int, int, char * const *); 101 | 102 | static char *place = EMSG; /* option letter processing */ 103 | 104 | /* XXX: set optreset to 1 rather than these two */ 105 | static int nonopt_start = -1; /* first non option argument (for permute) */ 106 | static int nonopt_end = -1; /* first option after non options (for permute) */ 107 | 108 | /* Error messages */ 109 | static const char recargchar[] = "option requires an argument -- %c"; 110 | static const char recargstring[] = "option requires an argument -- %s"; 111 | static const char ambig[] = "ambiguous option -- %.*s"; 112 | static const char noarg[] = "option doesn't take an argument -- %.*s"; 113 | static const char illoptchar[] = "unknown option -- %c"; 114 | static const char illoptstring[] = "unknown option -- %s"; 115 | 116 | static void 117 | _vwarnx(const char *fmt,va_list ap) 118 | { 119 | (void)fprintf(stderr,"%s: ",__progname); 120 | if (fmt != NULL) 121 | (void)vfprintf(stderr,fmt,ap); 122 | (void)fprintf(stderr,"\n"); 123 | } 124 | 125 | static void 126 | warnx(const char *fmt,...) 127 | { 128 | va_list ap; 129 | va_start(ap,fmt); 130 | _vwarnx(fmt,ap); 131 | va_end(ap); 132 | } 133 | 134 | /* 135 | * Compute the greatest common divisor of a and b. 136 | */ 137 | static int 138 | gcd(int a, int b) 139 | { 140 | int c; 141 | 142 | c = a % b; 143 | while (c != 0) { 144 | a = b; 145 | b = c; 146 | c = a % b; 147 | } 148 | 149 | return (b); 150 | } 151 | 152 | /* 153 | * Exchange the block from nonopt_start to nonopt_end with the block 154 | * from nonopt_end to opt_end (keeping the same order of arguments 155 | * in each block). 156 | */ 157 | static void 158 | permute_args(int panonopt_start, int panonopt_end, int opt_end, 159 | char * const *nargv) 160 | { 161 | int cstart, cyclelen, i, j, ncycle, nnonopts, nopts, pos; 162 | char *swap; 163 | 164 | /* 165 | * compute lengths of blocks and number and size of cycles 166 | */ 167 | nnonopts = panonopt_end - panonopt_start; 168 | nopts = opt_end - panonopt_end; 169 | ncycle = gcd(nnonopts, nopts); 170 | cyclelen = (opt_end - panonopt_start) / ncycle; 171 | 172 | for (i = 0; i < ncycle; i++) { 173 | cstart = panonopt_end+i; 174 | pos = cstart; 175 | for (j = 0; j < cyclelen; j++) { 176 | if (pos >= panonopt_end) 177 | pos -= nnonopts; 178 | else 179 | pos += nopts; 180 | swap = nargv[pos]; 181 | /* LINTED const cast */ 182 | ((char **) nargv)[pos] = nargv[cstart]; 183 | /* LINTED const cast */ 184 | ((char **)nargv)[cstart] = swap; 185 | } 186 | } 187 | } 188 | 189 | /* 190 | * parse_long_options -- 191 | * Parse long options in argc/argv argument vector. 192 | * Returns -1 if short_too is set and the option does not match long_options. 
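 * (Added note, not in the original BSD source: on a successful match this
 * routine updates optarg/optind and returns the matched option's val, or 0
 * when the option's flag pointer is set; '?' or ':' are returned on error,
 * as for short options.)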
193 | */ 194 | static int 195 | parse_long_options(char * const *nargv, const char *options, 196 | const struct option *long_options, int *idx, int short_too) 197 | { 198 | char *current_argv, *has_equal; 199 | size_t current_argv_len; 200 | int i, ambiguous, match; 201 | 202 | #define IDENTICAL_INTERPRETATION(_x, _y) \ 203 | (long_options[(_x)].has_arg == long_options[(_y)].has_arg && \ 204 | long_options[(_x)].flag == long_options[(_y)].flag && \ 205 | long_options[(_x)].val == long_options[(_y)].val) 206 | 207 | current_argv = place; 208 | match = -1; 209 | ambiguous = 0; 210 | 211 | optind++; 212 | 213 | if ((has_equal = strchr(current_argv, '=')) != NULL) { 214 | /* argument found (--option=arg) */ 215 | current_argv_len = has_equal - current_argv; 216 | has_equal++; 217 | } else 218 | current_argv_len = strlen(current_argv); 219 | 220 | for (i = 0; long_options[i].name; i++) { 221 | /* find matching long option */ 222 | if (strncmp(current_argv, long_options[i].name, 223 | current_argv_len)) 224 | continue; 225 | 226 | if (strlen(long_options[i].name) == current_argv_len) { 227 | /* exact match */ 228 | match = i; 229 | ambiguous = 0; 230 | break; 231 | } 232 | /* 233 | * If this is a known short option, don't allow 234 | * a partial match of a single character. 235 | */ 236 | if (short_too && current_argv_len == 1) 237 | continue; 238 | 239 | if (match == -1) /* partial match */ 240 | match = i; 241 | else if (!IDENTICAL_INTERPRETATION(i, match)) 242 | ambiguous = 1; 243 | } 244 | if (ambiguous) { 245 | /* ambiguous abbreviation */ 246 | if (PRINT_ERROR) 247 | warnx(ambig, (int)current_argv_len, 248 | current_argv); 249 | optopt = 0; 250 | return (BADCH); 251 | } 252 | if (match != -1) { /* option found */ 253 | if (long_options[match].has_arg == no_argument 254 | && has_equal) { 255 | if (PRINT_ERROR) 256 | warnx(noarg, (int)current_argv_len, 257 | current_argv); 258 | /* 259 | * XXX: GNU sets optopt to val regardless of flag 260 | */ 261 | if (long_options[match].flag == NULL) 262 | optopt = long_options[match].val; 263 | else 264 | optopt = 0; 265 | return (BADARG); 266 | } 267 | if (long_options[match].has_arg == required_argument || 268 | long_options[match].has_arg == optional_argument) { 269 | if (has_equal) 270 | optarg = has_equal; 271 | else if (long_options[match].has_arg == 272 | required_argument) { 273 | /* 274 | * optional argument doesn't use next nargv 275 | */ 276 | optarg = nargv[optind++]; 277 | } 278 | } 279 | if ((long_options[match].has_arg == required_argument) 280 | && (optarg == NULL)) { 281 | /* 282 | * Missing argument; leading ':' indicates no error 283 | * should be generated. 284 | */ 285 | if (PRINT_ERROR) 286 | warnx(recargstring, 287 | current_argv); 288 | /* 289 | * XXX: GNU sets optopt to val regardless of flag 290 | */ 291 | if (long_options[match].flag == NULL) 292 | optopt = long_options[match].val; 293 | else 294 | optopt = 0; 295 | --optind; 296 | return (BADARG); 297 | } 298 | } else { /* unknown option */ 299 | if (short_too) { 300 | --optind; 301 | return (-1); 302 | } 303 | if (PRINT_ERROR) 304 | warnx(illoptstring, current_argv); 305 | optopt = 0; 306 | return (BADCH); 307 | } 308 | if (idx) 309 | *idx = match; 310 | if (long_options[match].flag) { 311 | *long_options[match].flag = long_options[match].val; 312 | return (0); 313 | } else 314 | return (long_options[match].val); 315 | #undef IDENTICAL_INTERPRETATION 316 | } 317 | 318 | /* 319 | * getopt_internal -- 320 | * Parse argc/argv argument vector. Called by user level routines. 
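 * (Added note, not in the original BSD source: returns the matched option
 * character, -1 when argument scanning ends, '?' for an unknown option, and
 * ':' or '?' for a missing option argument, depending on whether the options
 * string begins with ':'.)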
321 | */ 322 | static int 323 | getopt_internal(int nargc, char * const *nargv, const char *options, 324 | const struct option *long_options, int *idx, int flags) 325 | { 326 | char *oli; /* option letter list index */ 327 | int optchar, short_too; 328 | static int posixly_correct = -1; 329 | 330 | if (options == NULL) 331 | return (-1); 332 | 333 | /* 334 | * XXX Some GNU programs (like cvs) set optind to 0 instead of 335 | * XXX using optreset. Work around this braindamage. 336 | */ 337 | if (optind == 0) 338 | optind = optreset = 1; 339 | 340 | /* 341 | * Disable GNU extensions if POSIXLY_CORRECT is set or options 342 | * string begins with a '+'. 343 | * 344 | * CV, 2009-12-14: Check POSIXLY_CORRECT anew if optind == 0 or 345 | * optreset != 0 for GNU compatibility. 346 | */ 347 | if (posixly_correct == -1 || optreset != 0) 348 | posixly_correct = (getenv("POSIXLY_CORRECT") != NULL); 349 | if (*options == '-') 350 | flags |= FLAG_ALLARGS; 351 | else if (posixly_correct || *options == '+') 352 | flags &= ~FLAG_PERMUTE; 353 | if (*options == '+' || *options == '-') 354 | options++; 355 | 356 | optarg = NULL; 357 | if (optreset) 358 | nonopt_start = nonopt_end = -1; 359 | start: 360 | if (optreset || !*place) { /* update scanning pointer */ 361 | optreset = 0; 362 | if (optind >= nargc) { /* end of argument vector */ 363 | place = EMSG; 364 | if (nonopt_end != -1) { 365 | /* do permutation, if we have to */ 366 | permute_args(nonopt_start, nonopt_end, 367 | optind, nargv); 368 | optind -= nonopt_end - nonopt_start; 369 | } 370 | else if (nonopt_start != -1) { 371 | /* 372 | * If we skipped non-options, set optind 373 | * to the first of them. 374 | */ 375 | optind = nonopt_start; 376 | } 377 | nonopt_start = nonopt_end = -1; 378 | return (-1); 379 | } 380 | if (*(place = nargv[optind]) != '-' || 381 | (place[1] == '\0' && strchr(options, '-') == NULL)) { 382 | place = EMSG; /* found non-option */ 383 | if (flags & FLAG_ALLARGS) { 384 | /* 385 | * GNU extension: 386 | * return non-option as argument to option 1 387 | */ 388 | optarg = nargv[optind++]; 389 | return (INORDER); 390 | } 391 | if (!(flags & FLAG_PERMUTE)) { 392 | /* 393 | * If no permutation wanted, stop parsing 394 | * at first non-option. 395 | */ 396 | return (-1); 397 | } 398 | /* do permutation */ 399 | if (nonopt_start == -1) 400 | nonopt_start = optind; 401 | else if (nonopt_end != -1) { 402 | permute_args(nonopt_start, nonopt_end, 403 | optind, nargv); 404 | nonopt_start = optind - 405 | (nonopt_end - nonopt_start); 406 | nonopt_end = -1; 407 | } 408 | optind++; 409 | /* process next argument */ 410 | goto start; 411 | } 412 | if (nonopt_start != -1 && nonopt_end == -1) 413 | nonopt_end = optind; 414 | 415 | /* 416 | * If we have "-" do nothing, if "--" we are done. 417 | */ 418 | if (place[1] != '\0' && *++place == '-' && place[1] == '\0') { 419 | optind++; 420 | place = EMSG; 421 | /* 422 | * We found an option (--), so if we skipped 423 | * non-options, we have to permute. 
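 * (Added note, not in the original BSD source: permute_args() moves the
 * already-scanned options in front of the skipped non-options, and optind
 * is then pulled back so it points at the first non-option argument.)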
424 | */ 425 | if (nonopt_end != -1) { 426 | permute_args(nonopt_start, nonopt_end, 427 | optind, nargv); 428 | optind -= nonopt_end - nonopt_start; 429 | } 430 | nonopt_start = nonopt_end = -1; 431 | return (-1); 432 | } 433 | } 434 | 435 | /* 436 | * Check long options if: 437 | * 1) we were passed some 438 | * 2) the arg is not just "-" 439 | * 3) either the arg starts with -- we are getopt_long_only() 440 | */ 441 | if (long_options != NULL && place != nargv[optind] && 442 | (*place == '-' || (flags & FLAG_LONGONLY))) { 443 | short_too = 0; 444 | if (*place == '-') 445 | place++; /* --foo long option */ 446 | else if (*place != ':' && strchr(options, *place) != NULL) 447 | short_too = 1; /* could be short option too */ 448 | 449 | optchar = parse_long_options(nargv, options, long_options, 450 | idx, short_too); 451 | if (optchar != -1) { 452 | place = EMSG; 453 | return (optchar); 454 | } 455 | } 456 | 457 | if ((optchar = (int)*place++) == (int)':' || 458 | (optchar == (int)'-' && *place != '\0') || 459 | (oli = strchr(options, optchar)) == NULL) { 460 | /* 461 | * If the user specified "-" and '-' isn't listed in 462 | * options, return -1 (non-option) as per POSIX. 463 | * Otherwise, it is an unknown option character (or ':'). 464 | */ 465 | if (optchar == (int)'-' && *place == '\0') 466 | return (-1); 467 | if (!*place) 468 | ++optind; 469 | if (PRINT_ERROR) 470 | warnx(illoptchar, optchar); 471 | optopt = optchar; 472 | return (BADCH); 473 | } 474 | if (long_options != NULL && optchar == 'W' && oli[1] == ';') { 475 | /* -W long-option */ 476 | if (*place) /* no space */ 477 | /* NOTHING */; 478 | else if (++optind >= nargc) { /* no arg */ 479 | place = EMSG; 480 | if (PRINT_ERROR) 481 | warnx(recargchar, optchar); 482 | optopt = optchar; 483 | return (BADARG); 484 | } else /* white space */ 485 | place = nargv[optind]; 486 | optchar = parse_long_options(nargv, options, long_options, 487 | idx, 0); 488 | place = EMSG; 489 | return (optchar); 490 | } 491 | if (*++oli != ':') { /* doesn't take argument */ 492 | if (!*place) 493 | ++optind; 494 | } else { /* takes (optional) argument */ 495 | optarg = NULL; 496 | if (*place) /* no white space */ 497 | optarg = place; 498 | else if (oli[1] != ':') { /* arg not optional */ 499 | if (++optind >= nargc) { /* no arg */ 500 | place = EMSG; 501 | if (PRINT_ERROR) 502 | warnx(recargchar, optchar); 503 | optopt = optchar; 504 | return (BADARG); 505 | } else 506 | optarg = nargv[optind]; 507 | } 508 | place = EMSG; 509 | ++optind; 510 | } 511 | /* dump back option letter */ 512 | return (optchar); 513 | } 514 | 515 | #ifdef REPLACE_GETOPT 516 | /* 517 | * getopt -- 518 | * Parse argc/argv argument vector. 519 | * 520 | * [eventually this will replace the BSD getopt] 521 | */ 522 | int 523 | getopt(int nargc, char * const *nargv, const char *options) 524 | { 525 | 526 | /* 527 | * We don't pass FLAG_PERMUTE to getopt_internal() since 528 | * the BSD getopt(3) (unlike GNU) has never done this. 529 | * 530 | * Furthermore, since many privileged programs call getopt() 531 | * before dropping privileges it makes sense to keep things 532 | * as simple (and bug-free) as possible. 533 | */ 534 | return (getopt_internal(nargc, nargv, options, NULL, NULL, 0)); 535 | } 536 | #endif /* REPLACE_GETOPT */ 537 | 538 | /* 539 | * getopt_long -- 540 | * Parse argc/argv argument vector. 
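 * (Added note, not in the original BSD source: a thin wrapper around
 * getopt_internal() with FLAG_PERMUTE set, so non-option arguments are
 * permuted to the end of argv, matching GNU getopt behaviour.)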
541 | */
542 | int
543 | getopt_long(int nargc, char * const *nargv, const char *options,
544 | const struct option *long_options, int *idx)
545 | {
546 | 
547 | return (getopt_internal(nargc, nargv, options, long_options, idx,
548 | FLAG_PERMUTE));
549 | }
550 | 
551 | /*
552 | * getopt_long_only --
553 | * Parse argc/argv argument vector.
554 | */
555 | int
556 | getopt_long_only(int nargc, char * const *nargv, const char *options,
557 | const struct option *long_options, int *idx)
558 | {
559 | 
560 | return (getopt_internal(nargc, nargv, options, long_options, idx,
561 | FLAG_PERMUTE|FLAG_LONGONLY));
562 | }
563 | 
-------------------------------------------------------------------------------- /vs/getopt.h: --------------------------------------------------------------------------------
1 | #ifndef __GETOPT_H__
2 | /**
3 | * DISCLAIMER
4 | * This file has no copyright assigned and is placed in the Public Domain.
5 | * This file is a part of the w64 mingw-runtime package.
6 | *
7 | * The w64 mingw-runtime package and its code is distributed in the hope that it
8 | * will be useful but WITHOUT ANY WARRANTY. ALL WARRANTIES, EXPRESSED OR
9 | * IMPLIED ARE HEREBY DISCLAIMED. This includes but is not limited to
10 | * warranties of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | */
12 | 
13 | #define __GETOPT_H__
14 | 
15 | /* All the headers include this file. */
16 | #if _MSC_VER >= 1300
17 | #include <crtdefs.h>
18 | #endif
19 | 
20 | #ifdef __cplusplus
21 | extern "C" {
22 | #endif
23 | 
24 | extern int optind; /* index of first non-option in argv */
25 | extern int optopt; /* single option character, as parsed */
26 | extern int opterr; /* flag to enable built-in diagnostics... */
27 | /* (user may set to zero, to suppress) */
28 | 
29 | extern char *optarg; /* pointer to argument of current option */
30 | 
31 | extern int getopt(int nargc, char * const *nargv, const char *options);
32 | 
33 | #ifdef _BSD_SOURCE
34 | /*
35 | * BSD adds the non-standard `optreset' feature, for reinitialisation
36 | * of `getopt' parsing. We support this feature, for applications which
37 | * proclaim their BSD heritage, before including this header; however,
38 | * to maintain portability, developers are advised to avoid it.
39 | */
40 | # define optreset __mingw_optreset
41 | extern int optreset;
42 | #endif
43 | #ifdef __cplusplus
44 | }
45 | #endif
46 | /*
47 | * POSIX requires the `getopt' API to be specified in `unistd.h';
48 | * thus, `unistd.h' includes this header. However, we do not want
49 | * to expose the `getopt_long' or `getopt_long_only' APIs, when
50 | * included in this manner. Thus, close the standard __GETOPT_H__
51 | * declarations block, and open an additional __GETOPT_LONG_H__
52 | * specific block, only when *not* __UNISTD_H_SOURCED__, in which
53 | * to declare the extended API.
54 | */
55 | #endif /* !defined(__GETOPT_H__) */
56 | 
57 | #if !defined(__UNISTD_H_SOURCED__) && !defined(__GETOPT_LONG_H__)
58 | #define __GETOPT_LONG_H__
59 | 
60 | #ifdef __cplusplus
61 | extern "C" {
62 | #endif
63 | 
64 | struct option /* specification for a long form option... */
65 | {
66 | const char *name; /* option name, without leading hyphens */
67 | int has_arg; /* does it take an argument? */
68 | int *flag; /* where to save its status, or NULL */
69 | int val; /* its associated status value */
70 | };
71 | 
72 | enum /* permitted values for its `has_arg' field... 
*/ 73 | { 74 | no_argument = 0, /* option never takes an argument */ 75 | required_argument, /* option always requires an argument */ 76 | optional_argument /* option may take an argument */ 77 | }; 78 | 79 | extern int getopt_long(int nargc, char * const *nargv, const char *options, 80 | const struct option *long_options, int *idx); 81 | extern int getopt_long_only(int nargc, char * const *nargv, const char *options, 82 | const struct option *long_options, int *idx); 83 | /* 84 | * Previous MinGW implementation had... 85 | */ 86 | #ifndef HAVE_DECL_GETOPT 87 | /* 88 | * ...for the long form API only; keep this for compatibility. 89 | */ 90 | # define HAVE_DECL_GETOPT 1 91 | #endif 92 | 93 | #ifdef __cplusplus 94 | } 95 | #endif 96 | 97 | #endif /* !defined(__UNISTD_H_SOURCED__) && !defined(__GETOPT_LONG_H__) */ 98 | -------------------------------------------------------------------------------- /vs/inttypes.h: -------------------------------------------------------------------------------- 1 | // ISO C9x compliant inttypes.h for Microsoft Visual Studio 2 | // Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124 3 | // 4 | // Copyright (c) 2006-2013 Alexander Chemeris 5 | // 6 | // Redistribution and use in source and binary forms, with or without 7 | // modification, are permitted provided that the following conditions are met: 8 | // 9 | // 1. Redistributions of source code must retain the above copyright notice, 10 | // this list of conditions and the following disclaimer. 11 | // 12 | // 2. Redistributions in binary form must reproduce the above copyright 13 | // notice, this list of conditions and the following disclaimer in the 14 | // documentation and/or other materials provided with the distribution. 15 | // 16 | // 3. Neither the name of the product nor the names of its contributors may 17 | // be used to endorse or promote products derived from this software 18 | // without specific prior written permission. 19 | // 20 | // THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED 21 | // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 22 | // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 23 | // EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 24 | // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 25 | // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 26 | // OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 27 | // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR 28 | // OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 29 | // ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | // 31 | /////////////////////////////////////////////////////////////////////////////// 32 | 33 | #ifndef _MSC_VER // [ 34 | #error "Use this header only with Microsoft Visual C++ compilers!" 
35 | #endif // _MSC_VER ] 36 | 37 | #ifndef _MSC_INTTYPES_H_ // [ 38 | #define _MSC_INTTYPES_H_ 39 | 40 | #if _MSC_VER > 1000 41 | #pragma once 42 | #endif 43 | 44 | #include "stdint.h" 45 | 46 | // 7.8 Format conversion of integer types 47 | 48 | typedef struct { 49 | intmax_t quot; 50 | intmax_t rem; 51 | } imaxdiv_t; 52 | 53 | // 7.8.1 Macros for format specifiers 54 | 55 | #if !defined(__cplusplus) || defined(__STDC_FORMAT_MACROS) // [ See footnote 185 at page 198 56 | 57 | // The fprintf macros for signed integers are: 58 | #define PRId8 "d" 59 | #define PRIi8 "i" 60 | #define PRIdLEAST8 "d" 61 | #define PRIiLEAST8 "i" 62 | #define PRIdFAST8 "d" 63 | #define PRIiFAST8 "i" 64 | 65 | #define PRId16 "hd" 66 | #define PRIi16 "hi" 67 | #define PRIdLEAST16 "hd" 68 | #define PRIiLEAST16 "hi" 69 | #define PRIdFAST16 "hd" 70 | #define PRIiFAST16 "hi" 71 | 72 | #define PRId32 "I32d" 73 | #define PRIi32 "I32i" 74 | #define PRIdLEAST32 "I32d" 75 | #define PRIiLEAST32 "I32i" 76 | #define PRIdFAST32 "I32d" 77 | #define PRIiFAST32 "I32i" 78 | 79 | #define PRId64 "I64d" 80 | #define PRIi64 "I64i" 81 | #define PRIdLEAST64 "I64d" 82 | #define PRIiLEAST64 "I64i" 83 | #define PRIdFAST64 "I64d" 84 | #define PRIiFAST64 "I64i" 85 | 86 | #define PRIdMAX "I64d" 87 | #define PRIiMAX "I64i" 88 | 89 | #define PRIdPTR "Id" 90 | #define PRIiPTR "Ii" 91 | 92 | // The fprintf macros for unsigned integers are: 93 | #define PRIo8 "o" 94 | #define PRIu8 "u" 95 | #define PRIx8 "x" 96 | #define PRIX8 "X" 97 | #define PRIoLEAST8 "o" 98 | #define PRIuLEAST8 "u" 99 | #define PRIxLEAST8 "x" 100 | #define PRIXLEAST8 "X" 101 | #define PRIoFAST8 "o" 102 | #define PRIuFAST8 "u" 103 | #define PRIxFAST8 "x" 104 | #define PRIXFAST8 "X" 105 | 106 | #define PRIo16 "ho" 107 | #define PRIu16 "hu" 108 | #define PRIx16 "hx" 109 | #define PRIX16 "hX" 110 | #define PRIoLEAST16 "ho" 111 | #define PRIuLEAST16 "hu" 112 | #define PRIxLEAST16 "hx" 113 | #define PRIXLEAST16 "hX" 114 | #define PRIoFAST16 "ho" 115 | #define PRIuFAST16 "hu" 116 | #define PRIxFAST16 "hx" 117 | #define PRIXFAST16 "hX" 118 | 119 | #define PRIo32 "I32o" 120 | #define PRIu32 "I32u" 121 | #define PRIx32 "I32x" 122 | #define PRIX32 "I32X" 123 | #define PRIoLEAST32 "I32o" 124 | #define PRIuLEAST32 "I32u" 125 | #define PRIxLEAST32 "I32x" 126 | #define PRIXLEAST32 "I32X" 127 | #define PRIoFAST32 "I32o" 128 | #define PRIuFAST32 "I32u" 129 | #define PRIxFAST32 "I32x" 130 | #define PRIXFAST32 "I32X" 131 | 132 | #define PRIo64 "I64o" 133 | #define PRIu64 "I64u" 134 | #define PRIx64 "I64x" 135 | #define PRIX64 "I64X" 136 | #define PRIoLEAST64 "I64o" 137 | #define PRIuLEAST64 "I64u" 138 | #define PRIxLEAST64 "I64x" 139 | #define PRIXLEAST64 "I64X" 140 | #define PRIoFAST64 "I64o" 141 | #define PRIuFAST64 "I64u" 142 | #define PRIxFAST64 "I64x" 143 | #define PRIXFAST64 "I64X" 144 | 145 | #define PRIoMAX "I64o" 146 | #define PRIuMAX "I64u" 147 | #define PRIxMAX "I64x" 148 | #define PRIXMAX "I64X" 149 | 150 | #define PRIoPTR "Io" 151 | #define PRIuPTR "Iu" 152 | #define PRIxPTR "Ix" 153 | #define PRIXPTR "IX" 154 | 155 | // The fscanf macros for signed integers are: 156 | #define SCNd8 "d" 157 | #define SCNi8 "i" 158 | #define SCNdLEAST8 "d" 159 | #define SCNiLEAST8 "i" 160 | #define SCNdFAST8 "d" 161 | #define SCNiFAST8 "i" 162 | 163 | #define SCNd16 "hd" 164 | #define SCNi16 "hi" 165 | #define SCNdLEAST16 "hd" 166 | #define SCNiLEAST16 "hi" 167 | #define SCNdFAST16 "hd" 168 | #define SCNiFAST16 "hi" 169 | 170 | #define SCNd32 "ld" 171 | #define SCNi32 "li" 172 | #define SCNdLEAST32 
"ld" 173 | #define SCNiLEAST32 "li" 174 | #define SCNdFAST32 "ld" 175 | #define SCNiFAST32 "li" 176 | 177 | #define SCNd64 "I64d" 178 | #define SCNi64 "I64i" 179 | #define SCNdLEAST64 "I64d" 180 | #define SCNiLEAST64 "I64i" 181 | #define SCNdFAST64 "I64d" 182 | #define SCNiFAST64 "I64i" 183 | 184 | #define SCNdMAX "I64d" 185 | #define SCNiMAX "I64i" 186 | 187 | #ifdef _WIN64 // [ 188 | # define SCNdPTR "I64d" 189 | # define SCNiPTR "I64i" 190 | #else // _WIN64 ][ 191 | # define SCNdPTR "ld" 192 | # define SCNiPTR "li" 193 | #endif // _WIN64 ] 194 | 195 | // The fscanf macros for unsigned integers are: 196 | #define SCNo8 "o" 197 | #define SCNu8 "u" 198 | #define SCNx8 "x" 199 | #define SCNX8 "X" 200 | #define SCNoLEAST8 "o" 201 | #define SCNuLEAST8 "u" 202 | #define SCNxLEAST8 "x" 203 | #define SCNXLEAST8 "X" 204 | #define SCNoFAST8 "o" 205 | #define SCNuFAST8 "u" 206 | #define SCNxFAST8 "x" 207 | #define SCNXFAST8 "X" 208 | 209 | #define SCNo16 "ho" 210 | #define SCNu16 "hu" 211 | #define SCNx16 "hx" 212 | #define SCNX16 "hX" 213 | #define SCNoLEAST16 "ho" 214 | #define SCNuLEAST16 "hu" 215 | #define SCNxLEAST16 "hx" 216 | #define SCNXLEAST16 "hX" 217 | #define SCNoFAST16 "ho" 218 | #define SCNuFAST16 "hu" 219 | #define SCNxFAST16 "hx" 220 | #define SCNXFAST16 "hX" 221 | 222 | #define SCNo32 "lo" 223 | #define SCNu32 "lu" 224 | #define SCNx32 "lx" 225 | #define SCNX32 "lX" 226 | #define SCNoLEAST32 "lo" 227 | #define SCNuLEAST32 "lu" 228 | #define SCNxLEAST32 "lx" 229 | #define SCNXLEAST32 "lX" 230 | #define SCNoFAST32 "lo" 231 | #define SCNuFAST32 "lu" 232 | #define SCNxFAST32 "lx" 233 | #define SCNXFAST32 "lX" 234 | 235 | #define SCNo64 "I64o" 236 | #define SCNu64 "I64u" 237 | #define SCNx64 "I64x" 238 | #define SCNX64 "I64X" 239 | #define SCNoLEAST64 "I64o" 240 | #define SCNuLEAST64 "I64u" 241 | #define SCNxLEAST64 "I64x" 242 | #define SCNXLEAST64 "I64X" 243 | #define SCNoFAST64 "I64o" 244 | #define SCNuFAST64 "I64u" 245 | #define SCNxFAST64 "I64x" 246 | #define SCNXFAST64 "I64X" 247 | 248 | #define SCNoMAX "I64o" 249 | #define SCNuMAX "I64u" 250 | #define SCNxMAX "I64x" 251 | #define SCNXMAX "I64X" 252 | 253 | #ifdef _WIN64 // [ 254 | # define SCNoPTR "I64o" 255 | # define SCNuPTR "I64u" 256 | # define SCNxPTR "I64x" 257 | # define SCNXPTR "I64X" 258 | #else // _WIN64 ][ 259 | # define SCNoPTR "lo" 260 | # define SCNuPTR "lu" 261 | # define SCNxPTR "lx" 262 | # define SCNXPTR "lX" 263 | #endif // _WIN64 ] 264 | 265 | #endif // __STDC_FORMAT_MACROS ] 266 | 267 | // 7.8.2 Functions for greatest-width integer types 268 | 269 | // 7.8.2.1 The imaxabs function 270 | #define imaxabs _abs64 271 | 272 | // 7.8.2.2 The imaxdiv function 273 | 274 | // This is modified version of div() function from Microsoft's div.c found 275 | // in %MSVC.NET%\crt\src\div.c 276 | #ifdef STATIC_IMAXDIV // [ 277 | static 278 | #else // STATIC_IMAXDIV ][ 279 | _inline 280 | #endif // STATIC_IMAXDIV ] 281 | imaxdiv_t __cdecl imaxdiv(intmax_t numer, intmax_t denom) 282 | { 283 | imaxdiv_t result; 284 | 285 | result.quot = numer / denom; 286 | result.rem = numer % denom; 287 | 288 | if (numer < 0 && result.rem > 0) { 289 | // did division wrong; must fix up 290 | ++result.quot; 291 | result.rem -= denom; 292 | } 293 | 294 | return result; 295 | } 296 | 297 | // 7.8.2.3 The strtoimax and strtoumax functions 298 | #define strtoimax _strtoi64 299 | #define strtoumax _strtoui64 300 | 301 | // 7.8.2.4 The wcstoimax and wcstoumax functions 302 | #define wcstoimax _wcstoi64 303 | #define wcstoumax _wcstoui64 304 | 305 | 
306 | #endif // _MSC_INTTYPES_H_ ]
307 |
--------------------------------------------------------------------------------
/vs/stdint.h:
--------------------------------------------------------------------------------
1 | // ISO C9x compliant stdint.h for Microsoft Visual Studio
2 | // Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
3 | //
4 | // Copyright (c) 2006-2013 Alexander Chemeris
5 | //
6 | // Redistribution and use in source and binary forms, with or without
7 | // modification, are permitted provided that the following conditions are met:
8 | //
9 | // 1. Redistributions of source code must retain the above copyright notice,
10 | // this list of conditions and the following disclaimer.
11 | //
12 | // 2. Redistributions in binary form must reproduce the above copyright
13 | // notice, this list of conditions and the following disclaimer in the
14 | // documentation and/or other materials provided with the distribution.
15 | //
16 | // 3. Neither the name of the product nor the names of its contributors may
17 | // be used to endorse or promote products derived from this software
18 | // without specific prior written permission.
19 | //
20 | // THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
21 | // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
22 | // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
23 | // EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24 | // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
25 | // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
26 | // OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
27 | // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
28 | // OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
29 | // ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | //
31 | ///////////////////////////////////////////////////////////////////////////////
32 |
33 | #ifndef _MSC_VER // [
34 | #error "Use this header only with Microsoft Visual C++ compilers!"
35 | #endif // _MSC_VER ]
36 |
37 | #ifndef _MSC_STDINT_H_ // [
38 | #define _MSC_STDINT_H_
39 |
40 | #if _MSC_VER > 1000
41 | #pragma once
42 | #endif
43 |
44 | #if _MSC_VER >= 1600 // [
45 | #include <stdint.h>
46 | #else // ] _MSC_VER >= 1600 [
47 |
48 | #include <limits.h>
49 |
50 | // For Visual Studio 6 in C++ mode and for many Visual Studio versions when
51 | // compiling for ARM we should wrap <wchar.h> include with 'extern "C++" {}'
52 | // or compiler give many errors like this:
53 | // error C2733: second C linkage of overloaded function 'wmemchr' not allowed
54 | #ifdef __cplusplus
55 | extern "C" {
56 | #endif
57 | # include <wchar.h>
58 | #ifdef __cplusplus
59 | }
60 | #endif
61 |
62 | // Define _W64 macros to mark types changing their size, like intptr_t.
63 | #ifndef _W64
64 | # if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300
65 | # define _W64 __w64
66 | # else
67 | # define _W64
68 | # endif
69 | #endif
70 |
71 |
72 | // 7.18.1 Integer types
73 |
74 | // 7.18.1.1 Exact-width integer types
75 |
76 | // Visual Studio 6 and Embedded Visual C++ 4 doesn't
77 | // realize that, e.g. char has the same size as __int8
78 | // so we give up on __intX for them.
79 | #if (_MSC_VER < 1300)
80 | typedef signed char int8_t;
81 | typedef signed short int16_t;
82 | typedef signed int int32_t;
83 | typedef unsigned char uint8_t;
84 | typedef unsigned short uint16_t;
85 | typedef unsigned int uint32_t;
86 | #else
87 | typedef signed __int8 int8_t;
88 | typedef signed __int16 int16_t;
89 | typedef signed __int32 int32_t;
90 | typedef unsigned __int8 uint8_t;
91 | typedef unsigned __int16 uint16_t;
92 | typedef unsigned __int32 uint32_t;
93 | #endif
94 | typedef signed __int64 int64_t;
95 | typedef unsigned __int64 uint64_t;
96 |
97 |
98 | // 7.18.1.2 Minimum-width integer types
99 | typedef int8_t int_least8_t;
100 | typedef int16_t int_least16_t;
101 | typedef int32_t int_least32_t;
102 | typedef int64_t int_least64_t;
103 | typedef uint8_t uint_least8_t;
104 | typedef uint16_t uint_least16_t;
105 | typedef uint32_t uint_least32_t;
106 | typedef uint64_t uint_least64_t;
107 |
108 | // 7.18.1.3 Fastest minimum-width integer types
109 | typedef int8_t int_fast8_t;
110 | typedef int16_t int_fast16_t;
111 | typedef int32_t int_fast32_t;
112 | typedef int64_t int_fast64_t;
113 | typedef uint8_t uint_fast8_t;
114 | typedef uint16_t uint_fast16_t;
115 | typedef uint32_t uint_fast32_t;
116 | typedef uint64_t uint_fast64_t;
117 |
118 | // 7.18.1.4 Integer types capable of holding object pointers
119 | #ifdef _WIN64 // [
120 | typedef signed __int64 intptr_t;
121 | typedef unsigned __int64 uintptr_t;
122 | #else // _WIN64 ][
123 | typedef _W64 signed int intptr_t;
124 | typedef _W64 unsigned int uintptr_t;
125 | #endif // _WIN64 ]
126 |
127 | // 7.18.1.5 Greatest-width integer types
128 | typedef int64_t intmax_t;
129 | typedef uint64_t uintmax_t;
130 |
131 |
132 | // 7.18.2 Limits of specified-width integer types
133 |
134 | #if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259
135 |
136 | // 7.18.2.1 Limits of exact-width integer types
137 | #define INT8_MIN ((int8_t)_I8_MIN)
138 | #define INT8_MAX _I8_MAX
139 | #define INT16_MIN ((int16_t)_I16_MIN)
140 | #define INT16_MAX _I16_MAX
141 | #define INT32_MIN ((int32_t)_I32_MIN)
142 | #define INT32_MAX _I32_MAX
143 | #define INT64_MIN ((int64_t)_I64_MIN)
144 | #define INT64_MAX _I64_MAX
145 | #define UINT8_MAX _UI8_MAX
146 | #define UINT16_MAX _UI16_MAX
147 | #define UINT32_MAX _UI32_MAX
148 | #define UINT64_MAX _UI64_MAX
149 |
150 | // 7.18.2.2 Limits of minimum-width integer types
151 | #define INT_LEAST8_MIN INT8_MIN
152 | #define INT_LEAST8_MAX INT8_MAX
153 | #define INT_LEAST16_MIN INT16_MIN
154 | #define INT_LEAST16_MAX INT16_MAX
155 | #define INT_LEAST32_MIN INT32_MIN
156 | #define INT_LEAST32_MAX INT32_MAX
157 | #define INT_LEAST64_MIN INT64_MIN
158 | #define INT_LEAST64_MAX INT64_MAX
159 | #define UINT_LEAST8_MAX UINT8_MAX
160 | #define UINT_LEAST16_MAX UINT16_MAX
161 | #define UINT_LEAST32_MAX UINT32_MAX
162 | #define UINT_LEAST64_MAX UINT64_MAX
163 |
164 | // 7.18.2.3 Limits of fastest minimum-width integer types
165 | #define INT_FAST8_MIN INT8_MIN
166 | #define INT_FAST8_MAX INT8_MAX
167 | #define INT_FAST16_MIN INT16_MIN
168 | #define INT_FAST16_MAX INT16_MAX
169 | #define INT_FAST32_MIN INT32_MIN
170 | #define INT_FAST32_MAX INT32_MAX
171 | #define INT_FAST64_MIN INT64_MIN
172 | #define INT_FAST64_MAX INT64_MAX
173 | #define UINT_FAST8_MAX UINT8_MAX
174 | #define UINT_FAST16_MAX UINT16_MAX
175 | #define UINT_FAST32_MAX UINT32_MAX
176 | #define UINT_FAST64_MAX UINT64_MAX
177 |
178 | // 7.18.2.4 Limits of integer types capable of holding object pointers
179 | #ifdef _WIN64 // [
180 | # define INTPTR_MIN INT64_MIN
181 | # define INTPTR_MAX INT64_MAX
182 | # define UINTPTR_MAX UINT64_MAX
183 | #else // _WIN64 ][
184 | # define INTPTR_MIN INT32_MIN
185 | # define INTPTR_MAX INT32_MAX
186 | # define UINTPTR_MAX UINT32_MAX
187 | #endif // _WIN64 ]
188 |
189 | // 7.18.2.5 Limits of greatest-width integer types
190 | #define INTMAX_MIN INT64_MIN
191 | #define INTMAX_MAX INT64_MAX
192 | #define UINTMAX_MAX UINT64_MAX
193 |
194 | // 7.18.3 Limits of other integer types
195 |
196 | #ifdef _WIN64 // [
197 | # define PTRDIFF_MIN _I64_MIN
198 | # define PTRDIFF_MAX _I64_MAX
199 | #else // _WIN64 ][
200 | # define PTRDIFF_MIN _I32_MIN
201 | # define PTRDIFF_MAX _I32_MAX
202 | #endif // _WIN64 ]
203 |
204 | #define SIG_ATOMIC_MIN INT_MIN
205 | #define SIG_ATOMIC_MAX INT_MAX
206 |
207 | #ifndef SIZE_MAX // [
208 | # ifdef _WIN64 // [
209 | # define SIZE_MAX _UI64_MAX
210 | # else // _WIN64 ][
211 | # define SIZE_MAX _UI32_MAX
212 | # endif // _WIN64 ]
213 | #endif // SIZE_MAX ]
214 |
215 | // WCHAR_MIN and WCHAR_MAX are also defined in <wchar.h>
216 | #ifndef WCHAR_MIN // [
217 | # define WCHAR_MIN 0
218 | #endif // WCHAR_MIN ]
219 | #ifndef WCHAR_MAX // [
220 | # define WCHAR_MAX _UI16_MAX
221 | #endif // WCHAR_MAX ]
222 |
223 | #define WINT_MIN 0
224 | #define WINT_MAX _UI16_MAX
225 |
226 | #endif // __STDC_LIMIT_MACROS ]
227 |
228 |
229 | // 7.18.4 Macros for integer constants
230 |
231 | #if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260
232 |
233 | // 7.18.4.1 Macros for minimum-width integer constants
234 |
235 | #define INT8_C(val) val##i8
236 | #define INT16_C(val) val##i16
237 | #define INT32_C(val) val##i32
238 | #define INT64_C(val) val##i64
239 |
240 | #define UINT8_C(val) val##ui8
241 | #define UINT16_C(val) val##ui16
242 | #define UINT32_C(val) val##ui32
243 | #define UINT64_C(val) val##ui64
244 |
245 | // 7.18.4.2 Macros for greatest-width integer constants
246 | // These #ifndef's are needed to prevent collisions with <stdint.h>.
247 | // Check out Issue 9 for the details.
248 | #ifndef INTMAX_C // [
249 | # define INTMAX_C INT64_C
250 | #endif // INTMAX_C ]
251 | #ifndef UINTMAX_C // [
252 | # define UINTMAX_C UINT64_C
253 | #endif // UINTMAX_C ]
254 |
255 | #endif // __STDC_CONSTANT_MACROS ]
256 |
257 | #endif // _MSC_VER >= 1600 ]
258 |
259 | #endif // _MSC_STDINT_H_ ]
260 |
--------------------------------------------------------------------------------
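The two vs/ headers above exist so that the C99-style sources build under older MSVC toolchains that ship neither `<stdint.h>` nor `<inttypes.h>`; GCC, Clang and MSVC 2010+ simply use the system headers. The sketch below is not part of the repository; it assumes the vs/ directory is on the MSVC include path (apparently the job of makefile.vs) and only illustrates what the PRI*/SCN* macros and imaxdiv defined above provide: format strings that expand to MSVC's "I32"/"I64" length modifiers instead of C99's "ll", so one source line stays correct on every compiler.

```c
// Hypothetical demo (not in the repository): uses the shim headers above on
// old MSVC, or the real C99 headers on any other compiler.
#include <stdio.h>
#include <inttypes.h>   // the vs/ shim on old MSVC; it pulls in stdint.h

int main(void) {
    uint32_t  u = 0xDEADBEEFu;
    int64_t   i = INT64_C(-1234567890123); // suffixes the constant i64 / LL
    int32_t   v = 0;
    imaxdiv_t d;                           // declared up front: C89-friendly

    // PRI* macros expand to "I32u"/"I64d" on MSVC, "u"/"lld" on C99 libcs.
    printf("u = %" PRIu32 " (0x%" PRIX32 ")\n", u, u);
    printf("i = %" PRId64 "\n", i);

    // SCN* macros do the same for the scanf family.
    sscanf("-42", "%" SCNd32, &v);

    // imaxdiv: quotient and remainder of the widest integer type in one call.
    d = imaxdiv((intmax_t)i, 1000);
    printf("v = %" PRId32 ", quot = %" PRIdMAX ", rem = %" PRIdMAX "\n",
           v, d.quot, d.rem);
    return 0;
}
```

Recent MSVC runtimes accept the C99 length modifiers directly, but the macro approach keeps a single source working across all of the compilers the makefiles target.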