├── .travis.yml
├── README.md
├── bitutil.c
├── bitutil.h
├── conf.h
├── makefile
├── makefile.vs
├── sse_neon.h
├── time_.h
├── tpbench.c
├── transpose.c
├── transpose.h
└── vs
├── getopt.c
├── getopt.h
├── inttypes.h
└── stdint.h
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: c
2 |
3 | compiler:
4 | - gcc
5 | - clang
6 |
7 | branches:
8 | only:
9 | - master
10 |
11 | script:
12 | - make
13 |
14 | matrix:
15 | include:
16 | - name: Linux arm
17 | os: linux
18 | arch: arm64
19 | compiler: gcc
20 |
21 | - name: Windows-MinGW
22 | os: windows
23 | script:
24 | - mingw32-make
25 |
26 | - name: macOS, xcode
27 | os: osx
28 |
29 | # - name: Linux amd64
30 | # os: linux
31 | # arch: amd64
32 | # - name: Power ppc64le
33 | # os: linux-ppc64le
34 | # compiler: gcc
35 |
36 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | Integer + Floating Point Compression Filter [![Build Status](https://travis-ci.org/powturbo/TurboTranspose.svg?branch=master)](https://travis-ci.org/powturbo/TurboTranspose)
2 | ======================================
3 | * **Fastest transpose/shuffle**
4 | * :new: (2019.11) **ALL** TurboTranspose functions now available on **64-bit ARMv8**, including **NEON** SIMD.
5 | * **Byte/Nibble** transpose/shuffle for improving compression of binary data (e.g. floating-point data)
6 | * :sparkles: **Scalar/SIMD** Transpose/Shuffle 8,16,32,64,... bits
7 | * :+1: Dynamic CPU detection and **JIT scalar/sse/avx2** switching
8 | * 100% C (C++ headers), usage as simple as memcpy
9 | * **Byte Transpose**
10 | * **Fastest** byte transpose
11 | * :new: (2019.11) **2D,3D,4D** transpose
12 | * **Nibble Transpose**
13 | * nearly as fast as byte transpose
14 | * more efficient, up to **10 times!** faster than [Bitshuffle](#bitshuffle)
15 | * :new: better compression (w/ lz77) and **10 times!** faster than one of the best floating-point compressors [SPDP](#spdp)
16 | * can compress/decompress (w/ lz77) better and faster than other domain specific floating point compressors
17 | * Scalar and SIMD **Transform**
18 | * **Delta** encoding for sorted lists
19 | * **Zigzag** encoding for unsorted lists
20 | * **Xor** encoding
21 | * :new: **lossy** floating point compression with user-defined error
22 |
23 | ### Transpose Benchmark:
24 | - Benchmark Intel CPU: Skylake i7-6700 3.4GHz gcc 9.2 **single** thread
25 | - Benchmark ARM: ARMv8 A73-ODROID-N2 1.8GHz
26 |
27 | #### - Speed test
28 | ##### Benchmark w/ 16k buffer
29 |
30 | **BOLD** = Pareto frontier.
31 | E:Encode, D:Decode
32 |
33 | ./tpbench -s# file -B16K (# = 8,4,2)
34 | |E cycles/byte|D cycles/byte|Transpose 64 bits **AVX2**|
35 | |------:|------:|-----------------------------------|
36 | |.199|**.134**|**TurboTranspose Byte**|
37 | |.326|.201|Blosc byteshuffle|
38 | |**.394**|**.260**|**TurboTranspose Nibble**|
39 | |.848|.478|Bitshuffle 8|
40 |
41 | |E cycles/byte|D cycles/byte|Transpose 32 bits **AVX2**|
42 | |------:|------:|-----------------------------------|
43 | |**.121**|**.102**|**TurboTranspose Byte**|
44 | |.451|.139|Blosc byteshuffle|
45 | |**.345**|**.229**|**TurboTranspose Nibble**|
46 | |.773|.476|Bitshuffle|
47 |
48 | |E cycles/byte|D cycles/byte|Transpose 16 bits **AVX2**|
49 | |------:|------:|-----------------------------------|
50 | |**.095**|**.071**|**TurboTranspose Byte**|
51 | |.640|.108|Blosc byteshuffle|
52 | |**.329**|**.198**|**TurboTranspose Nibble**|
53 | |.758|1.177|Bitshuffle 2|
54 | |**.067**|**.067**|memcpy|
55 |
56 | |E MB/s| D MB/s| 16 bits **ARM** 2019.11|
57 | |--------:|---------:|-----------------------------------|
58 | |**8192**|**16384**|**TurboTranspose Byte**|
59 | | 8192| 8192| blosc byteshuffle |
60 | | **1638**| **2341**|**TurboTranspose Nibble**|
61 | | 356| 287| blosc bitshuffle|
62 | | 16384| 16384| memcpy |
63 |
64 | | E MB/s | D MB/s| 32 bits **ARM** 2019.11|
65 | |--------:|---------:|-----------------------------------|
66 | |**8192**|**8192**|**TurboTranspose Byte**|
67 | | 8192| 8192| blosc byteshuffle|
68 | |**1820**|**2341**|**TurboTranspose Nibble**|
69 | | 372| 252| blosc bitshuffle|
70 |
71 | | E MB/s | D MB/s| 64 bits **ARM** 2019.11|
72 | |--------:|---------:|-----------------------------------|
73 | | 4096| **8192**|**TurboTranspose Byte**|
74 | |**5461**| 5461|**blosc byteshuffle**|
75 | |**1490**|**1490**|**TurboTranspose Nibble**|
76 | | 372| 260| blosc bitshuffle|
77 |
78 | #### Transpose/Shuffle benchmark w/ **large** files (100MB).
79 |
80 | MB/s: 1,000,000 bytes/second
81 |
82 | ./tpbench -s# file (# = 8,4,2)
83 | |E MB/s|D MB/s|Transpose 16 bits **AVX2** 2019.11|
84 | |------:|------:|-----------------------------------|
85 | |**9208**|**9795**|**TurboTranspose Byte**|
86 | |8382|7689|Blosc byteshuffle|
87 | |**9377**|**9584**|**TurboTranspose Nibble**|
88 | |2750|2530|Blosc bitshuffle|
89 | |13725|13900|memcpy|
90 |
91 | |E MB/s|D MB/s|Transpose 32 bits **AVX2** 2019.11|
92 | |------:|------:|-----------------------------------|
93 | |**9718**|**9713**|**TurboTranspose Byte**|
94 | |9181|9030|Blosc byteshuffle|
95 | |**8750**|**9472**|**TurboTranspose Nibble**|
96 | |2767|2942|Blosc bitshuffle 4|
97 |
98 | |E MB/s|D MB/s|Transpose 64 bits **AVX2** 2019.11|
99 | |------:|------:|-----------------------------------|
100 | |**8998**|**9573**|**TurboTranspose Byte**|
101 | |8721|8586|Blosc byteshuffle 2|
102 | |**8252**|**9222**|**TurboTranspose Nibble**|
103 | |2711|2053|Blosc bitshuffle 2|
104 |
105 | ----------------------------------------------------------
106 | | E MB/s | D MB/s| 16 bits ARM 2019.11|
107 | |--------:|---------:|-----------------------------------|
108 | |**872**|**3998**|**TurboTranspose Byte**|
109 | | 678| 3852| blosc byteshuffle|
110 | |**1365**|**2195**|**TurboTranspose Nibble**|
111 | | 357| 280| blosc bitshuffle|
112 | | 3921| 3913| memcpy|
113 |
114 | | E MB/s | D MB/s| 32 bits ARM 2019.11|
115 | |--------:|---------:|-----------------------------------|
116 | |**1828**|**3768**|**TurboTranspose Byte**|
117 | |1769|3713|blosc byteshuffle|
118 | |**1456**|**2299**|**TurboTranspose Nibble**|
119 | | 374 | 243| blosc bitshuffle|
120 |
121 | | E MB/s | D MB/s| 64 bits ARM 2019.11|
122 | |--------:|---------:|-----------------------------------|
123 | |**1793**|**3572**|**TurboTranspose Byte**|
124 | |1784| 3544|**blosc byteshuffle**|
125 | |**1176**|**1267**|**TurboTranspose Nibble**|
126 | | 331 | 203| blosc bitshuffle|
127 |
128 | #### - Compression test (transpose/shuffle+lz4)
129 | :new: Download [IcApp](https://sites.google.com/site/powturbo/downloads), a new benchmark for [TurboPFor](https://github.com/powturbo/TurboPFor)+TurboTranspose,
130 | for testing almost all integer and floating-point file types.
131 | Note: the lossy compression benchmark is available only with icapp.
132 |
133 | - [Scientific IEEE 754 32-Bit Single-Precision Floating-Point Datasets](http://cs.txstate.edu/~burtscher/research/datasets/FPsingle/)
134 |
135 | ###### - Speed test (file msg_sweep3d)
136 |
137 | |C size |ratio %|C MB/s |D MB/s|Name AVX2|
138 | |---------:|------:|------:|-----:|:--------------|
139 | 11,348,554 |18.1|**2276**|**4425**|**TurboTranspose Nibble+lz**|
140 | 22,489,691 |35.8| 1670|3881|TurboTranspose Byte+lz |
141 | 43,471,376 |69.2| 348| 402|SPDP |
142 | 44,626,407 |71.0| 1065|2101|bitshuffle+lz|
143 | 62,865,612 |100.0|13300|13300|memcpy|
144 |
145 | ./tpbench -s4 -z *.sp
146 |
147 | |File |File size|lz %|Tp8lz|Tp4lz|[BS](#bitshuffle)lz|[spdp1](#spdp)||[spdp9](#spdp)|Tp4lzt|eTp4lzt|
148 | |:---------|--------:|----:|------:|--------:|-------:|-----:|-|-------:|-------:|----:|
149 | msg_bt |133194716| 94.3|70.4|**66.4**|73.9 | 70.0|` `|67.4|**54.7**|*32.4*|
150 | msg_lu | 97059484|100.4|77.1 |**70.4**|75.4 | 76.8|` `|74.0|**61.0**|*42.2*|
151 | msg_sppm |139497932| 11.7|**11.6**|12.6 |15.4 | 14.4|` `|13.7|**9.0**|*5.6*|
152 | msg_sp |145052928|100.3|68.8 |**63.7**|68.1 | 67.9|` `|65.3|**52.6**|*24.9*|
153 | msg_sweep3d| 62865612| 98.7|35.8 |**18.1**|71.0 | 69.6|` `|13.7|**9.8**|*3.8*|
154 | num_brain | 70920000|100.4|76.5 |**71.1**|77.4 | 79.1|` `|73.9|**63.4**|*32.6*|
155 | num_comet | 53673984| 92.4|79.0 |**77.6**|82.1 | 84.5|` `|84.6|**70.1**|*41.7*|
156 | num_control| 79752372| 99.4|89.5 |90.7 |**88.1** | 98.3|` `|98.5|**81.4**|*51.2*|
157 | num_plasma | 17544800|100.4| 0.7 |**0.7** |75.5 | 30.7|` `|2.9|**0.3**|*0.2*|
158 | obs_error | 31080408| 89.2|73.1 |**70.0**|76.9 | 78.3|` `|49.4|**20.5**|*12.2*|
159 | obs_info | 9465264| 93.6|70.2 |**61.9**|72.9 | 62.4|` `|43.8|**27.3**|*15.1*|
160 | obs_spitzer| 99090432| 98.3|**90.4** |95.6 |93.6 |100.1|` `|100.7|**80.2**|*52.3*|
161 | obs_temp | 19967136|100.4|**89.5**|92.4 |91.0 | 99.4|` `|100.1|**84.0**|*55.8*|
162 |
163 | Tp8 = byte transpose, Tp4 = nibble transpose, lz = lz4
164 | eTp4lzt = lossy compression with lzturbo and allowed error = 0.0001 (1e-4)
165 | *Slow but best compression:* SPDP9 and [lzt = lzturbo,39](https://github.com/powturbo/TurboBench)
166 |
167 | - [Scientific IEEE 754 64-Bit Double-Precision Floating-Point Datasets](http://cs.txstate.edu/~burtscher/research/datasets/FPdouble/)
168 |
169 | ./tpbench -s8 -z *.trace
170 |
171 | |File |File size |lz %|Tp8lz|Tp4lz|[BS](#bitshuffle)lz|[spdp1](#spdp)||[spdp9](#spdp)|Tp4lzt|eTp4lzt|
172 | |:---------|----------:|----:|------:|--------:|-------:|-----:|-|-------:|-------:|----:|
173 | msg_bt |266389432|94.5|77.2|**76.5**|81.6| 77.9|` `|75.4|**69.9**|*16.0*|
174 | msg_lu |194118968|100.4|82.7|**81.0**|83.7|83.3|` `|79.6|**75.5**|*21.0*|
175 | msg_sppm |278995864|18.9|**14.5**|14.9|19.5| 21.5|` `|19.8|**11.2**|*2.8*|
176 | msg_sp |290105856|100.4|79.2|**77.5**|80.2|78.8|` `|77.1|**71.3**|*12.4*|
177 | msg_sweep3d|125731224|98.7|50.7|**36.7**|80.4| 76.2|` `|33.2|**27.3**|*1.9*|
178 | num_brain |141840000|100.4|82.6|**81.1**|84.5|87.8|` `|83.3|**77.0**|*16.3*|
179 | num_comet |107347968|92.8|83.3|78.8|**76.3**| 86.5|` `|86.0|**69.8**|*21.2*|
180 | num_control|159504744|99.6|92.2|90.9|**89.4**| 97.6|` `|98.9|**85.5**|*25.8*|
181 | num_plasma | 35089600|75.2|0.7|**0.7**|84.5| 77.3|` `|3.0|**0.3**|*0.1*|
182 | obs_error | 62160816|78.7|81.0|**77.5**|84.4| 87.9|` `|62.3|**23.4**|*6.3*|
183 | obs_info | 18930528|92.3|75.4|**70.6**|82.4| 81.7|` `|51.2|**33.1**|*7.7*|
184 | obs_spitzer|198180864|95.4|93.2|93.7|**86.4**|100.1|` `|102.4|**78.0**|*26.9*|
185 | obs_temp | 39934272|100.4|93.1|93.8|**91.7**|98.0|` `|97.4|**88.2**|*28.8*|
186 |
187 | eTp4lzt = lossy compression with allowed error = 0.0001
188 |
189 | ### Compile:
190 |
191 | git clone git://github.com/powturbo/TurboTranspose.git
192 | cd TurboTranspose
193 |
194 | ##### Linux + Windows MinGW
195 |
196 | make
197 | or
198 | make AVX2=1
199 |
200 | ##### Windows Visual C++
201 |
202 | nmake /f makefile.vs
203 | or
204 | nmake AVX2=1 /f makefile.vs
205 |
206 |
207 | + benchmark with other libraries
208 | download or clone [bitshuffle](https://github.com/kiyo-masui/bitshuffle) or [blosc](https://github.com/Blosc/c-blosc) and type
209 |
210 | make AVX2=1 BLOSC=1
211 | or
212 | make AVX2=1 BITSHUFFLE=1
213 |
214 | ### Testing:
215 | + benchmark "transpose" functions
216 |
217 | ./tpbench [-s#] [-z] file
218 | -s# = element size, # = 2,4,8,16,... (default 4)
219 | -z  = lz77 compression benchmark only (bitshuffle package required)
220 |
221 |
222 | ### Function usage:
223 |
224 | **Byte transpose:**
225 | >**void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
226 | void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize)**
227 | in : input buffer
228 | n : number of bytes
229 | out : output buffer
230 | esize : element size in bytes (2,4,8,...)
231 |
232 |
233 | **Nibble transpose:**
234 | >**void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
235 | void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize)**
236 | in : input buffer
237 | n : number of bytes
238 | out : output buffer
239 | esize : element size in bytes (2,4,8,...)
240 |
241 | ### Environment:
242 |
243 | ###### OS/Compiler (64-bit):
244 | - Linux: GNU GCC (>=4.6)
245 | - Linux: Clang (>=3.2)
246 | - Windows: MinGW-w64 makefile
247 | - Windows: Visual C++ (>=VS2008) - makefile.vs (for nmake)
248 | - Windows: Visual Studio project file - vs/vs2017 - Thanks to [PavelP](https://github.com/pps83)
249 | - Linux ARM: 64-bit AArch64 (ARMv8): gcc (>=6.3)
250 | - Linux ARM: 64-bit AArch64 (ARMv8): clang
251 |
252 | ###### Multithreading:
253 | - All TurboTranspose functions are thread-safe
254 |
255 | ### References:
256 | - [BS - Bitshuffle: Filter for improving compression of typed binary data.](https://github.com/kiyo-masui/bitshuffle)
257 | :green_book:[ A compression scheme for radio data in high performance computing](https://arxiv.org/abs/1503.00638)
258 | - [Blosc: A blocking, shuffling and lossless compression library](https://github.com/Blosc/c-blosc)
259 | - [SPDP is a compression/decompression algorithm for binary IEEE 754 32/64 bits floating-point data](http://cs.txstate.edu/~burtscher/research/SPDPcompressor/)
260 | :green_book:[ SPDP - An Automatically Synthesized Lossless Compression Algorithm for Floating-Point Data](http://cs.txstate.edu/~mb92/papers/dcc18.pdf) + [DCC 2018](http://www.cs.brandeis.edu//~dcc/Programs/Program2018.pdf)
261 | - :green_book:[ FPC: A High-Speed Compressor for Double-Precision Floating-Point Data](http://www.cs.txstate.edu/~burtscher/papers/tc09.pdf)
262 |
263 | Last update: 25 Oct 2019
264 |
--------------------------------------------------------------------------------
/bitutil.c:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // "Integer Compression" utility - delta, for, zigzag / Floating point compression
25 | #include "conf.h"
26 | #define BITUTIL_IN
27 | #include "bitutil.h"
28 |
29 | //------ 'or' of values for bit size + 'xor' with first value to detect constant (all-duplicate) blocks ------
30 | #define BT(_i_) { o |= ip[_i_]; x |= ip[_i_] ^ u0; }
31 | #define BIT(_in_, _n_, _usize_) {\
32 | u0 = _in_[0]; o = x = 0;\
33 | for(ip = _in_; ip != _in_+(_n_&~(4-1)); ip += 4) { BT(0); BT(1); BT(2); BT(3); }\
34 | for(;ip != _in_+_n_; ip++) BT(0);\
35 | }
36 |
37 | uint8_t bit8( uint8_t *in, unsigned n, uint8_t *px) { uint8_t o,x,u0,*ip; BIT(in, n, 8); if(px) *px = x; return o; }
38 | uint64_t bit64(uint64_t *in, unsigned n, uint64_t *px) { uint64_t o,x,u0,*ip; BIT(in, n, 64); if(px) *px = x; return o; }
39 |
40 | uint16_t bit16(uint16_t *in, unsigned n, uint16_t *px) {
41 | uint16_t o, x, u0 = in[0], *ip;
42 | #if defined(__SSE2__) || defined(__ARM_NEON)
43 | __m128i vb0 = _mm_set1_epi16(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
44 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128();
45 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
46 | __m128i v0 = _mm_loadu_si128((__m128i *) ip);
47 | __m128i v1 = _mm_loadu_si128((__m128i *)(ip+8));
48 | vo0 = _mm_or_si128( vo0, v0);
49 | vo1 = _mm_or_si128( vo1, v1);
50 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
51 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
52 | }
53 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0);
54 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0);
55 | #else
56 | ip = in; o = x = 0; //BIT( in, n, 16);
57 | #endif
58 | for(; ip != in+n; ip++) BT(0);
59 | if(px) *px = x;
60 | return o;
61 | }
62 |
63 | uint32_t bit32(uint32_t *in, unsigned n, uint32_t *px) {
64 | uint32_t o,x,u0 = in[0], *ip;
65 | #if defined(__AVX2__) && defined(USE_AVX2)
66 | __m256i vb0 = _mm256_set1_epi32(*in), vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(),
67 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256();
68 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
69 | __m256i v0 = _mm256_loadu_si256((__m256i *) ip);
70 | __m256i v1 = _mm256_loadu_si256((__m256i *)(ip+8));
71 | vo0 = _mm256_or_si256(vo0, v0);
72 | vo1 = _mm256_or_si256(vo1, v1);
73 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0));
74 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0));
75 | }
76 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0);
77 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0);
78 | #elif defined(__SSE2__) || defined(__ARM_NEON)
79 | __m128i vb0 = _mm_set1_epi32(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
80 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128();
81 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0);
82 | __m128i v0 = _mm_loadu_si128((__m128i *) ip);
83 | __m128i v1 = _mm_loadu_si128((__m128i *)(ip+4));
84 | vo0 = _mm_or_si128(vo0, v0);
85 | vo1 = _mm_or_si128(vo1, v1);
86 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
87 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
88 | }
89 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0);
90 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0);
91 | #else
92 | ip = in; o = x = 0; //BIT( in, n, 32);
93 | #endif
94 | for(; ip != in+n; ip++) BT(0);
95 | if(px) *px = x;
96 | return o;
97 | }
98 |
99 | //----------------------------------------------------------- Delta ----------------------------------------------------------------
100 | #define DE(_ip_,_i_) u = (_ip_[_i_]-start)-_md; start = _ip_[_i_];
101 | #define BITDE(_t_, _in_, _n_, _md_, _act_) { _t_ _md = _md_, *_ip; o = x = 0;\
102 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { DE(_ip,0);_act_; DE(_ip,1);_act_; DE(_ip,2);_act_; DE(_ip,3);_act_; }\
103 | for(;_ip != _in_+_n_;_ip++) { DE(_ip,0); _act_; }\
104 | }
105 | //---- (min. Delta = 0)
106 | //-- delta encoding
107 | uint8_t bitd8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t u, u0 = in[0]-start, o, x; BITDE(uint8_t, in, n, 0, o |= u; x |= u^u0); if(px) *px = x; return o; }
108 | uint64_t bitd64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t u, u0 = in[0]-start, o, x; BITDE(uint64_t, in, n, 0, o |= u; x |= u^u0); if(px) *px = x; return o; }
109 |
110 | uint16_t bitd16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) {
111 | uint16_t o, x, *ip, u0 = in[0]-start;
112 | #if defined(__SSE2__) || defined(__ARM_NEON)
113 | __m128i vb0 = _mm_set1_epi16(u0),
114 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
115 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi16(start);
116 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
117 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip);
118 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+8)); __m128i v0 = mm_delta_epi16(vi0,vs); vs = vi0;
119 | __m128i v1 = mm_delta_epi16(vi1,vs); vs = vi1;
120 | vo0 = _mm_or_si128(vo0, v0);
121 | vo1 = _mm_or_si128(vo1, v1);
122 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
123 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
124 | } start = _mm_cvtsi128_si16(_mm_srli_si128(vs,14));
125 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0);
126 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0);
127 | #else
128 | ip = in; o = x = 0;
129 | #endif
130 | for(;ip != in+n; ip++) {
131 | uint16_t u = *ip - start; start = *ip;
132 | o |= u;
133 | x |= u ^ u0;
134 | }
135 | if(px) *px = x;
136 | return o;
137 | }
138 |
139 | uint32_t bitd32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) {
140 | uint32_t o, x, *ip, u0 = in[0] - start;
141 | #if defined(__AVX2__) && defined(USE_AVX2)
142 | __m256i vb0 = _mm256_set1_epi32(u0),
143 | vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(),
144 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start);
145 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
146 | __m256i vi0 = _mm256_loadu_si256((__m256i *) ip);
147 | __m256i vi1 = _mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = mm256_delta_epi32(vi0,vs); vs = vi0;
148 | __m256i v1 = mm256_delta_epi32(vi1,vs); vs = vi1;
149 | vo0 = _mm256_or_si256(vo0, v0);
150 | vo1 = _mm256_or_si256(vo1, v1);
151 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0));
152 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0));
153 | } start = (unsigned)_mm256_extract_epi32(vs, 7);
154 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0);
155 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0);
156 | #elif defined(__SSE2__) || defined(__ARM_NEON)
157 | __m128i vb0 = _mm_set1_epi32(u0),
158 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
159 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start);
160 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0);
161 | __m128i vi0 = _mm_loadu_si128((__m128i *)ip);
162 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = mm_delta_epi32(vi0,vs); vs = vi0;
163 | __m128i v1 = mm_delta_epi32(vi1,vs); vs = vi1;
164 | vo0 = _mm_or_si128(vo0, v0);
165 | vo1 = _mm_or_si128(vo1, v1);
166 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
167 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
168 | } start = _mm_cvtsi128_si32(_mm_srli_si128(vs,12));
169 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0);
170 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0);
171 | #else
172 | ip = in; o = x = 0;
173 | #endif
174 | for(;ip != in+n; ip++) {
175 | uint32_t u = *ip - start; start = *ip;
176 | o |= u;
177 | x |= u ^ u0;
178 | }
179 | if(px) *px = x;
180 | return o;
181 | }
182 |
183 | //----- Undelta: In-place prefix sum (min. Delta = 0) -------------------
184 | #define DD(i) _ip[i] = (start += _ip[i] + _md);
185 | #define BITDD(_t_, _in_, _n_, _md_) { _t_ *_ip; const _t_ _md = _md_;\
186 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { DD(0); DD(1); DD(2); DD(3); }\
187 | for(;_ip != _in_+_n_; _ip++) DD(0);\
188 | }
189 |
190 | void bitddec8( uint8_t *p, unsigned n, uint8_t start) { BITDD(uint8_t, p, n, 0); }
191 | void bitddec16(uint16_t *p, unsigned n, uint16_t start) { BITDD(uint16_t, p, n, 0); }
192 | void bitddec64(uint64_t *p, unsigned n, uint64_t start) { BITDD(uint64_t, p, n, 0); }
193 | void bitddec32(uint32_t *p, unsigned n, unsigned start) {
194 | #if defined(__AVX2__) && defined(USE_AVX2)
195 | __m256i vs = _mm256_set1_epi32(start);
196 | unsigned *ip;
197 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) {
198 | __m256i v = _mm256_loadu_si256((__m256i *)ip);
199 | vs = mm256_scan_epi32(v,vs);
200 | _mm256_storeu_si256((__m256i *)ip, vs);
201 | }
202 | start = (unsigned)_mm256_extract_epi32(vs, 7);
203 | while(ip != p+n) {
204 | *ip = (start += (*ip));
205 | ip++;
206 | }
207 | #elif defined(__SSE2__) || defined(__ARM_NEON)
208 | __m128i vs = _mm_set1_epi32(start);
209 | unsigned *ip;
210 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) {
211 | __m128i v = _mm_loadu_si128((__m128i *)ip);
212 | vs = mm_scan_epi32(v, vs);
213 | _mm_storeu_si128((__m128i *)ip, vs);
214 | }
215 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
216 | while(ip != p+n) {
217 | *ip = (start += (*ip));
218 | ip++;
219 | }
220 | #else
221 | BITDD(uint32_t, p, n, 0);
222 | #endif
223 | }
224 |
225 | //----------- Zigzag of Delta --------------------------
226 | #define ZDE(i, _usize_) d = (_ip[i]-start)-_md; u = TEMPLATE2(zigzagenc, _usize_)(d - startd); startd = d; start = _ip[i]
227 | #define BITZDE(_t_, _in_, _n_, _md_, _usize_, _act_) { _t_ *_ip, _md = _md_;\
228 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZDE(0, _usize_);_act_; ZDE(1, _usize_);_act_; ZDE(2, _usize_);_act_; ZDE(3, _usize_);_act_; }\
229 | for(;_ip != _in_+_n_;_ip++) { ZDE(0, _usize_); _act_; }\
230 | }
231 |
232 | uint8_t bitzz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t o=0, x=0,d,startd=0,u; BITZDE(uint8_t, in, n, 1, 8, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
233 | uint16_t bitzz16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t o=0, x=0,d,startd=0,u; BITZDE(uint16_t, in, n, 1, 16, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
234 | uint32_t bitzz32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { uint32_t o=0, x=0,d,startd=0,u; BITZDE(uint32_t, in, n, 1, 32, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
235 | uint64_t bitzz64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o=0, x=0,d,startd=0,u; BITZDE(uint64_t, in, n, 1, 64, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
236 | uint8_t bitzzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o=0,*op = out,u,d,startd=0; BITZDE(uint8_t, in, n, mindelta, 8,o |= u;*op++ = u); return o;}
237 | uint16_t bitzzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o=0,*op = out,u,d,startd=0; BITZDE(uint16_t, in, n, mindelta, 16,o |= u;*op++ = u); return o;}
238 | uint32_t bitzzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) { uint32_t o=0,*op = out,u,d,startd=0; BITZDE(uint32_t, in, n, mindelta, 32,o |= u;*op++ = u); return o;}
239 | uint64_t bitzzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o=0,*op = out,u,d,startd=0; BITZDE(uint64_t, in, n, mindelta, 64,o |= u;*op++ = u); return o;}
240 |
241 | #define ZDD(i) u = _ip[i]; d = u - start; _ip[i] = zigzagdec64(u)+(int64_t)startd+_md; startd = d; start = u
242 | #define BITZDD(_t_, _in_, _n_, _md_) { _t_ *_ip, startd=0,d,u; const _t_ _md = _md_;\
243 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZDD(0); ZDD(1); ZDD(2); ZDD(3); }\
244 | for(;_ip != _in_+_n_; _ip++) ZDD(0);\
245 | }
246 | void bitzzdec8( uint8_t *p, unsigned n, uint8_t start) { BITZDD(uint8_t, p, n, 1); }
247 | void bitzzdec16(uint16_t *p, unsigned n, uint16_t start) { BITZDD(uint16_t, p, n, 1); }
248 | void bitzzdec64(uint64_t *p, unsigned n, uint64_t start) { BITZDD(uint64_t, p, n, 1); }
249 | void bitzzdec32(uint32_t *p, unsigned n, uint32_t start) { BITZDD(uint32_t, p, n, 1); }
250 |
251 | //----- Delta (min. Delta = 1): bit size for strictly increasing lists -------------------
252 | uint8_t bitd18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t o=0,x=0,u,*ip; BITDE(uint8_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
253 | uint16_t bitd116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t o=0,x=0,u,*ip; BITDE(uint16_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
254 | uint64_t bitd164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o=0,x=0,u,*ip; BITDE(uint64_t, in, n, 1, o |= u; x |= u ^ in[0]); if(px) *px = x; return o; }
255 |
256 | uint32_t bitd132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) {
257 | uint32_t o, x, *ip, u0 = in[0]-start-1;
258 | #if defined(__AVX2__) && defined(USE_AVX2)
259 | __m256i vb0 = _mm256_set1_epi32(u0),
260 | vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(),
261 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start), cv = _mm256_set1_epi32(1);
262 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
263 | __m256i vi0 = _mm256_loadu_si256((__m256i *)ip);
264 | __m256i vi1 = _mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = _mm256_sub_epi32(mm256_delta_epi32(vi0,vs),cv); vs = vi0;
265 | __m256i v1 = _mm256_sub_epi32(mm256_delta_epi32(vi1,vs),cv); vs = vi1;
266 | vo0 = _mm256_or_si256(vo0, v0);
267 | vo1 = _mm256_or_si256(vo1, v1);
268 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0));
269 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0));
270 | } start = (unsigned)_mm256_extract_epi32(vs, 7);
271 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0);
272 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0);
273 | #elif defined(__SSE2__) || defined(__ARM_NEON)
274 | __m128i vb0 = _mm_set1_epi32(u0),
275 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
276 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(1);
277 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0);
278 | __m128i vi0 = _mm_loadu_si128((__m128i *)ip);
279 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = _mm_sub_epi32(mm_delta_epi32(vi0,vs),cv); vs = vi0;
280 | __m128i v1 = _mm_sub_epi32(mm_delta_epi32(vi1,vs),cv); vs = vi1;
281 | vo0 = _mm_or_si128(vo0, v0);
282 | vo1 = _mm_or_si128(vo1, v1);
283 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
284 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
285 | } start = _mm_cvtsi128_si32(_mm_srli_si128(vs,12));
286 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0);
287 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0);
288 | #else
289 | ip = in; o = x = 0;
290 | #endif
291 | for(;ip != in+n; ip++) {
292 | uint32_t u = ip[0] - start-1; start = *ip;
293 | o |= u;
294 | x |= u ^ u0;
295 | }
296 | if(px) *px = x;
297 | return o;
298 | }
299 |
300 | uint16_t bits128v16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) {
301 | #if defined(__SSE2__) || defined(__ARM_NEON)
302 | uint16_t *ip; unsigned b; __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi16(start), cv = _mm_set1_epi16(8);
303 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { /* 8 uint16 per 128-bit load */
304 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
305 | bv = _mm_or_si128(bv,_mm_sub_epi16(SUBI16x8(iv,vs),cv));
306 | vs = iv;
307 | }
308 | start = (unsigned short)_mm_cvtsi128_si32(_mm_srli_si128(vs,14));
309 | b = mm_hor_epi16(bv);
310 | if(px) *px = 0;
311 | return b;
312 | #else
313 | if(px) *px = 0; return 0; /* scalar fallback not implemented */
314 | #endif
315 | }
314 |
315 | unsigned bits128v32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) {
316 | #if defined(__SSE2__) || defined(__ARM_NEON)
317 | unsigned *ip,b; __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(4);
318 | for(ip = in; ip != in+(n&~(4-1)); ip += 4) {
319 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
320 | bv = _mm_or_si128(bv,_mm_sub_epi32(SUBI32x4(iv,vs),cv));
321 | vs = iv;
322 | }
323 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
324 | b = mm_hor_epi32(bv);
325 | if(px) *px = 0;
326 | return b;
327 | #else
328 | if(px) *px = 0; return 0; /* scalar fallback not implemented */
329 | #endif
330 | }
329 |
330 | void bitd1dec8( uint8_t *p, unsigned n, uint8_t start) { BITDD(uint8_t, p, n, 1); }
331 | void bitd1dec16(uint16_t *p, unsigned n, uint16_t start) { BITDD(uint16_t, p, n, 1); }
332 | void bitd1dec64(uint64_t *p, unsigned n, uint64_t start) { BITDD(uint64_t, p, n, 1); }
333 | void bitd1dec32(uint32_t *p, unsigned n, uint32_t start) {
334 | #if defined(__AVX2__) && defined(USE_AVX2)
335 | __m256i vs = _mm256_set1_epi32(start),zv = _mm256_setzero_si256(), cv = _mm256_set_epi32(8,7,6,5,4,3,2,1);
336 | unsigned *ip;
337 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) {
338 | __m256i v = _mm256_loadu_si256((__m256i *)ip); vs = mm256_scani_epi32(v, vs, cv);
339 | _mm256_storeu_si256((__m256i *)ip, vs);
340 | }
341 | start = (unsigned)_mm256_extract_epi32(vs, 7);
342 | while(ip != p+n) {
343 | *ip = (start += (*ip) + 1);
344 | ip++;
345 | }
346 | #elif defined(__SSE2__) || defined(__ARM_NEON)
347 | __m128i vs = _mm_set1_epi32(start), cv = _mm_set_epi32(4,3,2,1);
348 | unsigned *ip;
349 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) {
350 | __m128i v = _mm_loadu_si128((__m128i *)ip);
351 | vs = mm_scani_epi32(v, vs, cv);
352 | _mm_storeu_si128((__m128i *)ip, vs);
353 | }
354 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
355 | while(ip != p+n) {
356 | *ip = (start += (*ip) + 1);
357 | ip++;
358 | }
359 | #else
360 | BITDD(uint32_t, p, n, 1);
361 | #endif
362 | }
363 |
364 | //---------Delta encoding/decoding (min. Delta = mindelta) -------------------
365 | //determine min. delta for encoding w/ bitdiencNN function
366 | #define DI(_ip_,_i_) u = _ip_[_i_] - start; start = _ip_[_i_]; if(u < mindelta) mindelta = u
367 | #define BITDIE(_in_, _n_) {\
368 | for(_ip = _in_,mindelta = _ip[0]; _ip != _in_+(_n_&~(4-1)); _ip+=4) { DI(_ip,0); DI(_ip,1); DI(_ip,2); DI(_ip,3); }\
369 | for(;_ip != _in_+_n_;_ip++) DI(_ip,0);\
370 | }
371 |
372 | uint8_t bitdi8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { uint8_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; }
373 | uint16_t bitdi16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { uint16_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; }
374 | uint32_t bitdi32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { uint32_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; }
375 | uint64_t bitdi64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t mindelta,u,*_ip; BITDIE(in, n); if(px) *px = 0; return mindelta; }
376 |
377 | uint8_t bitdienc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o=0,x=0,*op = out,u,*ip; BITDE(uint8_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; }
378 | uint16_t bitdienc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o=0,x=0,*op = out,u,*ip; BITDE(uint16_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; }
379 | uint64_t bitdienc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o=0,x=0,*op = out,u,*ip; BITDE(uint64_t, in, n, mindelta, o |= u; x |= u ^ in[0]; *op++ = u); return o; }
380 | uint32_t bitdienc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) {
381 | #if defined(__SSE2__) || defined(__ARM_NEON)
382 | unsigned *ip,b,*op = out;
383 | __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), cv = _mm_set1_epi32(mindelta), dv;
384 | for(ip = in; ip != in+(n&~(4-1)); ip += 4,op += 4) {
385 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
386 | bv = _mm_or_si128(bv, dv = _mm_sub_epi32(mm_delta_epi32(iv,vs),cv));
387 | vs = iv;
388 | _mm_storeu_si128((__m128i *)op, dv);
389 | }
390 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
391 | b = mm_hor_epi32(bv);
392 | while(ip != in+n) {
393 | unsigned x = *ip-start-mindelta;
394 | start = *ip++;
395 | b |= x;
396 | *op++ = x;
397 | }
398 | #else
399 | uint32_t b = 0,*op = out, x, *_ip;
400 | BITDE(uint32_t, in, n, mindelta, b |= x; *op++ = x);
401 | #endif
402 | return b;
403 | }
404 |
405 | void bitdidec8( uint8_t *p, unsigned n, uint8_t start, uint8_t mindelta) { BITDD(uint8_t, p, n, mindelta); }
406 | void bitdidec16( uint16_t *p, unsigned n, uint16_t start, uint16_t mindelta) { BITDD(uint16_t, p, n, mindelta); }
407 | void bitdidec32( uint32_t *p, unsigned n, uint32_t start, uint32_t mindelta) { BITDD(uint32_t, p, n, mindelta); }
408 | void bitdidec64( uint64_t *p, unsigned n, uint64_t start, uint64_t mindelta) { BITDD(uint64_t, p, n, mindelta); }
409 |
410 | //------------------- For ------------------------------
411 | uint8_t bitf8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { if(px) *px = 0; return n?in[n-1] - start :0; }
412 | uint8_t bitf18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; }
413 | uint16_t bitf16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { if(px) *px = 0; return n?in[n-1] - start :0; }
414 | uint16_t bitf116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; }
415 | uint32_t bitf32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { if(px) *px = 0; return n?in[n-1] - start :0; }
416 | uint32_t bitf132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; }
417 | uint64_t bitf64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { if(px) *px = 0; return n?in[n-1] - start :0; }
418 | uint64_t bitf164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { if(px) *px = 0; return n?in[n-1] - start - n:0; }
419 |
420 | //------------------- Zigzag ---------------------------
421 | #define ZE(i,_it_,_usize_) u = TEMPLATE2(zigzagenc, _usize_)((_it_)_ip[i]-(_it_)start); start = _ip[i]
422 | #define BITZENC(_ut_, _it_, _usize_, _in_,_n_, _act_) { _ut_ *_ip; o = 0; x = -1;\
423 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZE(0,_it_,_usize_);_act_; ZE(1,_it_,_usize_);_act_; ZE(2,_it_,_usize_);_act_; ZE(3,_it_,_usize_);_act_; }\
424 | for(;_ip != _in_+_n_; _ip++) { ZE(0,_it_,_usize_); _act_; }\
425 | }
426 |
427 | // 'or' bits for zigzag encoding
428 | uint8_t  bitz8( uint8_t  *in, unsigned n, uint8_t  *px, uint8_t  start) { uint8_t  o, u,x; BITZENC(uint8_t,  int8_t,  8,in, n, o |= u); if(px) *px = 0; return o; }
429 | uint64_t bitz64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start) { uint64_t o, u,x; BITZENC(uint64_t, int64_t,64,in, n, o |= u); if(px) *px = 0; return o; }
430 |
431 | uint16_t bitz16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start) {
432 | uint16_t o, x, *ip; uint32_t u0 = zigzagenc16((int)in[0] - (int)start);
433 |
434 | #if defined(__SSE2__) || defined(__ARM_NEON)
435 | __m128i vb0 = _mm_set1_epi16(u0), vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
436 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi16(start);
437 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
438 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip);
439 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+8)); __m128i v0 = mm_delta_epi16(vi0,vs); vs = vi0; v0 = mm_zzage_epi16(v0);
440 | __m128i v1 = mm_delta_epi16(vi1,vs); vs = vi1; v1 = mm_zzage_epi16(v1);
441 | vo0 = _mm_or_si128(vo0, v0);
442 | vo1 = _mm_or_si128(vo1, v1);
443 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
444 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
445 | } start = _mm_cvtsi128_si16(_mm_srli_si128(vs,14));
446 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi16(vo0);
447 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi16(vx0);
448 | #else
449 |   ip = in; o = x = 0;
450 | #endif
451 | for(;ip != in+n; ip++) {
452 | uint16_t u = zigzagenc16((int)ip[0] - (int)start); //int i = ((int)(*ip) - (int)start); i = (i << 1) ^ (i >> 15);
453 | start = *ip;
454 | o |= u;
455 | x |= u ^ u0;
456 | }
457 | if(px) *px = x;
458 | return o;
459 | }
460 |
461 | uint32_t bitz32(unsigned *in, unsigned n, uint32_t *px, unsigned start) {
462 | uint32_t o, x, *ip; uint32_t u0 = zigzagenc32((int)in[0] - (int)start);
463 | #if defined(__AVX2__) && defined(USE_AVX2)
464 | __m256i vb0 = _mm256_set1_epi32(u0), vo0 = _mm256_setzero_si256(), vx0 = _mm256_setzero_si256(),
465 | vo1 = _mm256_setzero_si256(), vx1 = _mm256_setzero_si256(); __m256i vs = _mm256_set1_epi32(start);
466 | for(ip = in; ip != in+(n&~(16-1)); ip += 16) { PREFETCH(ip+512,0);
467 | __m256i vi0 = _mm256_loadu_si256((__m256i *) ip);
468 | __m256i vi1 = _mm256_loadu_si256((__m256i *)(ip+8)); __m256i v0 = mm256_delta_epi32(vi0,vs); vs = vi0; v0 = mm256_zzage_epi32(v0);
469 | __m256i v1 = mm256_delta_epi32(vi1,vs); vs = vi1; v1 = mm256_zzage_epi32(v1);
470 | vo0 = _mm256_or_si256(vo0, v0);
471 | vo1 = _mm256_or_si256(vo1, v1);
472 | vx0 = _mm256_or_si256(vx0, _mm256_xor_si256(v0, vb0));
473 | vx1 = _mm256_or_si256(vx1, _mm256_xor_si256(v1, vb0));
474 | } start = (unsigned)_mm256_extract_epi32(vs, 7);
475 | vo0 = _mm256_or_si256(vo0, vo1); o = mm256_hor_epi32(vo0);
476 | vx0 = _mm256_or_si256(vx0, vx1); x = mm256_hor_epi32(vx0);
477 |
478 | #elif defined(__SSE2__) || defined(__ARM_NEON)
479 | __m128i vb0 = _mm_set1_epi32(u0),
480 | vo0 = _mm_setzero_si128(), vx0 = _mm_setzero_si128(),
481 | vo1 = _mm_setzero_si128(), vx1 = _mm_setzero_si128(); __m128i vs = _mm_set1_epi32(start);
482 | for(ip = in; ip != in+(n&~(8-1)); ip += 8) { PREFETCH(ip+512,0);
483 | __m128i vi0 = _mm_loadu_si128((__m128i *) ip);
484 | __m128i vi1 = _mm_loadu_si128((__m128i *)(ip+4)); __m128i v0 = mm_delta_epi32(vi0,vs); vs = vi0; v0 = mm_zzage_epi32(v0);
485 | __m128i v1 = mm_delta_epi32(vi1,vs); vs = vi1; v1 = mm_zzage_epi32(v1);
486 | vo0 = _mm_or_si128(vo0, v0);
487 | vo1 = _mm_or_si128(vo1, v1);
488 | vx0 = _mm_or_si128(vx0, _mm_xor_si128(v0, vb0));
489 | vx1 = _mm_or_si128(vx1, _mm_xor_si128(v1, vb0));
490 |   } start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
491 | vo0 = _mm_or_si128(vo0, vo1); o = mm_hor_epi32(vo0);
492 | vx0 = _mm_or_si128(vx0, vx1); x = mm_hor_epi32(vx0);
493 | #else
494 | ip = in; o = x = 0; //uint32_t u; BITDE(uint32_t, in, n, 0, o |= u; x |= u^u0);
495 | #endif
496 | for(;ip != in+n; ip++) {
497 | uint32_t u = zigzagenc32((int)ip[0] - (int)start); start = *ip; //((int)(*ip) - (int)start); //i = (i << 1) ^ (i >> 31);
498 | o |= u;
499 | x |= u ^ u0;
500 | }
501 | if(px) *px = x;
502 | return o;
503 | }
504 |
505 | uint8_t bitzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta) { uint8_t o,x,u,*op = out; BITZENC(uint8_t, int8_t, 8,in, n, o |= u; *op++ = u); return o; }
506 | uint16_t bitzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta) { uint16_t o,x,u,*op = out; BITZENC(uint16_t, int16_t,16,in, n, o |= u; *op++ = u); return o; }
507 | uint64_t bitzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta) { uint64_t o,x,u,*op = out; BITZENC(uint64_t, int64_t,64,in, n, o |= u; *op++ = u); return o; }
508 | uint32_t bitzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta) {
509 | #if defined(__SSE2__) || defined(__ARM_NEON)
510 | unsigned *ip,b,*op = out;
511 | __m128i bv = _mm_setzero_si128(), vs = _mm_set1_epi32(start), dv;
512 | for(ip = in; ip != in+(n&~(4-1)); ip += 4,op += 4) {
513 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
514 | dv = mm_delta_epi32(iv,vs); vs = iv;
515 | dv = mm_zzage_epi32(dv);
516 | bv = _mm_or_si128(bv, dv);
517 | _mm_storeu_si128((__m128i *)op, dv);
518 | }
519 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
520 | b = mm_hor_epi32(bv);
521 | while(ip != in+n) {
522 | int x = ((int)(*ip)-(int)start);
523 | x = (x << 1) ^ (x >> 31);
524 | start = *ip++;
525 | b |= x;
526 | *op++ = x;
527 | }
528 | #else
529 |   uint32_t b, o, x, u, *op = out;
530 |   BITZENC(uint32_t, int32_t, 32, in, n, o |= u; *op++ = u); b = o;
531 | #endif
532 |   return b;
533 | }
534 |
535 | #define ZD(_t_, _usize_, i) { _t_ _z = _ip[i]; _ip[i] = (start += TEMPLATE2(zigzagdec, _usize_)(_z)); }
536 | #define BITZDEC(_t_, _usize_, _in_, _n_) { _t_ *_ip;\
537 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { ZD(_t_, _usize_, 0); ZD(_t_, _usize_, 1); ZD(_t_, _usize_, 2); ZD(_t_, _usize_, 3); }\
538 | for(;_ip != _in_+_n_;_ip++) ZD(_t_, _usize_, 0);\
539 | }
540 |
541 | void bitzdec8( uint8_t *p, unsigned n, uint8_t start) { BITZDEC(uint8_t, 8, p, n); }
542 | void bitzdec64(uint64_t *p, unsigned n, uint64_t start) { BITZDEC(uint64_t, 64,p, n); }
543 |
544 | void bitzdec16(uint16_t *p, unsigned n, uint16_t start) {
545 | #if defined(__SSSE3__) || defined(__ARM_NEON)
546 | __m128i vs = _mm_set1_epi16(start); //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128();
547 | uint16_t *ip;
548 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) {
549 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
550 | iv = mm_zzagd_epi16(iv);
551 | vs = mm_scan_epi16(iv, vs);
552 | _mm_storeu_si128((__m128i *)ip, vs);
553 | }
554 | start = (uint16_t)_mm_cvtsi128_si32(_mm_srli_si128(vs,14));
555 | while(ip != p+n) {
556 | uint16_t z = *ip;
557 | *ip++ = (start += (z >> 1 ^ -(z & 1)));
558 | }
559 | #else
560 | BITZDEC(uint16_t, 16, p, n);
561 | #endif
562 | }
563 |
564 | void bitzdec32(unsigned *p, unsigned n, unsigned start) {
565 | #if defined(__AVX2__) && defined(USE_AVX2)
566 | __m256i vs = _mm256_set1_epi32(start); //, zv = _mm256_setzero_si256()*/; //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128();
567 | unsigned *ip;
568 | for(ip = p; ip != p+(n&~(8-1)); ip += 8) {
569 | __m256i iv = _mm256_loadu_si256((__m256i *)ip);
570 | iv = mm256_zzagd_epi32(iv);
571 | vs = mm256_scan_epi32(iv,vs);
572 | _mm256_storeu_si256((__m256i *)ip, vs);
573 | }
574 |   start = (unsigned)_mm256_extract_epi32(vs, 7);
575 | while(ip != p+n) {
576 | unsigned z = *ip;
577 | *ip++ = (start += (z >> 1 ^ -(z & 1)));
578 | }
579 | #elif defined(__SSE2__) || defined(__ARM_NEON)
580 | __m128i vs = _mm_set1_epi32(start); //, c1 = _mm_set1_epi32(1), cz = _mm_setzero_si128();
581 | unsigned *ip;
582 | for(ip = p; ip != p+(n&~(4-1)); ip += 4) {
583 | __m128i iv = _mm_loadu_si128((__m128i *)ip);
584 | iv = mm_zzagd_epi32(iv);
585 | vs = mm_scan_epi32(iv, vs);
586 | _mm_storeu_si128((__m128i *)ip, vs);
587 | }
588 | start = (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(vs,12));
589 | while(ip != p+n) {
590 | unsigned z = *ip;
591 | *ip++ = (start += zigzagdec32(z));
592 | }
593 | #else
594 | BITZDEC(uint32_t, 32, p, n);
595 | #endif
596 | }
597 |
598 | //----------------------- XOR transform : returns the 'or' of all xored values ---------------------------------
599 | #define XE(i) x = _ip[i] ^ start; start = _ip[i]
600 | #define BITXENC(_t_, _in_, _n_, _act_) { _t_ *_ip;\
601 | for(_ip = _in_; _ip != _in_+(_n_&~(4-1)); _ip += 4) { XE(0);_act_; XE(1);_act_; XE(2);_act_; XE(3);_act_; }\
602 | for( ; _ip != _in_+ _n_; _ip++ ) { XE(0);_act_; }\
603 | }
604 | uint8_t bitxenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start) { uint8_t b = 0,*op = out,x; BITXENC(uint8_t, in, n, b |= x; *op++ = x); return b; }
605 | uint16_t bitxenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start) { uint16_t b = 0,*op = out,x; BITXENC(uint16_t, in, n, b |= x; *op++ = x); return b; }
606 | uint32_t bitxenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start) { uint32_t b = 0,*op = out,x; BITXENC(uint32_t, in, n, b |= x; *op++ = x); return b; }
607 | uint64_t bitxenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start) { uint64_t b = 0,*op = out,x; BITXENC(uint64_t, in, n, b |= x; *op++ = x); return b; }
608 |
609 | #define XD(i) _ip[i] = (start ^= _ip[i])
610 | #define BITXDEC(_t_, _in_, _n_) { _t_ *_ip;\
611 | for(_ip = _in_;_ip != _in_+(_n_&~(4-1)); _ip += 4) { XD(0); XD(1); XD(2); XD(3); }\
612 | for( ;_ip != _in_+ _n_ ; _ip++ ) XD(0);\
613 | }
614 |
615 | void bitxdec8( uint8_t *p, unsigned n, uint8_t start) { BITXDEC(uint8_t, p, n); }
616 | void bitxdec16(uint16_t *p, unsigned n, uint16_t start) { BITXDEC(uint16_t, p, n); }
617 | void bitxdec32(uint32_t *p, unsigned n, uint32_t start) { BITXDEC(uint32_t, p, n); }
618 | void bitxdec64(uint64_t *p, unsigned n, uint64_t start) { BITXDEC(uint64_t, p, n); }
619 |
620 | //-------------- For : calc max. bits, min,max value ------------------------
621 | #define FM(i) mi = _ip[i] < mi?_ip[i]:mi; mx = _ip[i] > mx?_ip[i]:mx
622 | #define BITFM(_t_, _in_,_n_) { _t_ *_ip; \
623 | for(_ip = _in_, mi = mx = *_ip; _ip != _in_+(_n_&~(4-1)); _ip += 4) { FM(0); FM(1); FM(2); FM(3); }\
624 | for(;_ip != _in_+_n_; _ip++) FM(0);\
625 | }
626 |
627 | uint8_t bitfm8( uint8_t *in, unsigned n, uint8_t *px, uint8_t *pmin) { uint8_t mi,mx; BITFM(uint8_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; }
628 | uint16_t bitfm16(uint16_t *in, unsigned n, uint16_t *px, uint16_t *pmin) { uint16_t mi,mx; BITFM(uint16_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; }
629 | uint32_t bitfm32(uint32_t *in, unsigned n, uint32_t *px, uint32_t *pmin) { uint32_t mi,mx; BITFM(uint32_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; }
630 | uint64_t bitfm64(uint64_t *in, unsigned n, uint64_t *px, uint64_t *pmin) { uint64_t mi,mx; BITFM(uint64_t, in, n); *pmin = mi; if(px) *px = 0; return mx - mi; }
631 |
632 | //----------- Lossy floating point conversion: pad the trailing mantissa bits with zeros according to the relative error e (e.g. 0.00001) ----------
633 | #ifdef USE_FLOAT16
634 | // https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point
635 | #define ctof16(_cp_) (*(_Float16 *)(_cp_))
636 |
637 | static inline _Float16 _fppad16(_Float16 d, float e, int lg2e) {
638 | uint16_t u, du = ctou16(&d);
639 | int b = (du>>10 & 0x1f)-15; // mantissa=10 bits, exponent=5bits, bias=15
640 | if ((b = 12 - b - lg2e) <= 0) return d;
641 | b = (b > 10) ? 10 : b;
642 | do { u = du & (~((1u<<(--b))-1)); } while (fabs((ctof16(&u) - d)/d) > e);
643 | return ctof16(&u);
644 | }
645 |
646 | void fppad16(_Float16 *in, size_t n, _Float16 *out, float e) { int lg2e = -log(e)/log(2.0); _Float16 *ip; for (ip = in; ip < in+n; ip++,out++) *out = _fppad16(*ip, e, lg2e); }
647 | #endif
648 |
649 | //do u = du & (~((1u<<(--b))-1)); while(fabsf((ctof32(&u) - d)/d) > e);
650 | #define OP(t,s) sign = du & ((t)1<<(s-1)); du &= ~((t)1<<(s-1)); d = TEMPLATE2(ctof,s)(&du);\
651 | do u = du & (~(((t)1<<(--b))-1)); while(d - TEMPLATE2(ctof,s)(&u) > e*d);\
652 | u |= sign;\
653 | return TEMPLATE2(ctof,s)(&u);
654 |
655 | static inline float _fppad32(float d, float e, int lg2e) {
656 | uint32_t u, du = ctou32(&d), sign;
657 | int b = (du>>23 & 0xff)-0x7e;
658 | if((b = 25 - b - lg2e) <= 0)
659 | return d;
660 | b = b > 23?23:b;
661 |   sign = du & (1u<<31);
662 | du &= 0x7fffffffu;
663 | d = ctof32(&du);
664 | do u = du & (~((1u<<(--b))-1)); while(d - ctof32(&u) > e*d);
665 | u |= sign;
666 | return ctof32(&u);
667 | }
668 |
669 | void fppad32(float *in, size_t n, float *out, float e) { int lg2e = -log(e)/log(2.0); float *ip; for(ip = in; ip < in+n; ip++,out++) *out = _fppad32(*ip, e, lg2e); }
670 |
671 | static inline double _fppad64(double d, double e, int lg2e) {
672 | union r { uint64_t u; double d; } u,du; du.d = d;
673 | uint64_t sign;
674 | int b = (du.u>>52 & 0x7ff)-0x3fe;
675 | if((b = 54 - b - lg2e) <= 0)
676 | return d;
677 | b = b > 52?52:b;
678 | sign = du.u & (1ull<<63); du.u &= 0x7fffffffffffffffull;
679 | int _b = b;
680 | for(;;) { if((_b -= 8) <= 0) break; u.u = du.u & (~((1ull<<_b)-1)); if(d - u.d <= e*d) break; b = _b; }
681 | do u.u = du.u & (~((1ull<<(--b))-1)); while(d - u.d > e*d);
682 | u.u |= sign;
683 |   return u.d;
684 | }
685 |
686 | void fppad64(double *in, size_t n, double *out, double e) { int lg2e = -log(e)/log(2.0); double *ip; for(ip = in; ip < in+n; ip++,out++) *out = _fppad64(*ip, e, lg2e); }
687 |
--------------------------------------------------------------------------------
/bitutil.h:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // "Integer Compression: max.bits, delta, zigzag, xor"
25 |
26 | #ifdef BITUTIL_IN
27 | #ifdef __AVX2__
28 | #include <immintrin.h>
29 | #elif defined(__AVX__)
30 | #include <immintrin.h>
31 | #elif defined(__SSE4_1__)
32 | #include <smmintrin.h>
33 | #elif defined(__SSSE3__)
34 | #include <tmmintrin.h>
35 | #elif defined(__SSE2__)
36 | #include <emmintrin.h>
37 | #elif defined(__ARM_NEON)
38 | #include <arm_neon.h>
39 | #endif
40 | #if defined(_MSC_VER) && _MSC_VER < 1600
41 | #include "vs/stdint.h"
42 | #else
43 | #include <stdint.h>
44 | #endif
45 | #include "sse_neon.h"
46 |
47 | #ifdef __ARM_NEON
48 | #define PREFETCH(_ip_,_rw_)
49 | #else
50 | #define PREFETCH(_ip_,_rw_) __builtin_prefetch(_ip_,_rw_)
51 | #endif
52 | //------------------------ zigzag encoding -------------------------------------------------------------
53 | static inline unsigned char zigzagenc8( signed char x) { return x << 1 ^ x >> 7; }
54 | static inline char zigzagdec8( unsigned char x) { return x >> 1 ^ -(x & 1); }
55 |
56 | static inline unsigned short zigzagenc16(short x) { return x << 1 ^ x >> 15; }
57 | static inline short zigzagdec16(unsigned short x) { return x >> 1 ^ -(x & 1); }
58 |
59 | static inline unsigned zigzagenc32(int x) { return x << 1 ^ x >> 31; }
60 | static inline int zigzagdec32(unsigned x) { return x >> 1 ^ -(x & 1); }
61 |
62 | static inline uint64_t zigzagenc64(int64_t x) { return x << 1 ^ x >> 63; }
63 | static inline int64_t zigzagdec64(uint64_t x) { return x >> 1 ^ -(x & 1); }
64 |
65 | #if defined(__SSE2__) || defined(__ARM_NEON)
66 | static ALWAYS_INLINE __m128i mm_zzage_epi16(__m128i v) { return _mm_xor_si128(_mm_slli_epi16(v,1), _mm_srai_epi16(v,15)); }
67 | static ALWAYS_INLINE __m128i mm_zzage_epi32(__m128i v) { return _mm_xor_si128(_mm_slli_epi32(v,1), _mm_srai_epi32(v,31)); }
68 | //static ALWAYS_INLINE __m128i mm_zzage_epi64(__m128i v) { return _mm_xor_si128(_mm_slli_epi64(v,1), _mm_srai_epi64(v,63)); }
69 |
70 | static ALWAYS_INLINE __m128i mm_zzagd_epi16(__m128i v) { return _mm_xor_si128(_mm_srli_epi16(v,1), _mm_srai_epi16(_mm_slli_epi16(v,15),15) ); }
71 | static ALWAYS_INLINE __m128i mm_zzagd_epi32(__m128i v) { return _mm_xor_si128(_mm_srli_epi32(v,1), _mm_srai_epi32(_mm_slli_epi32(v,31),31) ); }
72 | //static ALWAYS_INLINE __m128i mm_zzagd_epi64(__m128i v) { return _mm_xor_si128(_mm_srli_epi64(v,1), _mm_srai_epi64(_mm_slli_epi64(v,63),63) ); }
73 |
74 | #endif
75 | #ifdef __AVX2__
76 | static ALWAYS_INLINE __m256i mm256_zzage_epi32(__m256i v) { return _mm256_xor_si256(_mm256_slli_epi32(v,1), _mm256_srai_epi32(v,31)); }
77 | static ALWAYS_INLINE __m256i mm256_zzagd_epi32(__m256i v) { return _mm256_xor_si256(_mm256_srli_epi32(v,1), _mm256_srai_epi32(_mm256_slli_epi32(v,31),31) ); }
78 | #endif
79 |
80 | //-------------- AVX2 delta + prefix sum (scan) / xor encode/decode ---------------------------------------------------------------------------------------
81 | #ifdef __AVX2__
82 | static ALWAYS_INLINE __m256i mm256_delta_epi32(__m256i v, __m256i sv) { return _mm256_sub_epi32(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 12)); }
83 | static ALWAYS_INLINE __m256i mm256_delta_epi64(__m256i v, __m256i sv) { return _mm256_sub_epi64(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 8)); }
84 | static ALWAYS_INLINE __m256i mm256_xore_epi32( __m256i v, __m256i sv) { return _mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 12)); }
85 | static ALWAYS_INLINE __m256i mm256_xore_epi64( __m256i v, __m256i sv) { return _mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2f128_si256(sv, v, _MM_SHUFFLE(0, 2, 0, 1)), 8)); }
86 |
87 | static ALWAYS_INLINE __m256i mm256_scan_epi32(__m256i v, __m256i sv) {
88 | v = _mm256_add_epi32(v, _mm256_slli_si256(v, 4));
89 | v = _mm256_add_epi32(v, _mm256_slli_si256(v, 8));
90 | return _mm256_add_epi32( _mm256_permute2x128_si256( _mm256_shuffle_epi32(sv,_MM_SHUFFLE(3, 3, 3, 3)), sv, 0x11),
91 | _mm256_add_epi32(v, _mm256_permute2x128_si256(_mm256_setzero_si256(),_mm256_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3)), 0x20)));
92 | }
93 | static ALWAYS_INLINE __m256i mm256_xord_epi32(__m256i v, __m256i sv) {
94 | v = _mm256_xor_si256(v, _mm256_slli_si256(v, 4));
95 | v = _mm256_xor_si256(v, _mm256_slli_si256(v, 8));
96 | return _mm256_xor_si256( _mm256_permute2x128_si256( _mm256_shuffle_epi32(sv,_MM_SHUFFLE(3, 3, 3, 3)), sv, 0x11),
97 | _mm256_xor_si256(v, _mm256_permute2x128_si256(_mm256_setzero_si256(),_mm256_shuffle_epi32(v, _MM_SHUFFLE(3, 3, 3, 3)), 0x20)));
98 | }
99 |
100 | static ALWAYS_INLINE __m256i mm256_scan_epi64(__m256i v, __m256i sv) {
101 | v = _mm256_add_epi64(v, _mm256_alignr_epi8(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), 8));
102 | return _mm256_add_epi64(_mm256_permute4x64_epi64(sv, _MM_SHUFFLE(3, 3, 3, 3)), _mm256_add_epi64(_mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), v) );
103 | }
104 | static ALWAYS_INLINE __m256i mm256_xord_epi64(__m256i v, __m256i sv) {
105 | v = _mm256_xor_si256(v, _mm256_alignr_epi8(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), 8));
106 | return _mm256_xor_si256(_mm256_permute4x64_epi64(sv, _MM_SHUFFLE(3, 3, 3, 3)), _mm256_xor_si256(_mm256_permute2x128_si256(v, v, _MM_SHUFFLE(0, 0, 2, 0)), v) );
107 | }
108 |
109 | static ALWAYS_INLINE __m256i mm256_scani_epi32(__m256i v, __m256i sv, __m256i vi) { return _mm256_add_epi32(mm256_scan_epi32(v, sv), vi); }
110 | #endif
111 |
112 | #if defined(__SSSE3__) || defined(__ARM_NEON)
113 | static ALWAYS_INLINE __m128i mm_delta_epi16(__m128i v, __m128i sv) { return _mm_sub_epi16(v, _mm_alignr_epi8(v, sv, 14)); }
114 | static ALWAYS_INLINE __m128i mm_delta_epi32(__m128i v, __m128i sv) { return _mm_sub_epi32(v, _mm_alignr_epi8(v, sv, 12)); }
115 | static ALWAYS_INLINE __m128i mm_xore_epi16( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_alignr_epi8(v, sv, 14)); }
116 | static ALWAYS_INLINE __m128i mm_xore_epi32( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_alignr_epi8(v, sv, 12)); }
117 |
118 | #define MM_HDEC_EPI16(_v_,_sv_,_hop_) {\
119 | _v_ = _hop_( _v_, _mm_slli_si128(_v_, 2));\
120 | _v_ = _hop_( _v_, _mm_slli_si128(_v_, 4));\
121 | _v_ = _hop_(_hop_(_v_, _mm_slli_si128(_v_, 8)), _mm_shuffle_epi8(_sv_, _mm_set1_epi16(0x0f0e)));\
122 | }
123 |
124 | static ALWAYS_INLINE __m128i mm_scan_epi16(__m128i v, __m128i sv) { MM_HDEC_EPI16(v,sv,_mm_add_epi16); return v; }
125 | static ALWAYS_INLINE __m128i mm_xord_epi16(__m128i v, __m128i sv) { MM_HDEC_EPI16(v,sv,_mm_xor_si128); return v; }
126 | #elif defined(__SSE2__)
127 | static ALWAYS_INLINE __m128i mm_delta_epi16(__m128i v, __m128i sv) { return _mm_sub_epi16(v, _mm_or_si128(_mm_srli_si128(sv, 14), _mm_slli_si128(v, 2))); }
128 | static ALWAYS_INLINE __m128i mm_xore_epi16( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_or_si128(_mm_srli_si128(sv, 14), _mm_slli_si128(v, 2))); }
129 | static ALWAYS_INLINE __m128i mm_delta_epi32(__m128i v, __m128i sv) { return _mm_sub_epi32(v, _mm_or_si128(_mm_srli_si128(sv, 12), _mm_slli_si128(v, 4))); }
130 | static ALWAYS_INLINE __m128i mm_xore_epi32( __m128i v, __m128i sv) { return _mm_xor_si128(v, _mm_or_si128(_mm_srli_si128(sv, 12), _mm_slli_si128(v, 4))); }
131 | #endif
132 |
133 | #if defined(__SSE2__) || defined(__ARM_NEON)
134 | #define MM_HDEC_EPI32(_v_,_sv_,_hop_) { _v_ = _hop_(_v_, _mm_slli_si128(_v_, 4)); _v_ = _hop_(mm_shuffle_nnnn_epi32(_sv_, 3), _hop_(_mm_slli_si128(_v_, 8), _v_)); }
135 | static ALWAYS_INLINE __m128i mm_scan_epi32(__m128i v, __m128i sv) { MM_HDEC_EPI32(v,sv,_mm_add_epi32); return v; }
136 | static ALWAYS_INLINE __m128i mm_xord_epi32(__m128i v, __m128i sv) { MM_HDEC_EPI32(v,sv,_mm_xor_si128); return v; }
137 |
138 | //-------- scan with vi delta > 0 -----------------------------
139 | static ALWAYS_INLINE __m128i mm_scani_epi16(__m128i v, __m128i sv, __m128i vi) { return _mm_add_epi16(mm_scan_epi16(v, sv), vi); }
140 | static ALWAYS_INLINE __m128i mm_scani_epi32(__m128i v, __m128i sv, __m128i vi) { return _mm_add_epi32(mm_scan_epi32(v, sv), vi); }
141 | #endif
142 |
143 | //------------------ Horizontal OR -----------------------------------------------
144 | #ifdef __AVX2__
145 | static ALWAYS_INLINE unsigned mm256_hor_epi32(__m256i v) {
146 | v = _mm256_or_si256(v, _mm256_srli_si256(v, 8));
147 | v = _mm256_or_si256(v, _mm256_srli_si256(v, 4));
148 | return _mm256_extract_epi32(v,0) | _mm256_extract_epi32(v, 4);
149 | }
150 |
151 | static ALWAYS_INLINE uint64_t mm256_hor_epi64(__m256i v) {
152 | v = _mm256_or_si256(v, _mm256_permute2x128_si256(v, v, _MM_SHUFFLE(2, 0, 0, 1)));
153 | return _mm256_extract_epi64(v, 1) | _mm256_extract_epi64(v,0);
154 | }
155 | #endif
156 |
157 | #if defined(__SSE2__) || defined(__ARM_NEON)
158 | #define MM_HOZ_EPI16(v,_hop_) {\
159 | v = _hop_(v, _mm_srli_si128(v, 8));\
160 | v = _hop_(v, _mm_srli_si128(v, 6));\
161 | v = _hop_(v, _mm_srli_si128(v, 4));\
162 | v = _hop_(v, _mm_srli_si128(v, 2));\
163 | }
164 |
165 | #define MM_HOZ_EPI32(v,_hop_) {\
166 | v = _hop_(v, _mm_srli_si128(v, 8));\
167 | v = _hop_(v, _mm_srli_si128(v, 4));\
168 | }
169 |
170 | static ALWAYS_INLINE uint16_t mm_hor_epi16( __m128i v) { MM_HOZ_EPI16(v,_mm_or_si128); return (unsigned short)_mm_cvtsi128_si32(v); }
171 | static ALWAYS_INLINE uint32_t mm_hor_epi32( __m128i v) { MM_HOZ_EPI32(v,_mm_or_si128); return (unsigned )_mm_cvtsi128_si32(v); }
172 | static ALWAYS_INLINE uint64_t mm_hor_epi64( __m128i v) { v = _mm_or_si128( v, _mm_srli_si128(v, 8)); return (uint64_t )_mm_cvtsi128_si64(v); }
173 | #endif
174 |
175 | //----------------- sub / add ----------------------------------------------------------
176 | #if defined(__SSE2__) || defined(__ARM_NEON)
177 | #define SUBI16x8(_v_, _sv_) _mm_sub_epi16(_v_, _sv_)
178 | #define SUBI32x4(_v_, _sv_) _mm_sub_epi32(_v_, _sv_)
179 | #define ADDI16x8(_v_, _sv_, _vi_) _sv_ = _mm_add_epi16(_mm_add_epi16(_sv_, _vi_),_v_)
180 | #define ADDI32x4(_v_, _sv_, _vi_) _sv_ = _mm_add_epi32(_mm_add_epi32(_sv_, _vi_),_v_)
181 |
182 | //---------------- Convert _mm_cvtsi128_siXX -------------------------------------------
183 | static ALWAYS_INLINE uint8_t _mm_cvtsi128_si8 (__m128i v) { return (uint8_t )_mm_cvtsi128_si32(v); }
184 | static ALWAYS_INLINE uint16_t _mm_cvtsi128_si16(__m128i v) { return (uint16_t)_mm_cvtsi128_si32(v); }
185 | #endif
186 |
187 | //--------- memset -----------------------------------------
188 | #define BITFORSET_(_out_, _n_, _start_, _mindelta_) do { unsigned _i;\
189 | for(_i = 0; _i != (_n_&~3); _i+=4) { \
190 | _out_[_i+0] = _start_+(_i )*_mindelta_; \
191 | _out_[_i+1] = _start_+(_i+1)*_mindelta_; \
192 | _out_[_i+2] = _start_+(_i+2)*_mindelta_; \
193 | _out_[_i+3] = _start_+(_i+3)*_mindelta_; \
194 | } \
195 | while(_i != _n_) \
196 | _out_[_i] = _start_+_i*_mindelta_, ++_i; \
197 | } while(0)
198 |
199 | //--------- SIMD zero -----------------------------------------
200 | #ifdef __AVX2__
201 | #define BITZERO32(_out_, _n_, _start_) do {\
202 | __m256i _sv_ = _mm256_set1_epi32(_start_), *_ov = (__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_);\
203 | do _mm256_storeu_si256(_ov++, _sv_); while(_ov < _ove);\
204 | } while(0)
205 |
206 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) do {\
207 |   __m256i _sv = _mm256_set1_epi32(_start_), *_ov=(__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_), _cv = _mm256_set_epi32(7*_mindelta_,6*_mindelta_,5*_mindelta_,4*_mindelta_,3*_mindelta_,2*_mindelta_,1*_mindelta_,0); \
208 |   _sv = _mm256_add_epi32(_sv, _cv);\
209 |   _cv = _mm256_set1_epi32(8*_mindelta_);\
210 | do { _mm256_storeu_si256(_ov++, _sv); _sv = _mm256_add_epi32(_sv, _cv); } while(_ov < _ove);\
211 | } while(0)
212 |
213 | #define BITDIZERO32(_out_, _n_, _start_, _mindelta_) do { __m256i _sv = _mm256_set1_epi32(_start_), _cv = _mm256_set_epi32(7+_mindelta_,6+_mindelta_,5+_mindelta_,4+_mindelta_,3+_mindelta_,2+_mindelta_,1+_mindelta_,_mindelta_), *_ov=(__m256i *)(_out_), *_ove = (__m256i *)(_out_ + _n_);\
214 |   _sv = _mm256_add_epi32(_sv, _cv); _cv = _mm256_set1_epi32(8*_mindelta_); do { _mm256_storeu_si256(_ov++, _sv), _sv = _mm256_add_epi32(_sv, _cv); } while(_ov < _ove);\
215 | } while(0)
216 |
217 | #elif defined(__SSE2__) || defined(__ARM_NEON) // -------------
218 | // SIMD set value (memset)
219 | #define BITZERO32(_out_, _n_, _v_) do {\
220 | __m128i _sv_ = _mm_set1_epi32(_v_), *_ov = (__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_);\
221 | do _mm_storeu_si128(_ov++, _sv_); while(_ov < _ove); \
222 | } while(0)
223 |
224 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) do {\
225 | __m128i _sv = _mm_set1_epi32(_start_), *_ov=(__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_), _cv = _mm_set_epi32(3*_mindelta_,2*_mindelta_,1*_mindelta_,0); \
226 | _sv = _mm_add_epi32(_sv, _cv);\
227 | _cv = _mm_set1_epi32(4);\
228 | do { _mm_storeu_si128(_ov++, _sv); _sv = _mm_add_epi32(_sv, _cv); } while(_ov < _ove);\
229 | } while(0)
230 |
231 | #define BITDIZERO32(_out_, _n_, _start_, _mindelta_) do { __m128i _sv = _mm_set1_epi32(_start_), _cv = _mm_set_epi32(3+_mindelta_,2+_mindelta_,1+_mindelta_,_mindelta_), *_ov=(__m128i *)(_out_), *_ove = (__m128i *)(_out_ + _n_);\
232 | _sv = _mm_add_epi32(_sv, _cv); _cv = _mm_set1_epi32(4*_mindelta_); do { _mm_storeu_si128(_ov++, _sv), _sv = _mm_add_epi32(_sv, _cv); } while(_ov < _ove);\
233 | } while(0)
234 | #else
235 | #define BITFORZERO32(_out_, _n_, _start_, _mindelta_) BITFORSET_(_out_, _n_, _start_, _mindelta_)
236 | #define BITZERO32( _out_, _n_, _start_) BITFORSET_(_out_, _n_, _start_, 0)
237 | #endif
238 |
239 | #define DELTR( _in_, _n_, _start_, _mindelta_, _out_) { unsigned _v; for( _v = 0; _v < _n_; _v++) _out_[_v] = _in_[_v] - (_start_) - _v*(_mindelta_) - (_mindelta_); }
240 | #define DELTRB(_in_, _n_, _start_, _mindelta_, _b_, _out_) { unsigned _v; for(_b_=0,_v = 0; _v < _n_; _v++) _out_[_v] = _in_[_v] - (_start_) - _v*(_mindelta_) - (_mindelta_), _b_ |= _out_[_v]; _b_ = bsr32(_b_); }
241 |
242 | //----------------------------------------- bitreverse scalar + SIMD -------------------------------------------
243 | #if __clang__ //__has_builtin(__builtin_bitreverse64)
244 | #define rbit8(x) __builtin_bitreverse8( x)
245 | #define rbit16(x) __builtin_bitreverse16(x)
246 | #define rbit32(x) __builtin_bitreverse32(x)
247 | #define rbit64(x) __builtin_bitreverse64(x)
248 | #else
249 |
250 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
251 | static ALWAYS_INLINE uint32_t _rbit_(uint32_t x) { uint32_t rc; __asm volatile ("rbit %0, %1" : "=r" (rc) : "r" (x) ); return rc; }
252 | #endif
253 | static ALWAYS_INLINE uint8_t rbit8(uint8_t x) {
254 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
255 | return _rbit_(x) >> 24;
256 | #elif 0
257 | x = (x & 0xaa) >> 1 | (x & 0x55) << 1;
258 | x = (x & 0xcc) >> 2 | (x & 0x33) << 2;
259 | return x << 4 | x >> 4;
260 | #else
261 | return (x * 0x0202020202ull & 0x010884422010ull) % 1023;
262 | #endif
263 | }
264 |
265 | static ALWAYS_INLINE uint16_t rbit16(uint16_t x) {
266 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
267 | return _rbit_(x) >> 16;
268 | #else
269 | x = (x & 0xaaaa) >> 1 | (x & 0x5555) << 1;
270 | x = (x & 0xcccc) >> 2 | (x & 0x3333) << 2;
271 | x = (x & 0xf0f0) >> 4 | (x & 0x0f0f) << 4;
272 | return x << 8 | x >> 8;
273 | #endif
274 | }
275 |
276 | static ALWAYS_INLINE uint32_t rbit32(uint32_t x) {
277 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
278 | return _rbit_(x);
279 | #else
280 | x = ((x & 0xaaaaaaaa) >> 1 | (x & 0x55555555) << 1);
281 | x = ((x & 0xcccccccc) >> 2 | (x & 0x33333333) << 2);
282 | x = ((x & 0xf0f0f0f0) >> 4 | (x & 0x0f0f0f0f) << 4);
283 | x = ((x & 0xff00ff00) >> 8 | (x & 0x00ff00ff) << 8);
284 | return x << 16 | x >> 16;
285 | #endif
286 | }
287 | static ALWAYS_INLINE uint64_t rbit64(uint64_t x) {
288 | #if (__CORTEX_M >= 0x03u) || (__CORTEX_SC >= 300u)
289 | return (uint64_t)_rbit_(x) << 32 | _rbit_(x >> 32);
290 | #else
291 | x = (x & 0xaaaaaaaaaaaaaaaa) >> 1 | (x & 0x5555555555555555) << 1;
292 | x = (x & 0xcccccccccccccccc) >> 2 | (x & 0x3333333333333333) << 2;
293 | x = (x & 0xf0f0f0f0f0f0f0f0) >> 4 | (x & 0x0f0f0f0f0f0f0f0f) << 4;
294 | x = (x & 0xff00ff00ff00ff00) >> 8 | (x & 0x00ff00ff00ff00ff) << 8;
295 | x = (x & 0xffff0000ffff0000) >> 16 | (x & 0x0000ffff0000ffff) << 16;
296 | return x << 32 | x >> 32;
297 | #endif
298 | }
299 | #endif
300 |
301 | #if defined(__SSSE3__) || defined(__ARM_NEON)
302 | static ALWAYS_INLINE __m128i mm_rbit_epi16(__m128i v) { return mm_rbit_epi8(mm_rev_epi16(v)); }
303 | static ALWAYS_INLINE __m128i mm_rbit_epi32(__m128i v) { return mm_rbit_epi8(mm_rev_epi32(v)); }
304 | static ALWAYS_INLINE __m128i mm_rbit_epi64(__m128i v) { return mm_rbit_epi8(mm_rev_epi64(v)); }
305 | //static ALWAYS_INLINE __m128i mm_rbit_si128(__m128i v) { return mm_rbit_epi8(mm_rev_si128(v)); }
306 | #endif
307 |
308 | #ifdef __AVX2__
309 | static ALWAYS_INLINE __m256i mm256_rbit_epi8(__m256i v) {
310 | __m256i fv = _mm256_setr_epi8(0, 8, 4,12, 2,10, 6,14, 1, 9, 5,13, 3,11, 7,15, 0, 8, 4,12, 2,10, 6,14, 1, 9, 5,13, 3,11, 7,15), cv0f_8 = _mm256_set1_epi8(0xf);
311 | __m256i lv = _mm256_shuffle_epi8(fv,_mm256_and_si256( v, cv0f_8));
312 | __m256i hv = _mm256_shuffle_epi8(fv,_mm256_and_si256(_mm256_srli_epi64(v, 4), cv0f_8));
313 | return _mm256_or_si256(_mm256_slli_epi64(lv,4), hv);
314 | }
315 |
316 | static ALWAYS_INLINE __m256i mm256_rev_epi16(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,11,10,13,12,15,14, 1, 0, 3, 2, 5, 4, 7, 6, 9, 8,11,10,13,12,15,14)); }
317 | static ALWAYS_INLINE __m256i mm256_rev_epi32(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 3, 2, 1, 0, 7, 6, 5, 4, 11,10, 9, 8,15,14,13,12, 3, 2, 1, 0, 7, 6, 5, 4, 11,10, 9, 8,15,14,13,12)); }
318 | static ALWAYS_INLINE __m256i mm256_rev_epi64(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8( 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8)); }
319 | static ALWAYS_INLINE __m256i mm256_rev_si128(__m256i v) { return _mm256_shuffle_epi8(v, _mm256_setr_epi8(15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)); }
320 |
321 | static ALWAYS_INLINE __m256i mm256_rbit_epi16(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi16(v)); }
322 | static ALWAYS_INLINE __m256i mm256_rbit_epi32(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi32(v)); }
323 | static ALWAYS_INLINE __m256i mm256_rbit_epi64(__m256i v) { return mm256_rbit_epi8(mm256_rev_epi64(v)); }
324 | static ALWAYS_INLINE __m256i mm256_rbit_si128(__m256i v) { return mm256_rbit_epi8(mm256_rev_si128(v)); }
325 | #endif
326 | #endif
327 |
328 | //---------- max. bit length + transform for sorted/unsorted arrays: delta, delta 1, delta > 1, zigzag, zigzag of delta, xor, FOR ----------------
329 | #ifdef __cplusplus
330 | extern "C" {
331 | #endif
332 | //------ ORed array, for maximum bit length of the elements in the unsorted integer array ---------------------
333 | uint8_t bit8( uint8_t *in, unsigned n, uint8_t *px);
334 | uint16_t bit16(uint16_t *in, unsigned n, uint16_t *px);
335 | uint32_t bit32(uint32_t *in, unsigned n, uint32_t *px);
336 | uint64_t bit64(uint64_t *in, unsigned n, uint64_t *px);
337 |
338 | //-------------- delta = 0: Sorted integer array w/ mindelta = 0 ----------------------------------------------
339 | //-- ORed array, maximum bit length of the non decreasing integer array. out[i] = in[i] - in[i-1]
340 | uint8_t bitd8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
341 | uint16_t bitd16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
342 | uint32_t bitd32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
343 | uint64_t bitd64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
344 |
345 | //-- in-place reverse delta 0
346 | void bitddec8( uint8_t *p, unsigned n, uint8_t start); // non decreasing (out[i] = in[i] - in[i-1])
347 | void bitddec16( uint16_t *p, unsigned n, uint16_t start);
348 | void bitddec32( uint32_t *p, unsigned n, uint32_t start);
349 | void bitddec64( uint64_t *p, unsigned n, uint64_t start);
350 |
351 | //-- vectorized delta w/ stride 4: out[0] = in[4]-in[0], out[1] = in[5]-in[1], out[2] = in[6]-in[2], out[3] = in[7]-in[3],...
352 | uint16_t bits128v16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
353 | uint32_t bits128v32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
354 |
355 | //------------- delta = 1: Sorted integer array w/ mindelta = 1 ---------------------------------------------
356 | //-- get delta maximum bit length of the strictly increasing integer array. out[i] = in[i] - in[i-1] - 1
357 | uint8_t bitd18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
358 | uint16_t bitd116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
359 | uint32_t bitd132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
360 | uint64_t bitd164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
361 |
362 | //-- in-place reverse delta one
363 | void bitd1dec8(  uint8_t  *p,  unsigned n, uint8_t  start); // strictly increasing (out[i] = in[i] - in[i-1] - 1)
364 | void bitd1dec16( uint16_t *p, unsigned n, uint16_t start);
365 | void bitd1dec32( uint32_t *p, unsigned n, uint32_t start);
366 | void bitd1dec64( uint64_t *p, unsigned n, uint64_t start);
367 |
368 | //------------- delta > 1: Sorted integer array w/ mindelta > 1 ---------------------------------------------
369 | //-- ORed array, for max. bit length using the minimum delta of the sorted integer array
370 | uint8_t bitdi8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
371 | uint16_t bitdi16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
372 | uint32_t bitdi32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
373 | uint64_t bitdi64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
374 | //-- transform sorted integer array to delta array: out[i] = in[i] - in[i-1] - mindelta
375 | uint8_t bitdienc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
376 | uint16_t bitdienc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
377 | uint32_t bitdienc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
378 | uint64_t bitdienc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
379 | //-- in-place reverse delta
380 | void bitdidec8( uint8_t *in, unsigned n, uint8_t start, uint8_t mindelta);
381 | void bitdidec16(uint16_t *in, unsigned n, uint16_t start, uint16_t mindelta);
382 | void bitdidec32(uint32_t *in, unsigned n, uint32_t start, uint32_t mindelta);
383 | void bitdidec64(uint64_t *in, unsigned n, uint64_t start, uint64_t mindelta);
384 |
385 | //------------- FOR : array bit length: ---------------------------------------------------------------------
386 | //------ ORed array, for max. bit length of the non decreasing integer array. out[i] = in[i] - start
387 | uint8_t bitf8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
388 | uint16_t bitf16(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
389 | uint32_t bitf32(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
390 | uint64_t bitf64(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
391 |
392 | //------ ORed array, for max. bit length of the strictly increasing integer array. out[i] = in[i] - 1 - start
393 | uint8_t bitf18( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
394 | uint16_t bitf116(uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
395 | uint32_t bitf132(uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
396 | uint64_t bitf164(uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
397 |
398 | //------ ORed array, for max. bit length of the unsorted array
399 | uint8_t bitfm8( uint8_t *in, unsigned n, uint8_t *px, uint8_t *pmin); // unsorted
400 | uint16_t bitfm16(uint16_t *in, unsigned n, uint16_t *px, uint16_t *pmin);
401 | uint32_t bitfm32(uint32_t *in, unsigned n, uint32_t *px, uint32_t *pmin);
402 | uint64_t bitfm64(uint64_t *in, unsigned n, uint64_t *px, uint64_t *pmin);
403 |
404 | //------------- Zigzag encoding for unsorted integer lists: out[i] = zigzag(in[i] - in[i-1]) ------------------------
405 | //-- ORed array, to get maximum zigzag bit length integer array
406 | uint8_t bitz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
407 | uint16_t bitz16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
408 | uint32_t bitz32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
409 | uint64_t bitz64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
410 | //-- Zigzag transform
411 | uint8_t bitzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
412 | uint16_t bitzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
413 | uint32_t bitzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
414 | uint64_t bitzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
415 | //-- in-place zigzag reverse transform
416 | void bitzdec8( uint8_t *in, unsigned n, uint8_t start);
417 | void bitzdec16( uint16_t *in, unsigned n, uint16_t start);
418 | void bitzdec32( uint32_t *in, unsigned n, uint32_t start);
419 | void bitzdec64( uint64_t *in, unsigned n, uint64_t start);
420 |
421 | //------------- Zigzag of zigzag/delta : unsorted/sorted integer array ----------------------------------------------------
422 | //-- ORed array, for max. bit length of the zigzag of delta transformed array
423 | uint8_t bitzz8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
424 | uint16_t bitzz16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
425 | uint32_t bitzz32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
426 | uint64_t bitzz64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
427 |
428 | uint8_t bitzzenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start, uint8_t mindelta);
429 | uint16_t bitzzenc16(uint16_t *in, unsigned n, uint16_t *out, uint16_t start, uint16_t mindelta);
430 | uint32_t bitzzenc32(uint32_t *in, unsigned n, uint32_t *out, uint32_t start, uint32_t mindelta);
431 | uint64_t bitzzenc64(uint64_t *in, unsigned n, uint64_t *out, uint64_t start, uint64_t mindelta);
432 |
433 | //-- in-place reverse zigzag of delta (encoded w/ bitdiencNN and parameter mindelta = 1)
434 | void bitzzdec8(  uint8_t  *in, unsigned n, uint8_t  start); // in-place reverse of the zigzag of delta transform
435 | void bitzzdec16( uint16_t *in, unsigned n, uint16_t start);
436 | void bitzzdec32( uint32_t *in, unsigned n, uint32_t start);
437 | void bitzzdec64( uint64_t *in, unsigned n, uint64_t start);
438 |
439 | //------------- XOR encoding for unsorted integer lists: out[i] = in[i] ^ in[i-1] -------------
440 | //-- ORed array, to get maximum xor bit length integer array
441 | uint8_t bitx8( uint8_t *in, unsigned n, uint8_t *px, uint8_t start);
442 | uint16_t bitx16( uint16_t *in, unsigned n, uint16_t *px, uint16_t start);
443 | uint32_t bitx32( uint32_t *in, unsigned n, uint32_t *px, uint32_t start);
444 | uint64_t bitx64( uint64_t *in, unsigned n, uint64_t *px, uint64_t start);
445 |
446 | //-- XOR transform
447 | uint8_t bitxenc8( uint8_t *in, unsigned n, uint8_t *out, uint8_t start);
448 | uint16_t bitxenc16( uint16_t *in, unsigned n, uint16_t *out, uint16_t start);
449 | uint32_t bitxenc32( uint32_t *in, unsigned n, uint32_t *out, uint32_t start);
450 | uint64_t bitxenc64( uint64_t *in, unsigned n, uint64_t *out, uint64_t start);
451 |
452 | //-- XOR in-place reverse transform
453 | void bitxdec8( uint8_t *p, unsigned n, uint8_t start);
454 | void bitxdec16( uint16_t *p, unsigned n, uint16_t start);
455 | void bitxdec32( uint32_t *p, unsigned n, uint32_t start);
456 | void bitxdec64( uint64_t *p, unsigned n, uint64_t start);
457 |
458 | //------- Lossy floating point transform: pad the trailing mantissa bits with zeros according to the error e (ex. e=0.00001)
459 | #ifdef USE_FLOAT16
460 | void fppad16(_Float16 *in, size_t n, _Float16 *out, float e);
461 | #endif
462 | void fppad32(float *in, size_t n, float *out, float e);
463 | void fppad64(double *in, size_t n, double *out, double e);
464 |
465 | #ifdef __cplusplus
466 | }
467 | #endif
468 |
469 | //---- Floating point to Integer decomposition ---------------------------------
470 | // seeeeeeee21098765432109876543210 (s:sign, e:exponent, 0-9:mantissa)
471 | #ifdef BITUTIL_IN
472 | #define MANTF32 23
473 | #define MANTF64 52
474 |
475 | #define BITFENC(_u_, _sgn_, _expo_, _mant_, _mantbits_, _one_) _sgn_ = _u_ >> (sizeof(_u_)*8-1); _expo_ = ((_u_ >> (_mantbits_)) & ( (_one_<<(sizeof(_u_)*8 - 1 - _mantbits_)) -1)); _mant_ = _u_ & ((_one_<<_mantbits_)-1);
476 | #define BITFDEC( _sgn_, _expo_, _mant_, _u_, _mantbits_) _u_ = (_sgn_) << (sizeof(_u_)*8-1) | (_expo_) << _mantbits_ | (_mant_)
477 | #endif
478 |
--------------------------------------------------------------------------------
/conf.h:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 |
25 | // conf.h - config & common
26 | #ifndef CONF_H
27 | #define CONF_H
28 | //------------------------- Compiler ------------------------------------------
29 | #if defined(__GNUC__)
30 | #include <stdint.h>
31 | #define ALIGNED(t,v,n) t v __attribute__ ((aligned (n)))
32 | #define ALWAYS_INLINE inline __attribute__((always_inline))
33 | #define NOINLINE __attribute__((noinline))
34 | #define _PACKED __attribute__ ((packed))
35 | #define likely(x) __builtin_expect((x),1)
36 | #define unlikely(x) __builtin_expect((x),0)
37 |
38 | #define popcnt32(_x_) __builtin_popcount(_x_)
39 | #define popcnt64(_x_) __builtin_popcountll(_x_)
40 |
41 | #if defined(__i386__) || defined(__x86_64__)
42 | //__bsr32: 1:0,2:1,3:1,4:2,5:2,6:2,7:2,8:3,9:3,10:3,11:3,12:3,13:3,14:3,15:3,16:4,17:4,18:4,19:4,20:4,21:4,22:4,23:4,24:4,25:4,26:4,27:4,28:4,29:4,30:4,31:4,32:5
43 | // bsr32: 0:0,1:1,2:2,3:2,4:3,5:3,6:3,7:3,8:4,9:4,10:4,11:4,12:4,13:4,14:4,15:4,16:5,17:5,18:5,19:5,20:5,21:5,22:5,23:5,24:5,25:5,26:5,27:5,28:5,29:5,30:5,31:5,32:6,
44 | static inline int __bsr32( int x) { asm("bsr %1,%0" : "=r" (x) : "rm" (x) ); return x; }
45 | static inline int bsr32( int x) { int b = -1; asm("bsrl %1,%0" : "+r" (b) : "rm" (x) ); return b + 1; }
46 | static inline int bsr64(uint64_t x) { return x?64 - __builtin_clzll(x):0; }
47 |
48 | static inline unsigned rol32(unsigned x, int s) { asm ("roll %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
49 | static inline unsigned ror32(unsigned x, int s) { asm ("rorl %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
50 | static inline uint64_t rol64(uint64_t x, int s) { asm ("rolq %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
51 | static inline uint64_t ror64(uint64_t x, int s) { asm ("rorq %%cl,%0" :"=r" (x) :"0" (x),"c" (s)); return x; }
52 | #else
53 | static inline int __bsr32(unsigned x ) { return 31 - __builtin_clz( x); }
54 | static inline int bsr32(int x ) { return x?32 - __builtin_clz( x):0; }
55 | static inline int bsr64(uint64_t x) { return x?64 - __builtin_clzll(x):0; }
56 |
57 | static inline unsigned rol32(unsigned x, int s) { return x << s | x >> (32 - s); }
58 | static inline unsigned ror32(unsigned x, int s) { return x >> s | x << (32 - s); }
59 | static inline uint64_t rol64(uint64_t x, int s) { return x << s | x >> (64 - s); }
60 | static inline uint64_t ror64(uint64_t x, int s) { return x >> s | x << (64 - s); }
61 | #endif
62 |
63 | #define ctz64(_x_) __builtin_ctzll(_x_)
64 | #define ctz32(_x_) __builtin_ctz(_x_)   // 0:32  ctz32(1<<b) = b
65 |
66 | #define clz32(_x_) __builtin_clz(_x_)
67 | #define clz64(_x_) __builtin_clzll(_x_)
68 |
69 | #if __GNUC__ > 4 || __GNUC__ == 4 && __GNUC_MINOR__ >= 8
70 | #define bswap16(x) __builtin_bswap16(x)
71 | #else
72 | static inline unsigned short bswap16(unsigned short x) { return __builtin_bswap32(x << 16); }
73 | #endif
74 | #define bswap32(x) __builtin_bswap32(x)
75 | #define bswap64(x) __builtin_bswap64(x)
76 |
77 | #elif _MSC_VER //----------------------------------------------------
78 | #include <windows.h>
79 | #include <intrin.h>
80 | #if _MSC_VER < 1600
81 | #include "vs/stdint.h"
82 | #define __builtin_prefetch(x,a)
83 | #define inline __inline
84 | #else
85 | #include <stdint.h>
86 | #define __builtin_prefetch(x,a) _mm_prefetch(x, _MM_HINT_NTA)
87 | #endif
88 |
89 | #define ALIGNED(t,v,n) __declspec(align(n)) t v
90 | #define ALWAYS_INLINE __forceinline
91 | #define NOINLINE __declspec(noinline)
92 | #define THREADLOCAL __declspec(thread)
93 | #define likely(x) (x)
94 | #define unlikely(x) (x)
95 |
96 | static inline int __bsr32(unsigned x) { unsigned long z=0; _BitScanReverse(&z, x); return z; }
97 | static inline int bsr32( unsigned x) { unsigned long z; _BitScanReverse(&z, x); return x?z+1:0; }
98 | static inline int ctz32( unsigned x) { unsigned long z; _BitScanForward(&z, x); return x?z:32; }
99 | static inline int clz32( unsigned x) { unsigned long z; _BitScanReverse(&z, x); return x?31-z:32; }
100 | #if !defined(_M_ARM64) && !defined(_M_X64)
101 | static inline unsigned char _BitScanForward64(unsigned long* ret, uint64_t x) {
102 | unsigned long x0 = (unsigned long)x, top, bottom; _BitScanForward(&top, (unsigned long)(x >> 32)); _BitScanForward(&bottom, x0);
103 | *ret = x0 ? bottom : 32 + top; return x != 0;
104 | }
105 | static inline unsigned char _BitScanReverse64(unsigned long* ret, uint64_t x) {
106 | unsigned long x1 = (unsigned long)(x >> 32), top, bottom; _BitScanReverse(&top, x1); _BitScanReverse(&bottom, (unsigned long)x);
107 | *ret = x1 ? top + 32 : bottom; return x != 0;
108 | }
109 | #endif
110 | static inline int bsr64(uint64_t x) { unsigned long z=0; _BitScanReverse64(&z, x); return x?z+1:0; }
111 | static inline int ctz64(uint64_t x) { unsigned long z; _BitScanForward64(&z, x); return x?z:64; }
112 | static inline int clz64(uint64_t x) { unsigned long z; _BitScanReverse64(&z, x); return x?63-z:64; }
113 |
114 | #define rol32(x,s) _lrotl(x, s)
115 | #define ror32(x,s) _lrotr(x, s)
116 |
117 | #define bswap16(x) _byteswap_ushort(x)
118 | #define bswap32(x) _byteswap_ulong(x)
119 | #define bswap64(x) _byteswap_uint64(x)
120 |
121 | #define popcnt32(x) __popcnt(x)
122 | #ifdef _WIN64
123 | #define popcnt64(x) __popcnt64(x)
124 | #else
125 | #define popcnt64(x) (popcnt32(x) + popcnt32(x>>32))
126 | #endif
127 |
128 | #define sleep(x)     Sleep((x)*1000)
129 | #define fseeko _fseeki64
130 | #define ftello _ftelli64
131 | #define strcasecmp _stricmp
132 | #define strncasecmp _strnicmp
133 | #define strtoull _strtoui64
134 | static inline double round(double num) { return (num > 0.0) ? floor(num + 0.5) : ceil(num - 0.5); }
135 | #endif
136 |
137 | #define bsr8(_x_) bsr32(_x_)
138 | #define bsr16(_x_) bsr32(_x_)
139 | #define ctz8(_x_) ctz32(_x_)
140 | #define ctz16(_x_) ctz32(_x_)
141 | #define clz8(_x_) (clz32(_x_)-24)
142 | #define clz16(_x_) (clz32(_x_)-16)
143 |
144 | #define popcnt8(x) popcnt32(x)
145 | #define popcnt16(x) popcnt32(x)
146 |
147 | //--------------- Unaligned memory access -------------------------------------
148 | /*# || defined(i386) || defined(_X86_) || defined(__THW_INTEL)*/
149 | #if defined(__i386__) || defined(__x86_64__) || \
150 | defined(_M_IX86) || defined(_M_AMD64) || _MSC_VER ||\
151 | defined(__powerpc__) ||\
152 | defined(__ARM_FEATURE_UNALIGNED) || defined(__aarch64__) || defined(__arm__) ||\
153 | defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__) || \
154 | defined(__ARM_ARCH_5__) || defined(__ARM_ARCH_5T__) || defined(__ARM_ARCH_5TE__) || defined(__ARM_ARCH_5TEJ__) || \
155 | defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6T2__) || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__)
156 | #define ctou16(_cp_) (*(unsigned short *)(_cp_))
157 | #define ctou32(_cp_) (*(unsigned *)(_cp_))
158 | #define ctof32(_cp_) (*(float *)(_cp_))
159 |
160 | #if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || defined(_MSC_VER)
161 | #define ctou64(_cp_) (*(uint64_t *)(_cp_))
162 | #define ctof64(_cp_) (*(double *)(_cp_))
163 | #elif defined(__ARM_FEATURE_UNALIGNED)
164 | struct _PACKED longu { uint64_t l; };
165 | struct _PACKED doubleu { double d; };
166 | #define ctou64(_cp_) ((struct longu *)(_cp_))->l
167 | #define ctof64(_cp_) ((struct doubleu *)(_cp_))->d
168 | #endif
169 |
170 | #elif defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7S__)
171 | struct _PACKED shortu { unsigned short s; };
172 | struct _PACKED unsignedu { unsigned u; };
173 | struct _PACKED longu { uint64_t l; };
174 | struct _PACKED floatu { float f; };
175 | struct _PACKED doubleu { double d; };
176 |
177 | #define ctou16(_cp_) ((struct shortu *)(_cp_))->s
178 | #define ctou32(_cp_) ((struct unsignedu *)(_cp_))->u
179 | #define ctou64(_cp_) ((struct longu *)(_cp_))->l
180 | #define ctof32(_cp_) ((struct floatu *)(_cp_))->f
181 | #define ctof64(_cp_) ((struct doubleu *)(_cp_))->d
182 | #else
183 | #error "unknown cpu"
184 | #endif
185 |
186 | #ifdef ctou16
187 | //#define utoc16(_x_,_cp_) ctou16(_cp_) = _x_
188 | #else
189 | static inline unsigned short ctou16(void *cp) { unsigned short x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
190 | //static inline void utoc16(unsigned short x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
191 | #endif
192 |
193 | #ifdef ctou32
194 | //#define utoc32(_x_,_cp_) ctou32(_cp_) = _x_
195 | #else
196 | static inline unsigned       ctou32(void *cp) { unsigned       x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
197 | //static inline void utoc32(unsigned x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
198 | #endif
199 |
200 | #ifdef ctou64
201 | //#define utoc64(_x_,_cp_) ctou64(_cp_) = _x_
202 | #else
203 | static inline uint64_t ctou64(void *cp) { uint64_t x; memcpy((void *)&x, cp, (unsigned int)sizeof(x)); return x; }
204 | //static inline void utoc64(uint64_t x, void *cp ) { memcpy(cp, &x, sizeof(x)); }
205 | #endif
206 |
207 | #define ctou24(_cp_) (ctou32(_cp_) & 0xffffff)
208 | #define ctou48(_cp_) (ctou64(_cp_) & 0xffffffffffffull)
209 | #define ctou8(_cp_) (*(_cp_))
210 | //--------------------- wordsize ----------------------------------------------
211 | #if defined(__64BIT__) || defined(_LP64) || defined(__LP64__) || defined(_WIN64) ||\
212 | defined(__x86_64__) || defined(_M_X64) ||\
213 | defined(__ia64) || defined(_M_IA64) ||\
214 | defined(__aarch64__) ||\
215 | defined(__mips64) ||\
216 | defined(__powerpc64__) || defined(__ppc64__) || defined(__PPC64__) ||\
217 | defined(__s390x__)
218 | #define __WORDSIZE 64
219 | #else
220 | #define __WORDSIZE 32
221 | #endif
222 | #endif
223 |
224 | //---------------------misc ---------------------------------------------------
225 | #define BZHI64(_u_, _b_) ((_u_) & ((1ull<<(_b_))-1))
226 | #define BZHI32(_u_, _b_) ((_u_) & ((1u <<(_b_))-1))
227 | #define BZHI16(_u_, _b_) BZHI32(_u_, _b_)
228 | #define BZHI8(_u_, _b_) BZHI32(_u_, _b_)
229 |
230 | #define SIZE_ROUNDUP(_n_, _a_) (((size_t)(_n_) + (size_t)((_a_) - 1)) & ~(size_t)((_a_) - 1))
231 | #define ALIGN_DOWN(__ptr, __a) ((void *)((uintptr_t)(__ptr) & ~(uintptr_t)((__a) - 1)))
232 |
233 | #define TEMPLATE2_(_x_, _y_) _x_##_y_
234 | #define TEMPLATE2(_x_, _y_) TEMPLATE2_(_x_,_y_)
235 |
236 | #define TEMPLATE3_(_x_,_y_,_z_) _x_##_y_##_z_
237 | #define TEMPLATE3(_x_,_y_,_z_) TEMPLATE3_(_x_, _y_, _z_)
238 |
239 | #define CACHE_LINE_SIZE 64
240 | #define PREFETCH_DISTANCE (CACHE_LINE_SIZE*4)
241 | //--- NDEBUG -------
242 | #include <stdio.h>
243 | #ifdef _MSC_VER
244 | #ifdef NDEBUG
245 | #define AS(expr, fmt, ...)
246 | #define AC(expr, fmt, ...) do { if(!(expr)) { fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0)
247 | #define die(fmt, ...) do { fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); exit(-1); } while(0)
248 | #else
249 | #define AS(expr, fmt, ...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0)
250 | #define AC(expr, fmt, ...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); abort(); } } while(0)
251 | #define die(fmt, ...) do { fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ##__VA_ARGS__ ); fflush(stderr); exit(-1); } while(0)
252 | #endif
253 | #else
254 | #ifdef NDEBUG
255 | #define AS(expr, fmt,args...)
256 | #define AC(expr, fmt,args...) do { if(!(expr)) { fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0)
257 | #define die(fmt,args...) do { fprintf(stderr, fmt, ## args ); fflush(stderr); exit(-1); } while(0)
258 | #else
259 | #define AS(expr, fmt,args...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0)
260 | #define AC(expr, fmt,args...) do { if(!(expr)) { fflush(stdout);fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); abort(); } } while(0)
261 | #define die(fmt,args...) do { fprintf(stderr, "%s:%s:%d:", __FILE__, __FUNCTION__, __LINE__); fprintf(stderr, fmt, ## args ); fflush(stderr); exit(-1); } while(0)
262 | #endif
263 | #endif
264 |
265 |
--------------------------------------------------------------------------------
/makefile:
--------------------------------------------------------------------------------
1 | # powturbo (c) Copyright 2013-2019
2 | # ----------- Downloading + Compiling ----------------------
3 | # Download or clone TurboTranspose:
4 | # git clone git://github.com/powturbo/TurboTranspose.git
5 | # make
6 |
7 | # Linux: "export CC=clang" "export CXX=clang++". Windows MinGW: "set CC=gcc" "set CXX=g++" or uncomment the CC,CXX lines below
8 | CC ?= gcc
9 | CXX ?= g++
10 | #CC=clang-8
11 | #CXX=clang++-8
12 |
13 | #CC = gcc-8
14 | #CXX = g++-8
15 |
16 | #CC=powerpc64le-linux-gnu-gcc
17 | #CXX=powerpc64le-linux-gnu-g++
18 |
19 | DDEBUG=-DNDEBUG -s
20 | #DDEBUG=-g
21 |
22 | ifneq (,$(filter Windows%,$(OS)))
23 | OS := Windows
24 | CFLAGS+=-D__int64_t=int64_t
25 | else
26 | OS := $(shell uname -s)
27 | ARCH := $(shell uname -m)
28 | ifneq (,$(findstring powerpc64le,$(CC)))
29 | ARCH = ppc64le
30 | endif
31 | ifneq (,$(findstring aarch64,$(CC)))
32 | ARCH = aarch64
33 | endif
34 | endif
35 |
36 | #------ ARMv8
37 | ifeq ($(ARCH),aarch64)
38 | CFLAGS+=-march=armv8-a
39 | ifneq (,$(findstring clang, $(CC)))
40 | MSSE=-O3 -mcpu=cortex-a72 -falign-loops -fomit-frame-pointer
41 | else
42 | MSSE=-O3 -mcpu=cortex-a72 -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer
43 | endif
44 |
45 | else
46 | # ----- Power9
47 | ifeq ($(ARCH),ppc64le)
48 | MSSE=-D__SSE__ -D__SSE2__ -D__SSE3__ -D__SSSE3__
49 | MARCH=-march=power9 -mtune=power9
50 | CFLAGS+=-DNO_WARN_X86_INTRINSICS
51 | CXXFLAGS+=-DNO_WARN_X86_INTRINSICS
52 | #------ x86_64 : minimum SSE = Sandy Bridge, AVX2 = haswell
53 | else
54 | MSSE=-march=corei7-avx -mtune=corei7-avx
55 | # -mno-avx -mno-aes (add for Pentium based Sandy bridge)
56 | CFLAGS+=-mssse3
57 | MAVX2=-march=haswell
58 | endif
59 | endif
60 |
61 | ifeq ($(OS),$(filter $(OS),Linux Darwin GNU/kFreeBSD GNU OpenBSD FreeBSD DragonFly NetBSD MSYS_NT Haiku))
62 | #LDFLAGS+=-lpthread -lm
63 | ifneq ($(OS),Darwin)
64 | LDFLAGS+=-lrt
65 | endif
66 | endif
67 |
68 | # Minimum CPU architecture
69 | #MARCH=-march=native
70 | MARCH=$(MSSE)
71 |
72 | ifeq ($(AVX2),1)
73 | MARCH+=-mbmi2 -mavx2
74 | CFLAGS+=-DUSE_AVX2
75 | CXXFLAGS+=-DUSE_AVX2
76 | else
77 | AVX2=0
78 | endif
79 |
80 | #----------------------------------------------
81 | ifeq ($(STATIC),1)
82 | LDFLAGS+=-static
83 | endif
84 |
85 | #---------------------- make args --------------------------
86 | ifeq ($(BLOSC),1)
87 | DEFS+=-DBLOSC
88 | endif
89 |
90 | ifeq ($(LZ4),1)
91 | CFLAGS+=-DLZ4 -Ilz4/lib
92 | endif
93 |
94 | ifeq ($(BITSHUFFLE),1)
95 | CFLAGS+=-DBITSHUFFLE -Iext/bitshuffle/lz4
96 | endif
97 |
98 | OB=transpose.o tpbench.o
99 |
100 | ifneq ($(NSIMD),1)
101 | OB+=transpose_sse.o
102 | CFLAGS+=-DUSE_SSE
103 |
104 | ifeq ($(AVX2),1)
105 | MARCH+=-mavx2 -mbmi2
106 | CFLAGS+=-DUSE_AVX2
107 | OB+=transpose_avx2.o
108 | endif
109 | endif
110 |
111 | CFLAGS+=$(DDEBUG) -w -Wall -std=gnu99 -DUSE_THREADS -fstrict-aliasing -Iext $(DEFS)
112 | CXXFLAGS+=$(DDEBUG) -w -fpermissive -Wall -fno-rtti -Iext/FastPFor/headers $(DEFS)
113 |
114 |
115 | all: tpbench
116 |
117 | transpose.o: transpose.c
118 | $(CC) -O3 $(CFLAGS) $(COPT) -c -DUSE_SSE -falign-loops transpose.c -o transpose.o
119 |
120 | transpose_sse.o: transpose.c
121 | $(CC) -O3 $(CFLAGS) $(COPT) -DSSE2_ON $(MSSE) -falign-loops -c transpose.c -o transpose_sse.o
122 |
123 | transpose_avx2.o: transpose.c
124 | $(CC) -O3 $(CFLAGS) $(COPT) -DAVX2_ON $(MAVX2) -falign-loops -c transpose.c -o transpose_avx2.o
125 |
126 |
127 | #-------- BLOSC + BitShuffle -----------------------
128 | ifeq ($(BLOSC),1)
129 | LDFLAGS+=-lpthread
130 |
131 | CFLAGS+=-DBLOSC
132 | #-DPREFER_EXTERNAL_LZ4=ON -DHAVE_LZ4 -DHAVE_LZ4HC -Ibitshuffle/lz4
133 |
134 | c-blosc2/blosc/shuffle-sse2.o: c-blosc2/blosc/shuffle-sse2.c
135 | $(CC) -O3 $(CFLAGS) -msse2 -c c-blosc2/blosc/shuffle-sse2.c -o c-blosc2/blosc/shuffle-sse2.o
136 |
137 | c-blosc2/blosc/shuffle-generic.o: c-blosc2/blosc/shuffle-generic.c
138 | $(CC) -O3 $(CFLAGS) -c c-blosc2/blosc/shuffle-generic.c -o c-blosc2/blosc/shuffle-generic.o
139 |
140 | c-blosc2/blosc/shuffle-avx2.o: c-blosc2/blosc/shuffle-avx2.c
141 | $(CC) -O3 $(CFLAGS) -mavx2 -c c-blosc2/blosc/shuffle-avx2.c -o c-blosc2/blosc/shuffle-avx2.o
142 |
143 | c-blosc2/blosc/shuffle-neon.o: c-blosc2/blosc/shuffle-neon.c
144 | $(CC) -O3 $(CFLAGS) -flax-vector-conversions -c c-blosc2/blosc/shuffle-neon.c -o c-blosc2/blosc/shuffle-neon.o
145 |
146 | c-blosc2/blosc/bitshuffle-neon.o: c-blosc2/blosc/bitshuffle-neon.c
147 | $(CC) -O3 $(CFLAGS) -flax-vector-conversions -c c-blosc2/blosc/bitshuffle-neon.c -o c-blosc2/blosc/bitshuffle-neon.o
148 |
149 | OB+=c-blosc2/blosc/blosc2.o c-blosc2/blosc/blosclz.o c-blosc2/blosc/shuffle.o c-blosc2/blosc/shuffle-generic.o \
150 | c-blosc2/blosc/bitshuffle-generic.o c-blosc2/blosc/btune.o c-blosc2/blosc/fastcopy.o c-blosc2/blosc/delta.o c-blosc2/blosc/timestamp.o c-blosc2/blosc/trunc-prec.o
151 |
152 | ifeq ($(AVX2),1)
153 | CFLAGS+=-DSHUFFLE_AVX2_ENABLED
154 | OB+=c-blosc2/blosc/shuffle-avx2.o c-blosc2/blosc/bitshuffle-avx2.o
155 | endif
156 | ifeq ($(ARCH),aarch64)
157 | CFLAGS+=-DSHUFFLE_NEON_ENABLED
158 | OB+=c-blosc2/blosc/shuffle-neon.o c-blosc2/blosc/bitshuffle-neon.o
159 | else
160 | CFLAGS+=-DSHUFFLE_SSE2_ENABLED
161 | OB+=c-blosc2/blosc/bitshuffle-sse2.o c-blosc2/blosc/shuffle-sse2.o
162 | endif
163 |
164 | else
165 |
166 | ifeq ($(BITSHUFFLE),1)
167 | CFLAGS+=-DBITSHUFFLE -Ibitshuffle/lz4 -DLZ4_ON
168 |
169 | ifeq ($(ARCH),aarch64)
170 | CFLAGS+=-DUSEARMNEON
171 | else
172 | ifeq ($(AVX2),1)
173 | CFLAGS+=-DUSEAVX2
174 | endif
175 | endif
176 |
177 | OB+=bitshuffle/src/bitshuffle.o bitshuffle/src/iochain.o bitshuffle/src/bitshuffle_core.o
178 | OB+=bitshuffle/lz4/lz4.o
179 | endif
180 |
181 | endif
182 | #---------------
183 |
184 | tpbench: $(OB) tpbench.o transpose.o
185 | $(CC) $^ $(LDFLAGS) -o tpbench
186 |
187 | .c.o:
188 | $(CC) -O3 $(MARCH) $(CFLAGS) $< -c -o $@
189 |
190 | ifeq ($(OS),Windows_NT)
191 | clean:
192 | del /S *.o
193 | del /S *.exe
194 | else
195 | clean:
196 | @find . -type f -name "*\.o" -delete -or -name "*\~" -delete -or -name "core" -delete
197 | endif
198 |
199 |
--------------------------------------------------------------------------------
/makefile.vs:
--------------------------------------------------------------------------------
1 | # powturbo (c) Copyright 2015-2018
2 | # nmake /f makefile.vs
3 | # or
4 | # nmake "AVX2=1" /f makefile.vs
5 |
6 | .SUFFIXES: .c .obj .sobj
7 |
8 | CC = cl
9 | LD = link
10 | AR = lib
11 | CFLAGS = /MD /O2 -I.
12 |
13 | LIB_LIB = libtp.lib
14 | LIB_DLL = tp.dll
15 | LIB_IMP = tp.lib
16 |
17 | OBJS = transpose.obj
18 |
19 | !if "$(NSIMD)" == "1"
20 | !else
21 | OBJS = $(OBJS) transpose_sse.obj
22 | CFLAGS = $(CFLAGS) /DUSE_SSE /D__SSE2__
23 |
24 | !IF "$(AVX2)" == "1"
25 | CFLAGS = $(CFLAGS) /DUSE_AVX2
26 | OBJS = $(OBJS) transpose_avx2.obj
27 | !endif
28 |
29 | !endif
30 |
31 | DLL_OBJS = $(OBJS:.obj=.sobj)
32 |
33 | all: $(LIB_LIB) $(LIB_DLL) tpbench.exe tpbenchdll.exe
34 |
35 | #$(LIB_DLL): $(LIB_IMP)
36 |
37 | transpose.obj: transpose.c
38 | $(CC) /O2 $(CFLAGS) /DUSE_SSE -c transpose.c /Fotranspose.obj
39 |
40 | transpose_sse.obj: transpose.c
41 | $(CC) /O2 $(CFLAGS) /DSSE2_ON /D__SSE2__ /arch:SSE2 /c transpose.c /Fotranspose_sse.obj
42 |
43 | transpose_avx2.obj: transpose.c
44 | 	$(CC) /O2 $(CFLAGS) /DAVX2_ON /D__AVX2__ /arch:AVX2 /c transpose.c /Fotranspose_avx2.obj
45 |
46 | transpose.sobj: transpose.c
47 | $(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DUSE_SSE -c transpose.c /Fotranspose.sobj
48 |
49 | transpose_sse.sobj: transpose.c
50 | $(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DSSE2_ON /D__SSE2__ /arch:SSE2 /c transpose.c /Fotranspose_sse.sobj
51 |
52 | transpose_avx2.sobj: transpose.c
53 | 	$(CC) /O2 $(CFLAGS) /DLIB_DLL=1 /DAVX2_ON /D__AVX2__ /arch:AVX2 /c transpose.c /Fotranspose_avx2.sobj
54 |
55 | tpbench.sobj: tpbench.c
56 | $(CC) /O2 $(CFLAGS) /DLIB_DLL -c tpbench.c /Fotpbench.sobj
57 |
58 | .c.obj:
59 | $(CC) -c /Fo$@ /O2 $(CFLAGS) $**
60 |
61 | .c.sobj:
62 | $(CC) -c /Fo$@ /O2 $(CFLAGS) /DLIB_DLL $**
63 |
64 | $(LIB_LIB): $(OBJS)
65 | $(AR) $(ARFLAGS) -out:$@ $(OBJS)
66 |
67 | $(LIB_DLL): $(DLL_OBJS)
68 | $(LD) $(LDFLAGS) -out:$@ -dll -implib:$(LIB_IMP) $(DLL_OBJS)
69 |
70 | $(LIB_IMP): $(LIB_DLL)
71 |
72 | tpbench.exe: tpbench.obj vs/getopt.obj $(LIB_LIB)
73 | $(LD) $(LDFLAGS) -out:$@ $**
74 |
75 | tpbenchdll.exe: tpbench.sobj vs/getopt.obj
76 | $(LD) $(LDFLAGS) -out:$@ $** tp.lib
77 |
78 | clean:
79 | -del *.dll *.exe *.exp *.lib *.obj *.sobj 2>nul
80 |
--------------------------------------------------------------------------------
/sse_neon.h:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // intel sse to arm neon
25 |
26 | #ifndef _SSE_NEON_H_
27 | #define _SSE_NEON_H_
28 | #include "conf.h"
29 |
30 | #ifdef __ARM_NEON //--------------------------------------------------------------------------------------------------
31 | #include <arm_neon.h>
32 | #define __m128i uint32x4_t
33 |
34 | //#define USE_MACROS
35 | #define uint8x16_to_8x8x2(_a_) ((uint8x8x2_t) { vget_low_u8(_a_), vget_high_u8(_a_) })
36 |
37 | #ifdef USE_MACROS //---------------------------- Set : _mm_set_epi/_mm_set1_epi ----------------------------------------------------------
38 | #define _mm_set_epi8(u15,u14,u13,u12,\
39 | u11,u10, u9, u8,\
40 | u7,u6,u5,u4,\
41 | u3,u2,u1,u0) ({ uint8_t __attribute__((aligned(16))) _u[16] = { u0,u1,u2,u3,u4,u5,u6,u7,u8,u9,u10,u11,u12,u13,u14,u15 }; (uint32x4_t)vld1q_u8( _u);})
42 | #define _mm_set_epi16( u7,u6,u5,u4,\
43 | u3,u2,u1,u0) ({ uint16_t __attribute__((aligned(16))) _u[ 8] = { u0,u1,u2,u3,u4,u5,u6,u7 }; (uint32x4_t)vld1q_u16(_u);})
44 | //#define _mm_set_epi32( u3,u2,u1,u0) ({ uint32_t __attribute__((aligned(16))) _u[ 4] = { u0,u1,u2,u3 }; vld1q_u32(_u);})
45 | //#define _mm_set_epi64x( u1,u0) ({ uint64_t __attribute__((aligned(16))) _u[ 2] = { u0,u1 }; (uint32x4_t)vld1q_u64(_u);})
46 | #define _mm_set_epi32(u3, u2, u1, u0) vcombine_u32(vcreate_u32((uint64_t)u1 << 32 | u0), vcreate_u32((uint64_t)u3 << 32 | u2))
47 | #define _mm_set_epi64x(u1, u0) (__m128i)vcombine_u64(vcreate_u64(u0), vcreate_u64(u1))
48 | #else
49 | static ALWAYS_INLINE __m128i _mm_set_epi8( uint8_t u15, uint8_t u14, uint8_t u13, uint8_t u12, uint8_t u11, uint8_t u10, uint8_t u9, uint8_t u8,
50 | uint8_t u7, uint8_t u6, uint8_t u5, uint8_t u4,
51 | uint8_t u3, uint8_t u2, uint8_t u1, uint8_t u0) {
52 | uint8_t __attribute__((aligned(16))) u[16] = { u0,u1,u2,u3,u4,u5,u6,u7,u8,u9,u10,u11,u12,u13,u14,u15 }; return (uint32x4_t)vld1q_u8( u); }
53 | static ALWAYS_INLINE __m128i _mm_set_epi16( uint16_t u7, uint16_t u6, uint16_t u5, uint16_t u4,
54 | uint16_t u3, uint16_t u2, uint16_t u1, uint16_t u0) { uint16_t __attribute__((aligned(16))) u[ 8] = { u0,u1,u2,u3,u4,u5,u6,u7 }; return (uint32x4_t)vld1q_u16(u); }
55 | static ALWAYS_INLINE __m128i _mm_set_epi32( uint32_t u3, uint32_t u2, uint32_t u1, uint32_t u0) { uint32_t __attribute__((aligned(16))) u[ 4] = { u0,u1,u2,u3 }; return vld1q_u32(u); }
56 | static ALWAYS_INLINE __m128i _mm_set_epi64x( uint64_t u1, uint64_t u0) { uint64_t __attribute__((aligned(16))) u[ 2] = { u0,u1 }; return (uint32x4_t)vld1q_u64(u); }
57 | #endif
58 |
59 | #define _mm_set1_epi8( _u8_ ) (__m128i)vdupq_n_u8( _u8_ )
60 | #define _mm_set1_epi16( _u16_) (__m128i)vdupq_n_u16(_u16_)
61 | #define _mm_set1_epi32( _u32_) vdupq_n_u32(_u32_)
62 | #define _mm_set1_epi64x(_u64_) (__m128i)vdupq_n_u64(_u64_)
63 | #define _mm_setzero_si128() vdupq_n_u32( 0 )
64 | //---------------------------------------------- Arithmetic -----------------------------------------------------------------------
65 | #define _mm_add_epi8( _a_,_b_) (__m128i)vaddq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
66 | #define _mm_add_epi16( _a_,_b_) (__m128i)vaddq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
67 | #define _mm_add_epi32( _a_,_b_) vaddq_u32( _a_, _b_ )
68 | #define _mm_sub_epi16( _a_,_b_) (__m128i)vsubq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
69 | #define _mm_sub_epi32( _a_,_b_) (__m128i)vsubq_u32((uint32x4_t)(_a_), (uint32x4_t)(_b_))
70 | #define _mm_subs_epu8( _a_,_b_) (__m128i)vqsubq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
71 |
72 | #define _mm_mullo_epi32(_a_,_b_) (__m128i)vmulq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
73 | #define mm_mullo_epu32(_a_,_b_) vmulq_u32(_a_,_b_)
74 | #define _mm_mul_epu32( _a_,_b_) (__m128i)vmull_u32(vget_low_u32(_a_),vget_low_u32(_b_))
75 | #define _mm_adds_epu16( _a_,_b_) (__m128i)vqaddq_u16((uint16x8_t)(_a_),(uint16x8_t)(_b_))
76 | static ALWAYS_INLINE __m128i _mm_madd_epi16(__m128i a, __m128i b) {
77 | int32x4_t mlo = vmull_s16(vget_low_s16( (int16x8_t)a), vget_low_s16( (int16x8_t)b));
78 | int32x4_t mhi = vmull_s16(vget_high_s16((int16x8_t)a), vget_high_s16((int16x8_t)b));
79 | int32x2_t alo = vpadd_s32(vget_low_s32(mlo), vget_high_s32(mlo));
80 | int32x2_t ahi = vpadd_s32(vget_low_s32(mhi), vget_high_s32(mhi));
81 | return (__m128i)vcombine_s32(alo, ahi);
82 | }
83 | //---------------------------------------------- Special math functions -----------------------------------------------------------
84 | #define _mm_min_epu8( _a_,_b_) (__m128i)vminq_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
85 | #define _mm_min_epu16( _a_,_b_) (__m128i)vminq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
86 | #define _mm_min_epi16( _a_,_b_) (__m128i)vminq_s16((int16x8_t)(_a_), (int16x8_t)(_b_))
87 | //---------------------------------------------- Logical --------------------------------------------------------------------------
88 | #define mm_testnz_epu32(_a_) vmaxvq_u32(_a_) //vaddvq_u32(_a_)
89 | #define mm_testnz_epu8(_a_) vmaxv_u8(_a_)
90 | #define _mm_or_si128( _a_,_b_) (__m128i)vorrq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
91 | #define _mm_and_si128( _a_,_b_) (__m128i)vandq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
92 | #define _mm_xor_si128( _a_,_b_) (__m128i)veorq_u32( (uint32x4_t)(_a_), (uint32x4_t)(_b_))
93 | //---------------------------------------------- Shift ----------------------------------------------------------------------------
94 | #define _mm_slli_epi16( _a_,_m_) (__m128i)vshlq_n_u16((uint16x8_t)(_a_), _m_)
95 | #define _mm_slli_epi32( _a_,_m_) (__m128i)vshlq_n_u32((uint32x4_t)(_a_), _m_)
96 | #define _mm_slli_epi64( _a_,_m_) (__m128i)vshlq_n_u64((uint64x2_t)(_a_), _m_)
97 | #define _mm_slli_si128( _a_,_m_) (__m128i)vextq_u8(vdupq_n_u8(0), (uint8x16_t)(_a_), 16 - (_m_) ) // _m_: 1 - 15
98 |
99 | #define _mm_srli_epi16( _a_,_m_) (__m128i)vshrq_n_u16((uint16x8_t)(_a_), _m_)
100 | #define _mm_srli_epi32( _a_,_m_) (__m128i)vshrq_n_u32((uint32x4_t)(_a_), _m_)
101 | #define _mm_srli_epi64( _a_,_m_) (__m128i)vshrq_n_u64((uint64x2_t)(_a_), _m_)
102 | #define _mm_srli_si128( _a_,_m_) (__m128i)vextq_s8((int8x16_t)(_a_), vdupq_n_s8(0), (_m_))
103 |
104 | #define _mm_srai_epi16( _a_,_m_) (__m128i)vshrq_n_s16((int16x8_t)(_a_), _m_)
105 | #define _mm_srai_epi32( _a_,_m_) (__m128i)vshrq_n_s32((int32x4_t)(_a_), _m_)
106 | #define _mm_srai_epi64( _a_,_m_) (__m128i)vshrq_n_s64((int64x2_t)(_a_), _m_)
107 |
108 | #define _mm_sllv_epi32( _a_,_b_) (__m128i)vshlq_u32((uint32x4_t)(_a_), (uint32x4_t)(_b_))
109 | #define _mm_srlv_epi32( _a_,_b_) (__m128i)vshlq_u32((uint32x4_t)(_a_), vnegq_s32((int32x4_t)(_b_)))
110 | //---------------------------------------------- Compare --------- true/false->1/0 (all bits set) ---------------------------------
111 | #define _mm_cmpeq_epi8( _a_,_b_) (__m128i)vceqq_s8( ( int8x16_t)(_a_), ( int8x16_t)(_b_))
112 | #define _mm_cmpeq_epi16(_a_,_b_) (__m128i)vceqq_s16(( int16x8_t)(_a_), ( int16x8_t)(_b_))
113 | #define _mm_cmpeq_epi32(_a_,_b_) (__m128i)vceqq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
114 |
115 | #define _mm_cmpgt_epi16(_a_,_b_) (__m128i)vcgtq_s16(( int16x8_t)(_a_), ( int16x8_t)(_b_))
116 | #define _mm_cmpgt_epi32(_a_,_b_) (__m128i)vcgtq_s32(( int32x4_t)(_a_), ( int32x4_t)(_b_))
117 |
118 | #define _mm_cmpgt_epu16(_a_,_b_) (__m128i)vcgtq_u16((uint16x8_t)(_a_), (uint16x8_t)(_b_))
119 | #define mm_cmpgt_epu32(_a_,_b_) (__m128i)vcgtq_u32( _a_, _b_)
120 | //---------------------------------------------- Load -----------------------------------------------------------------------------
121 | #define _mm_loadl_epi64( _u64p_) (__m128i)vcombine_s32(vld1_s32((int32_t const *)(_u64p_)), vcreate_s32(0))
122 | #define mm_loadu_epi64p( _u64p_,_a_) (__m128i)vld1q_lane_u64((uint64_t *)(_u64p_), (uint64x2_t)(_a_), 0)
123 | #define _mm_loadu_si128( _ip_) vld1q_u32((uint32_t const *)(_ip_))
124 | #define _mm_load_si128( _ip_) vld1q_u32((uint32_t const *)(_ip_))
125 | //---------------------------------------------- Store ----------------------------------------------------------------------------
126 | #define _mm_storel_epi64(_ip_,_a_) vst1q_lane_u64((uint64_t *)(_ip_), (uint64x2_t)(_a_), 0)
127 | #define _mm_storeu_si128(_ip_,_a_) vst1q_u32((uint32_t *)(_ip_),_a_)
128 | //---------------------------------------------- Convert --------------------------------------------------------------------------
129 | #define mm_cvtsi64_si128p(_u64p_,_a_) mm_loadu_epi64p(_u64p_,_a_)
130 | #define _mm_cvtsi64_si128(_a_) (__m128i)vdupq_n_u64(_a_) //vld1q_s64(_a_)
131 | //---------------------------------------------- Reverse bits/bytes ---------------------------------------------------------------
132 | #define mm_rbit_epi8(a) (__m128i)vrbitq_u8( (uint8x16_t)(a)) // reverse bits
133 | #define mm_rev_epi16(a) vrev16q_u8((uint8x16_t)(a)) // reverse bytes
134 | #define mm_rev_epi32(a) vrev32q_u8((uint8x16_t)(a))
135 | #define mm_rev_epi64(a) vrev64q_u8((uint8x16_t)(a))
136 | //--------------------------------------------- Insert/extract --------------------------------------------------------------------
137 | #define mm_extract_epi32x(_a_,_u32_,_id_) vst1q_lane_u32((uint32_t *)&(_u32_), _a_, _id_)
138 | #define _mm_extract_epi64x(_a_,_u64_,_id_) vst1q_lane_u64((uint64_t *)&(_u64_), (uint64x2_t)(_a_), _id_)
139 |
140 | #define _mm_extract_epi8(_a_, _id_) vgetq_lane_u8( (uint8x16_t)(_a_), _id_)
141 | #define _mm_extract_epi16(_a_, _id_) vgetq_lane_u16(_a_, _id_)
142 | #define _mm_extract_epi32(_a_, _id_) vgetq_lane_u32(_a_, _id_)
143 | #define mm_extract_epu32(_a_, _id_) vgetq_lane_u32(_a_, _id_)
144 | #define _mm_cvtsi128_si32(_a_) vgetq_lane_u32((uint32x4_t)(_a_),0)
145 | #define _mm_cvtsi128_si64(_a_) vgetq_lane_u64((uint64x2_t)(_a_),0)
146 |
147 | #define _mm_insert_epu32p(_a_,_u32p_,_id_) vsetq_lane_u32(*(uint32_t *)(_u32p_), _a_, _id_)
148 | #define mm_insert_epi32p(_a_,_u32p_,_id_) vld1q_lane_u32(_u32p_, (uint32x4_t)(_a_), _id_)
149 | #define _mm_cvtsi32_si128(_a_) (__m128i)vsetq_lane_s32(_a_, vdupq_n_s32(0), 0)
150 |
151 | #define _mm_blendv_epi8(_a_,_b_,_m_) vbslq_u32(_m_,_b_,_a_)
152 | //---------------------------------------------- Miscellaneous --------------------------------------------------------------------
153 | #define _mm_alignr_epi8(_a_,_b_,_m_) (__m128i)vextq_u8( (uint8x16_t)(_b_), (uint8x16_t)(_a_), _m_)
154 | #define _mm_packs_epi16( _a_,_b_) (__m128i)vcombine_s8( vqmovn_s16((int16x8_t)(_a_)), vqmovn_s16((int16x8_t)(_b_)))
155 | #define _mm_packs_epi32( _a_,_b_) (__m128i)vcombine_s16(vqmovn_s32((int32x4_t)(_a_)), vqmovn_s32((int32x4_t)(_b_)))
156 |
157 | #define _mm_packs_epu16( _a_,_b_) (__m128i)vcombine_u8(vqmovn_u16((uint16x8_t)(_a_)), vqmovn_u16((uint16x8_t)(_b_)))
158 | #define _mm_packus_epi16( _a_,_b_) (__m128i)vcombine_u8(vqmovun_s16((int16x8_t)(_a_)), vqmovun_s16((int16x8_t)(_b_)))
159 |
160 | static ALWAYS_INLINE uint16_t _mm_movemask_epi8(__m128i v) {
161 | const uint8x16_t __attribute__ ((aligned (16))) m = {1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7, 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7};
162 | uint8x16_t mv = (uint8x16_t)vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(vandq_u8(vcltq_s8((int8x16_t)v, vdupq_n_s8(0)), m))));
163 | return vgetq_lane_u8(mv, 8) << 8 | vgetq_lane_u8(mv, 0);
164 | }
165 | //-------- Neon movemask ------ All lanes must be 0 or -1 (=0xff, 0xffff or 0xffffffff)
166 | #ifdef __aarch64__
167 | static ALWAYS_INLINE uint8_t mm_movemask_epi8s(uint8x8_t sv) { const uint8x8_t m = { 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<< 5, 1<< 6, 1<<7 }; return vaddv_u8( vand_u8( sv, m)); } // short only ARM
168 | //static ALWAYS_INLINE uint16_t mm_movemask_epu16(uint32x4_t v) { const uint16x8_t m = { 1, 1<<2, 1<<4, 1<<6, 1<<8, 1<<10, 1<<12, 1<<14}; return vaddvq_u16(vandq_u16((uint16x8_t)v, m)); }
169 | static ALWAYS_INLINE uint16_t mm_movemask_epu16(__m128i v) { const uint16x8_t m = { 1, 1<<1, 1<<2, 1<<3, 1<<4, 1<< 5, 1<< 6, 1<<7 }; return vaddvq_u16(vandq_u16((uint16x8_t)v, m)); }
170 | static ALWAYS_INLINE uint32_t mm_movemask_epu32(__m128i v) { const uint32x4_t m = { 1, 1<<1, 1<<2, 1<<3 }; return vaddvq_u32(vandq_u32((uint32x4_t)v, m)); }
171 | static ALWAYS_INLINE uint64_t mm_movemask_epu64(__m128i v) { const uint64x2_t m = { 1, 1<<1 }; return vaddvq_u64(vandq_u64((uint64x2_t)v, m)); }
172 | #else
173 | static ALWAYS_INLINE uint32_t mm_movemask_epu32(uint32x4_t v) { const uint32x4_t mask = {1,2,4,8}, av = vandq_u32(v, mask), xv = vextq_u32(av, av, 2), ov = vorrq_u32(av, xv); return vgetq_lane_u32(vorrq_u32(ov, vextq_u32(ov, ov, 3)), 0); }
174 | #endif
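The NEON movemask emulations above pack the sign bit of every lane into a small integer; they can be cross-checked against a plain scalar reference with SSE2 `_mm_movemask_epi8` semantics. This is an illustrative sketch for testing, not part of the library:

```c
#include <stdint.h>

/* Scalar reference for SSE2 _mm_movemask_epi8 semantics:
   collect the most significant bit of each of the 16 bytes
   into bits 0..15 of the result. */
static uint16_t movemask_epi8_ref(const uint8_t v[16]) {
    uint16_t m = 0;
    for (int i = 0; i < 16; i++)
        m |= (uint16_t)(v[i] >> 7) << i;
    return m;
}
```

Feeding the same 16-byte vector to the NEON `_mm_movemask_epi8` above and to this reference should yield identical masks.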
175 | // --------------------------------------------- Swizzle : _mm_shuffle_epi8 / _mm_shuffle_epi32 / Pack/Unpack -----------------------------------------
176 | #define _MM_SHUFFLE(u3,u2,u1,u0) ((u3) << 6 | (u2) << 4 | (u1) << 2 | (u0))
177 |
178 | #define _mm_shuffle_epi8(_a_, _b_) (__m128i)vqtbl1q_u8((uint8x16_t)(_a_), (uint8x16_t)(_b_))
179 | #if defined(__aarch64__)
180 | #define mm_shuffle_nnnn_epi32(_a_,_m_) (__m128i)vdupq_laneq_u32(_a_, _m_)
181 | #else
182 | #define mm_shuffle_nnnn_epi32(_a_,_m_) (__m128i)vdupq_n_u32(vgetq_lane_u32(_a_, _m_))
183 | #endif
184 |
185 | #ifdef USE_MACROS
186 | #define mm_shuffle_2031_epi32(_a_) ({ uint32x4_t _rv = vrev64q_u32(_a_); uint32x2x2_t _zv = vtrn_u32(vget_low_u32(_rv), vget_high_u32(_rv)); vcombine_u32(_zv.val[0], _zv.val[1]);})
187 | #define mm_shuffle_3120_epi32(_a_) ({ uint32x2x2_t _zv = vtrn_u32(vget_low_u32(_a_), vget_high_u32(_a_)); vcombine_u32(_zv.val[0], _zv.val[1]);})
188 | #else
189 | static ALWAYS_INLINE __m128i mm_shuffle_2031_epi32(__m128i a) { uint32x4_t v = (uint32x4_t)vrev64q_u32(a); uint32x2x2_t z = vtrn_u32(vget_low_u32(v), vget_high_u32(v)); return vcombine_u32(z.val[0], z.val[1]);}
190 | static ALWAYS_INLINE __m128i mm_shuffle_3120_epi32(__m128i a) { uint32x2x2_t z = vtrn_u32(vget_low_u32(a), vget_high_u32(a)); return vcombine_u32(z.val[0], z.val[1]);}
191 | #endif
192 |
193 | #if defined(USE_MACROS) || defined(__clang__)
194 | #define _mm_shuffle_epi32(_a_, _m_) ({ const uint32x4_t _av =_a_;\
195 | uint32x4_t _v = vmovq_n_u32(vgetq_lane_u32(_av, (_m_) & 0x3));\
196 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 2) & 0x3), _v, 1);\
197 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 4) & 0x3), _v, 2);\
198 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 6) & 0x3), _v, 3); _v;\
199 | })
200 | #define _mm_shuffle_epi32s(_a_, _m_) _mm_set_epi32(vgetq_lane_u32(_a_, ((_m_) ) & 0x3),\
201 | vgetq_lane_u32(_a_, ((_m_) >> 2) & 0x3),\
202 | vgetq_lane_u32(_a_, ((_m_) >> 4) & 0x3),\
203 | vgetq_lane_u32(_a_, ((_m_) >> 6) & 0x3))
204 | #else
205 | static ALWAYS_INLINE __m128i _mm_shuffle_epi32(__m128i _a_, const unsigned _m_) { const uint32x4_t _av =_a_;
206 | uint32x4_t _v = vmovq_n_u32(vgetq_lane_u32(_av, (_m_) & 0x3));
207 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 2) & 0x3), _v, 1);
208 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 4) & 0x3), _v, 2);
209 | _v = vsetq_lane_u32(vgetq_lane_u32(_av, ((_m_) >> 6) & 0x3), _v, 3);
210 | return _v;
211 | }
212 | static ALWAYS_INLINE __m128i _mm_shuffle_epi32s(__m128i _a_, const unsigned _m_) {
213 | return _mm_set_epi32(vgetq_lane_u32(_a_, ((_m_) ) & 0x3),
214 | vgetq_lane_u32(_a_, ((_m_) >> 2) & 0x3),
215 | vgetq_lane_u32(_a_, ((_m_) >> 4) & 0x3),
216 | vgetq_lane_u32(_a_, ((_m_) >> 6) & 0x3));
217 | }
218 | #endif
219 | #ifdef USE_MACROS
220 | #define _mm_unpacklo_epi8( _a_,_b_) ({ uint8x8x2_t _zv = vzip_u8 ( vget_low_u8( (uint8x16_t)(_a_)), vget_low_u8 ((uint8x16_t)(_b_))); (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);})
221 | #define _mm_unpacklo_epi16(_a_,_b_) ({ uint16x4x2_t _zv = vzip_u16( vget_low_u16((uint16x8_t)(_a_)), vget_low_u16((uint16x8_t)(_b_))); (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);})
222 | #define _mm_unpacklo_epi32(_a_,_b_) ({ uint32x2x2_t _zv = vzip_u32( vget_low_u32( _a_ ), vget_low_u32( _b_ )); vcombine_u32(_zv.val[0], _zv.val[1]);})
223 | #define _mm_unpacklo_epi64(_a_,_b_) (uint32x4_t)vcombine_u64(vget_low_u64((uint64x2_t)(_a_)), vget_low_u64((uint64x2_t)(_b_)))
224 |
225 | #define _mm_unpackhi_epi8( _a_,_b_) ({ uint8x8x2_t _zv = vzip_u8 (vget_high_u8( (uint8x16_t)(_a_)), vget_high_u8( (uint8x16_t)(_b_))); (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);})
226 | #define _mm_unpackhi_epi16(_a_,_b_) ({ uint16x4x2_t _zv = vzip_u16(vget_high_u16((uint16x8_t)(_a_)), vget_high_u16((uint16x8_t)(_b_))); (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);})
227 | #define _mm_unpackhi_epi32(_a_,_b_) ({ uint32x2x2_t _zv = vzip_u32(vget_high_u32( _a_ ), vget_high_u32( _b_ )); vcombine_u32(_zv.val[0], _zv.val[1]);})
228 | #define _mm_unpackhi_epi64(_a_,_b_) (uint32x4_t)vcombine_u64(vget_high_u64((uint64x2_t)(_a_)), vget_high_u64((uint64x2_t)(_b_)))
229 | #else
230 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi8( __m128i _a_, __m128i _b_) { uint8x8x2_t _zv = vzip_u8 ( vget_low_u8( (uint8x16_t)(_a_)), vget_low_u8 ((uint8x16_t)(_b_))); return (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]);}
231 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi16(__m128i _a_, __m128i _b_) { uint16x4x2_t _zv = vzip_u16( vget_low_u16((uint16x8_t)(_a_)), vget_low_u16((uint16x8_t)(_b_))); return (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]);}
232 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi32(__m128i _a_, __m128i _b_) { uint32x2x2_t _zv = vzip_u32( vget_low_u32( _a_ ), vget_low_u32( _b_ )); return vcombine_u32(_zv.val[0], _zv.val[1]);}
233 | static ALWAYS_INLINE __m128i _mm_unpacklo_epi64(__m128i _a_, __m128i _b_) { return (uint32x4_t)vcombine_u64(vget_low_u64((uint64x2_t)(_a_)), vget_low_u64((uint64x2_t)(_b_))); }
234 |
235 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi8( __m128i _a_, __m128i _b_) { uint8x8x2_t _zv = vzip_u8 (vget_high_u8( (uint8x16_t)(_a_)), vget_high_u8( (uint8x16_t)(_b_))); return (uint32x4_t)vcombine_u8( _zv.val[0], _zv.val[1]); }
236 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi16(__m128i _a_, __m128i _b_) { uint16x4x2_t _zv = vzip_u16(vget_high_u16((uint16x8_t)(_a_)), vget_high_u16((uint16x8_t)(_b_))); return (uint32x4_t)vcombine_u16(_zv.val[0], _zv.val[1]); }
237 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi32(__m128i _a_, __m128i _b_) { uint32x2x2_t _zv = vzip_u32(vget_high_u32( _a_ ), vget_high_u32( _b_ )); return vcombine_u32(_zv.val[0], _zv.val[1]); }
238 | static ALWAYS_INLINE __m128i _mm_unpackhi_epi64(__m128i _a_, __m128i _b_) { return (uint32x4_t)vcombine_u64(vget_high_u64((uint64x2_t)(_a_)), vget_high_u64((uint64x2_t)(_b_))); }
239 | #endif
240 |
241 | #else //------------------------------------- intel SSE2/SSSE3 --------------------------------------------------------------
242 | #define mm_movemask_epu32(_a_) _mm_movemask_ps(_mm_castsi128_ps(_a_))
243 | #define mm_movemask_epu16(_a_) _mm_movemask_epi8(_a_)
244 | #define mm_loadu_epi64p( _u64p_,_a_) _a_ = _mm_cvtsi64_si128(ctou64(_u64p_))
245 |
246 | #define mm_extract_epu32( _a_, _id_) _mm_extract_epi32(_a_, _id_)
247 | #define mm_extract_epi32x(_a_,_u32_, _id_) _u32_ = _mm_extract_epi32(_a_, _id_)
248 | #define mm_extract_epi64x(_a_,_u64_, _id_) _u64_ = _mm_extract_epi64(_a_, _id_)
249 | #define mm_insert_epi32p( _a_,_u32p_,_c_) _mm_insert_epi32( _a_,ctou32(_u32p_),_c_)
250 |
251 | #define mm_mullo_epu32( _a_,_b_) _mm_mullo_epi32(_a_,_b_)
252 | #define mm_cvtsi64_si128p(_u64p_,_a_) _a_ = _mm_cvtsi64_si128(ctou64(_u64p_))
253 |
254 | #define mm_cmpgt_epu32( _a_, _b_) _mm_cmpgt_epi32(_mm_xor_si128(_a_, cv80000000), _mm_xor_si128(_b_, cv80000000))
255 |
256 | #define mm_shuffle_nnnn_epi32(_a_, _n_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(_n_,_n_,_n_,_n_))
257 | #define mm_shuffle_2031_epi32(_a_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(2,0,3,1))
258 | #define mm_shuffle_3120_epi32(_a_) _mm_shuffle_epi32(_a_, _MM_SHUFFLE(3,1,2,0))
259 |
260 | static ALWAYS_INLINE __m128i mm_rbit_epi8(__m128i v) { // reverse bits in bytes
261 | __m128i fv = _mm_set_epi8(15, 7,11, 3,13, 5, 9, 1,14, 6,10, 2,12, 4, 8, 0), cv0f_8 = _mm_set1_epi8(0xf);
262 | __m128i lv = _mm_shuffle_epi8(fv,_mm_and_si128( v, cv0f_8));
263 | __m128i hv = _mm_shuffle_epi8(fv,_mm_and_si128(_mm_srli_epi64(v, 4), cv0f_8));
264 | return _mm_or_si128(_mm_slli_epi64(lv,4), hv);
265 | }
266 |
267 | static ALWAYS_INLINE __m128i mm_rev_epi16(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8(14,15,12,13,10,11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1)); } // reverse vector bytes in uint??_t
268 | static ALWAYS_INLINE __m128i mm_rev_epi32(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8(12,13,14,15, 8, 9,10,11, 4, 5, 6, 7, 0, 1, 2, 3)); }
269 | static ALWAYS_INLINE __m128i mm_rev_epi64(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8( 8, 9,10,11,12,13,14,15, 0, 1, 2, 3, 4, 5, 6, 7)); }
270 | static ALWAYS_INLINE __m128i mm_rev_si128(__m128i v) { return _mm_shuffle_epi8(v, _mm_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15)); }
271 | #endif
272 | #endif
273 |
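The SSE fallback `mm_rbit_epi8` above reverses the bits of each byte by looking up both nibbles in a 16-entry table via `_mm_shuffle_epi8`. The same trick in scalar C, using the identical lookup table (an illustrative sketch, not library code):

```c
#include <stdint.h>

/* Reverse the bits of one byte with the 16-entry nibble LUT
   that mm_rbit_epi8 feeds to _mm_shuffle_epi8: lut[n] is n
   with its 4 bits reversed. The low nibble, reversed, becomes
   the high nibble of the result, and vice versa. */
static uint8_t rbit8(uint8_t b) {
    static const uint8_t lut[16] = { 0, 8, 4,12, 2,10, 6,14,
                                     1, 9, 5,13, 3,11, 7,15 };
    return (uint8_t)(lut[b & 0x0F] << 4 | lut[b >> 4]);
}
```

The vector version applies this lookup to all 16 bytes at once, which is why the table is laid out exactly as in the `_mm_set_epi8` call inside `mm_rbit_epi8`.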
--------------------------------------------------------------------------------
/time_.h:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // time_.h : time functions
25 | #include <time.h>
26 |
27 | #ifdef _WIN32
28 | #include <windows.h>
29 | #ifndef sleep
30 | #define sleep(n) Sleep((n) * 1000)
31 | #endif
32 | typedef unsigned __int64 uint64_t;
33 | typedef unsigned __int64 tm_t;
34 | #else
35 | #include <unistd.h>
36 | #include <stdint.h>
37 | typedef uint64_t tm_t;
38 | #define Sleep(ms) usleep((ms) * 1000)
39 | #endif
40 |
41 | #if defined (__i386__) || defined( __x86_64__ )
42 | #ifdef _MSC_VER
43 | #include <intrin.h> // __rdtsc
44 | #else
45 | #include <x86intrin.h>
46 | #endif
47 |
48 | #ifdef __corei7__
49 | #define RDTSC_INI(_c_) do { unsigned _cl, _ch; \
50 |                        __asm volatile ("cpuid\n\t" \
51 | "rdtsc\n\t" \
52 | "mov %%edx, %0\n" \
53 | "mov %%eax, %1\n": "=r" (_ch), "=r" (_cl):: \
54 | "%rax", "%rbx", "%rcx", "%rdx"); \
55 | _c_ = (uint64_t)_ch << 32 | _cl; \
56 | } while(0)
57 |
58 | #define RDTSC(_c_) do { unsigned _cl, _ch; \
59 | __asm volatile("rdtscp\n" \
60 | "mov %%edx, %0\n" \
61 | "mov %%eax, %1\n" \
62 | "cpuid\n\t": "=r" (_ch), "=r" (_cl):: "%rax",\
63 | "%rbx", "%rcx", "%rdx");\
64 | _c_ = (uint64_t)_ch << 32 | _cl;\
65 | } while(0)
66 | #else
67 | #define RDTSC(_c_) do { unsigned _cl, _ch;\
68 | __asm volatile ("cpuid \n"\
69 | "rdtsc"\
70 | : "=a"(_cl), "=d"(_ch)\
71 | : "a"(0)\
72 | : "%ebx", "%ecx");\
73 | _c_ = (uint64_t)_ch << 32 | _cl;\
74 | } while(0)
75 | #define RDTSC_INI(_c_) RDTSC(_c_)
76 | #endif
77 | #else
78 | #define RDTSC_INI(_c_)
79 | #define RDTSC(_c_)
80 | #endif
81 |
82 | #define tmrdtscini() ({ tm_t _c; __asm volatile("" ::: "memory"); RDTSC_INI(_c); _c; })
83 | #define tmrdtsc() ({ tm_t _c; RDTSC(_c); _c; })
84 |
85 | #ifndef TM_F
86 | #define TM_F 1.0 // TM_F=4 -> MI/s
87 | #endif
88 |
89 | #ifdef RDTSC_ON
90 | #define tminit() tmrdtscini()
91 | #define tmtime() tmrdtsc()
92 | #define TM_T CLOCKS_PER_SEC
93 | static double TMBS(unsigned l, tm_t t) { double dt = t, dl = l; return dt/dl; }
94 | #define TM_C 1000
95 | #else
96 | #define TM_T 1000000.0
97 | #define TM_C 1
98 | static double TMBS(unsigned l, tm_t tm) { double dl=l,dt=tm; return dt>=0.000001?(dl/(1000000.0*TM_F))/(dt/TM_T):0.0; }
99 | #ifdef _WIN32
100 | static LARGE_INTEGER tps;
101 | static tm_t tmtime(void) {
102 | LARGE_INTEGER tm;
103 | tm_t t;
104 | double d;
105 | QueryPerformanceCounter(&tm);
106 | d = tm.QuadPart;
107 | t = d*1000000.0/tps.QuadPart;
108 | return t;
109 | }
110 |
111 | static tm_t tminit() { tm_t t0,ts; QueryPerformanceFrequency(&tps); t0 = tmtime(); while((ts = tmtime())==t0); return ts; }
112 | #else
113 | #ifdef __APPLE__
114 | #include <AvailabilityMacros.h>
115 | #ifndef MAC_OS_X_VERSION_10_12
116 | #define MAC_OS_X_VERSION_10_12 101200
117 | #endif
118 | #define CIVETWEB_APPLE_HAVE_CLOCK_GETTIME defined(__APPLE__) && MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_X_VERSION_10_12
119 | #if !(CIVETWEB_APPLE_HAVE_CLOCK_GETTIME)
120 | #include <sys/time.h>
121 | #define CLOCK_REALTIME 0
122 | #define CLOCK_MONOTONIC 0
123 | int clock_gettime(int clk_id, struct timespec* t) {
124 | struct timeval now;
125 | int rv = gettimeofday(&now, NULL);
126 | if (rv) return rv;
127 | t->tv_sec = now.tv_sec;
128 | t->tv_nsec = now.tv_usec * 1000;
129 | return 0;
130 | }
131 | #endif
132 | #endif
133 | static tm_t tmtime(void) { struct timespec tm; clock_gettime(CLOCK_MONOTONIC, &tm); return (tm_t)tm.tv_sec*1000000 + tm.tv_nsec/1000; }
134 | static tm_t tminit() { tm_t t0=tmtime(),ts; while((ts = tmtime())==t0); return ts; }
135 | #endif
136 | static double tmsec( tm_t tm) { double d = tm; return d/1000000.0; }
137 | static double tmmsec(tm_t tm) { double d = tm; return d/1000.0; }
138 | #endif
139 | //---------------------------------------- bench ----------------------------------------------------------------------
140 | #define TM_TX TM_T
141 |
142 | #define TMSLEEP do { tm_T = tmtime(); if(!tm_0) tm_0 = tm_T; else if(tm_T - tm_0 > tm_TX) { if(tm_verbose) { printf("S \b\b");fflush(stdout);} sleep(tm_slp); tm_0=tmtime();} } while(0)
143 |
144 | #define TMBEG(_tm_reps_, _tm_Reps_) { unsigned _tm_r,_tm_c=0,_tm_R; tm_t _tm_t0,_tm_t,_tm_ts;\
145 | for(tm_rm = _tm_reps_, tm_tm = (tm_t)1<<63,_tm_R = 0,_tm_ts=tmtime(); _tm_R < _tm_Reps_; _tm_R++) { tm_t _tm_t0 = tminit();\
146 | for(_tm_r=0;_tm_r < tm_rm;) {
147 |
148 | #define TMEND(_len_) _tm_r++; if((_tm_t = tmtime() - _tm_t0) > tm_tx) break; } \
149 | if(_tm_t < tm_tm) { if(tm_tm == (tm_t)1<<63) tm_rm = _tm_r; tm_tm = _tm_t; _tm_c++; } \
150 | else if(_tm_t>tm_tm*1.2) TMSLEEP; if(tm_verbose) { double d = tm_tm*TM_C,dr=tm_rm; printf("%8.2f %2d_%.2d\b\b\b\b\b\b\b\b\b\b\b\b\b\b",TMBS(_len_, d/dr),_tm_R+1,_tm_c),fflush(stdout); }\
151 | if(tmtime()-_tm_ts > tm_TX && _tm_R < tm_RepMin) break;\
152 | if((_tm_R & 7)==7) sleep(tm_slp),_tm_ts=tmtime(); } }
153 |
154 | static unsigned tm_rep = 1<<20, tm_Rep = 3, tm_rep2 = 1<<20, tm_Rep2 = 4, tm_slp = 20, tm_rm;
155 | static tm_t tm_tx = TM_T, tm_TX = 120*TM_T, tm_RepMin=1, tm_0, tm_T, tm_verbose=2, tm_tm;
156 | static void tm_init(int _tm_Rep, int _tm_verbose) { tm_verbose = _tm_verbose; if(_tm_Rep) tm_Rep = _tm_Rep; tm_tx = tminit(); Sleep(500); tm_tx = tmtime() - tm_tx; tm_TX = 10*tm_tx; }
157 |
158 | #define TMBENCH(_name_, _func_, _len_) do { if(tm_verbose>1) printf("%s ", _name_?_name_:#_func_); TMBEG(tm_rep, tm_Rep) _func_; TMEND(_len_); { double dm = tm_tm,dr=tm_rm; if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_, dm*TM_C/dr) );} } while(0)
159 | #define TMBENCH2(_name_, _func_, _len_) do { TMBEG(tm_rep2, tm_Rep2) _func_; TMEND(_len_); { double dm = tm_tm,dr=tm_rm; if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_, dm*TM_C/dr) );} if(tm_verbose>1) printf("%s ", _name_?_name_:#_func_); } while(0)
160 | #define TMBENCHT(_name_,_func_, _len_, _res_) do { TMBEG(tm_rep, tm_Rep) if(_func_ != _res_) { printf("ERROR: %lld != %lld", (long long)_func_, (long long)_res_ ); exit(0); }; TMEND(_len_); if(tm_verbose) printf("%8.2f \b\b\b\b\b", TMBS(_len_,(double)tm_tm*TM_C/(double)tm_rm) ); if(tm_verbose) printf("%s ", _name_?_name_:#_func_ ); } while(0)
161 |
162 | #define Kb (1u<<10)
163 | #define Mb (1u<<20)
164 | #define Gb (1u<<30)
165 | #define KB 1000
166 | #define MB 1000000
167 | #define GB 1000000000
168 |
169 | static unsigned argtoi(char *s, unsigned def) {
170 | char *p;
171 | unsigned n = strtol(s, &p, 10),f = 1;
172 | switch(*p) {
173 | case 'K': f = KB; break;
174 | case 'M': f = MB; break;
175 | case 'G': f = GB; break;
176 | case 'k': f = Kb; break;
177 | case 'm': f = Mb; break;
178 | case 'g': f = Gb; break;
179 | case 'b': def = 0;
180 | default: if(!def) return n>=32?0xffffffffu:(1u << n); f = def;
181 | }
182 | return n*f;
183 | }
184 | static uint64_t argtol(char *s) {
185 | char *p;
186 | uint64_t n = strtol(s, &p, 10),f=1;
187 | switch(*p) {
188 | case 'K': f = KB; break;
189 | case 'M': f = MB; break;
190 | case 'G': f = GB; break;
191 | case 'k': f = Kb; break;
192 | case 'm': f = Mb; break;
193 | case 'g': f = Gb; break;
194 | case 'b': return 1u << n;
195 | default: f = MB;
196 | }
197 | return n*f;
198 | }
199 |
200 | static uint64_t argtot(char *s) {
201 | char *p;
202 | uint64_t n = strtol(s, &p, 10),f=1;
203 | switch(*p) {
204 | case 'h': f = 3600000; break;
205 | case 'm': f = 60000; break;
206 | case 's': f = 1000; break;
207 | case 'M': f = 1; break;
208 | default: f = 1000;
209 | }
210 | return n*f;
211 | }
212 |
213 | static void memrcpy(unsigned char *out, unsigned char *in, unsigned n) { unsigned i; for(i = 0; i < n; i++) out[i] = ~in[i]; } // complement copy: poison the decode buffer
214 |
--------------------------------------------------------------------------------
/tpbench.c:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2018
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | #include <stdio.h>
25 | #include <stdlib.h>
26 |
27 | #ifdef __APPLE__
28 | #include <sys/malloc.h>
29 | #else
30 | #include <malloc.h>
31 | #endif
32 | #ifdef _MSC_VER
33 | #include "vs/getopt.h"
34 | #else
35 | #include <getopt.h>
36 | #endif
37 |
38 | #include "conf.h"
39 | //#define RDTSC_ON
40 | #include "time_.h"
41 |
42 | #include "transpose.h"
43 |
44 | #ifdef BITSHUFFLE
45 | #include "bitshuffle/src/bitshuffle.h"
46 | #include "bitshuffle/lz4/lz4.h"
47 | #endif
48 |
49 | #ifdef BLOSC
50 | #include "c-blosc2/blosc/shuffle.h"
51 | #include "c-blosc2/blosc/blosc2.h"
52 | #endif
53 |
54 | int memcheck(unsigned char *in, unsigned n, unsigned char *cpy) {
55 | unsigned i;
56 | for(i = 0; i < n; i++)
57 | if(in[i] != cpy[i]) {
58 | printf("ERROR in[%d]=%x, dec[%d]=%x\n", i, in[i], i, cpy[i]);
59 | return i+1;
60 | }
61 | return 0;
62 | }
63 |
64 | #ifdef LZ4_ON
65 | #ifdef USE_SSE
66 | unsigned tp4lz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
67 |   tp4enc(in, n, tmp, esize);
68 |   return LZ4_compress((char *)tmp, (char *)out, n);
69 | }
70 |
71 | unsigned tp4lz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
72 |   unsigned rc;
73 |   rc = LZ4_decompress_fast((char *)in, (char *)tmp, n);
74 |   tp4dec(tmp, n, out, esize);
75 |   return rc;
76 | }
77 | #endif
78 |
79 | unsigned tplz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
80 |   tpenc(in, n, tmp, esize);
81 |   return LZ4_compress((char *)tmp, (char *)out, n);
82 | }
83 |
84 | unsigned tplz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
85 |   unsigned rc;
86 |   rc = LZ4_decompress_fast((char *)in, (char *)tmp, n);
87 |   tpdec(tmp, n, out, esize);
88 |   return rc;
89 | }
90 | #endif
91 |
92 | #ifdef BITSHUFFLE
93 | #define BITSHUFFLE(in,n,out,esize) bshuf_bitshuffle( in, out, (n)/esize, esize, 0); memcpy((char *)out+((n)&(~(8*esize-1))),(char *)in+((n)&(~(8*esize-1))),(n)&(8*esize-1))
94 | #define BITUNSHUFFLE(in,n,out,esize) bshuf_bitunshuffle(in, out, (n)/esize, esize, 0); memcpy((char *)out+((n)&(~(8*esize-1))),(char *)in+((n)&(~(8*esize-1))),(n)&(8*esize-1))
95 |
96 | unsigned bslz4enc(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
97 |   BITSHUFFLE(in, n, tmp, esize);
98 |   return LZ4_compress((char *)tmp, (char *)out, n);
99 | }
100 |
101 | unsigned bslz4dec(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *tmp) {
102 |   unsigned rc;
103 |   rc = LZ4_decompress_fast((char *)in, (char *)tmp, n);
104 |   BITUNSHUFFLE(tmp, n, out, esize);
105 |   return rc;
106 | }
107 | #endif
108 |
109 | #define ID_MEMCPY 7
110 | void bench(unsigned char *in, unsigned n, unsigned char *out, unsigned esize, unsigned char *cpy, int id) {
111 | memrcpy(cpy,in,n);
112 |
113 | switch(id) {
114 | case 1: { TMBENCH("", tpenc(in, n,out,esize) ,n); TMBENCH2("tp_byte ",tpdec(out,n,cpy,esize) ,n); } break;
115 | #ifdef USE_SSE
116 | case 2: { TMBENCH("", tp4enc(in,n,out,esize) ,n); TMBENCH2("tp_nibble ",tp4dec(out,n,cpy,esize) ,n); } break;
117 | #endif
118 | #ifdef BLOSC
119 | case 3: { TMBENCH("",shuffle(esize,n,in,out), n); TMBENCH2("blosc shuffle ",unshuffle(esize,n,out,cpy), n); } break;
120 | case 4: { unsigned char *tmp = malloc(n); TMBENCH("",bitshuffle(esize,n,in,out,tmp), n); TMBENCH2("blosc bitshuffle ",bitunshuffle(esize,n,out,cpy,tmp), n); free(tmp); } break;
121 | #endif
122 | #ifdef BITSHUFFLE
123 | case 5: { TMBENCH("",bshuf_bitshuffle(in,out,(n)/esize,esize,0), n); TMBENCH2("bitshuffle ",bshuf_bitunshuffle(out,cpy,(n)/esize,esize,0), n); } break;
124 | #endif
125 | case 6: TMBENCH("",memcpy(out,in,n) ,n); TMBENCH2("memcpy ",memcpy(cpy,out,n) ,n); break;
126 | case 7:
127 | switch(esize) {
128 | case 2: { TMBENCH("", tpenc2( in, n,out) ,n); TMBENCH2("tp_byte2 scalar", tpdec2( out,n,cpy) ,n); } break;
129 | case 4: { TMBENCH("", tpenc4( in, n,out) ,n); TMBENCH2("tp_byte4 scalar", tpdec4( out,n,cpy) ,n); } break;
130 | case 8: { TMBENCH("", tpenc8( in, n,out) ,n); TMBENCH2("tp_byte8 scalar", tpdec8( out,n,cpy) ,n); } break;
131 | case 16: { TMBENCH("", tpenc16(in, n,out) ,n); TMBENCH2("tp_byte16 scalar",tpdec16(out,n,cpy) ,n); } break;
132 | }
133 | break;
134 | default: return;
135 | }
136 | printf("\n");
137 | memcheck(in,n,cpy);
138 | }
139 |
140 | void usage(char *pgm) {
141 | fprintf(stderr, "\nTPBench Copyright (c) 2013-2019 Powturbo %s\n", __DATE__);
142 | fprintf(stderr, "Usage: %s [options] [file]\n", pgm);
143 | fprintf(stderr, " -e# # = function ids separated by ',' or ranges '#-#' (default='1-%d')\n", ID_MEMCPY);
144 | fprintf(stderr, " -B#s # = max. benchmark filesize (default 1GB) ex. -B4G\n");
145 | fprintf(stderr, " s = modifier s:K,M,G=(1000, 1.000.000, 1.000.000.000) s:k,m,g=(1024,1Mb,1Gb). (default m) ex. 64k or 64K\n");
146 | fprintf(stderr, "Benchmark:\n");
147 | fprintf(stderr, " -i#/-j# # = Minimum de/compression iterations per run (default=auto)\n");
148 | fprintf(stderr, " -I#/-J# # = Number of de/compression runs (default=3)\n");
149 | fprintf(stderr, " -e# # = function id\n");
150 | exit(0);
151 | }
152 |
153 | int main(int argc, char* argv[]) {
154 | unsigned cmp=1, b = 1 << 30, esize=4, lz=0, fno,id=0;
155 | char *scmd = NULL;
156 | int c, digit_optind = 0, this_option_optind = optind ? optind : 1, option_index = 0;
157 | static struct option long_options[] = { {"blocsize", 0, 0, 'b'}, {0, 0, 0} };
158 | for(;;) {
159 | if((c = getopt_long(argc, argv, "B:ce:i:I:j:J:q:s:z", long_options, &option_index)) == -1) break;
160 | switch(c) {
161 | case 0 : printf("Option %s", long_options[option_index].name); if(optarg) printf (" with arg %s", optarg); printf ("\n"); break;
162 | case 'e': scmd = optarg; break;
163 | case 's': esize = atoi(optarg); break;
164 | case 'i': if((tm_rep = atoi(optarg))<=0) tm_rep =tm_Rep=1; break;
165 | case 'I': if((tm_Rep = atoi(optarg))<=0) tm_rep =tm_Rep=1; break;
166 | case 'j': if((tm_rep2 = atoi(optarg))<=0) tm_rep2=tm_Rep2=1; break;
167 | case 'J': if((tm_Rep2 = atoi(optarg))<=0) tm_rep2=tm_Rep2=1; break;
168 | case 'B': b = argtoi(optarg,1); break;
169 | case 'z': lz++; break;
170 | case 'c': cmp++; break;
171 | case 'q': cpuini(atoi(optarg)); break;
172 | default:
173 | usage(argv[0]);
174 | exit(0);
175 | }
176 | }
177 |
178 | printf("tm_verbose=%d ", (int)tm_verbose);
179 | if(argc - optind < 1) { fprintf(stderr, "File not specified\n"); exit(-1); }
180 | {
181 | unsigned char *in,*out,*cpy;
182 | uint64_t totlen=0,tot[3]={0};
183 | for(fno = optind; fno < argc; fno++) {
184 | uint64_t flen;
185 | int n,i;
186 | char *inname = argv[fno];
187 | FILE *fi = fopen(inname, "rb"); if(!fi ) { perror(inname); continue; }
188 | fseek(fi, 0, SEEK_END);
189 | flen = ftell(fi);
190 | fseek(fi, 0, SEEK_SET);
191 |
192 | if(flen > b) flen = b;
193 | n = flen;
194 | if(!(in = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); } cpy = in;
195 | if(!(out = (unsigned char*)malloc(flen*4/3+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
196 | if(cmp && !(cpy = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
197 | n = fread(in, 1, n, fi); printf("File='%s' Length=%u\n", inname, n);
198 | fclose(fi);
199 | if(n <= 0) exit(0);
200 | if(fno == optind) {
201 | tm_init(tm_Rep, 2);
202 | tpini(0);
203 | printf("size=%u, element size=%d. detected simd=%s\n\n", n, esize, cpustr(cpuini(0)));
204 | }
205 | printf(" E MB/s D MB/s function (size=%d )\n", esize);
206 | char *p = scmd?scmd:"1-10";
207 | do {
208 | unsigned id = strtoul(p, &p, 10),idx = id, i;
209 | while(isspace(*p)) p++; if(*p == '-') { if((idx = strtoul(p+1, &p, 10)) < id) idx = id; if(idx > ID_MEMCPY) idx = ID_MEMCPY; }
210 | for(i = id; i <= idx; i++) {
211 | bench(in,n,out,esize,cpy,i);
212 |
213 | if(lz) {
214 | unsigned char *tmp; int rc;
215 | totlen += n;
216 | // Test Transpose + lz
217 | if(!(tmp = (unsigned char*)malloc(n+1024))) { fprintf(stderr, "malloc error\n"); exit(-1); }
218 | #ifdef LZ4_ON
219 | memrcpy(cpy,in,n); TMBENCH("lz4",rc = LZ4_compress((char *)in, (char *)out, n) ,n); tot[0]+=rc; TMBENCH("",LZ4_decompress_fast((char *)out,(char *)cpy,n) ,n); memcheck(in,n,cpy);
220 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
221 |
222 | memrcpy(cpy,in,n); TMBENCH("tpbyte+lz4",rc = tplz4enc(in, n,out,esize,tmp) ,n); tot[0]+=rc; TMBENCH("",tplz4dec(out,n,cpy,esize,tmp) ,n); memcheck(in,n,cpy);
223 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
224 | #ifdef USE_SSE
225 | memrcpy(cpy,in,n); TMBENCH("tpnibble+lz4",rc = tp4lz4enc(in, n,out,esize,tmp) ,n); tot[1]+=rc; TMBENCH("",tp4lz4dec(out,n,cpy,esize,tmp) ,n); memcheck(in,n,cpy);
226 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
227 | #endif
228 | #endif
229 |
230 | #ifdef BITSHUFFLE
231 | memrcpy(cpy,in,n); TMBENCH("bitshuffle+lz4",rc=bslz4enc(in,n,out,esize,tmp), n); tot[2] += rc; TMBENCH("",bslz4dec(out,n,cpy,esize,tmp), n); memcheck(in,n,cpy);
232 | printf("compressed len=%u ratio=%.2f\n", rc, (double)(rc*100.0)/(double)n);
233 | #endif
234 | printf("\n");
235 | free(tmp);
236 | }
237 | }
238 | } while(*p++);
239 | if(lz) {
240 | #ifdef LZ4_ON
241 | printf("tplz4enc : compressed len=%llu ratio=%.2f %%\n", tot[0], (double)(tot[0]*100.0)/(double)totlen);
242 | #ifdef USE_SSE
243 | printf("tp4lz4enc : compressed len=%llu ratio=%.2f %%\n", tot[1], (double)(tot[1]*100.0)/(double)totlen);
244 | #endif
245 | #endif
246 | #ifdef BITSHUFFLE
247 | printf("bshuf_compress_lz4: compressed len=%llu ratio=%.2f %%\n", tot[2], (double)(tot[2]*100.0)/(double)totlen);
248 | #endif
249 | }
250 | }
251 | }
252 | }
253 |
--------------------------------------------------------------------------------
/transpose.h:
--------------------------------------------------------------------------------
1 | /**
2 | Copyright (C) powturbo 2013-2019
3 | GPL v2 License
4 |
5 | This program is free software; you can redistribute it and/or modify
6 | it under the terms of the GNU General Public License as published by
7 | the Free Software Foundation; either version 2 of the License, or
8 | (at your option) any later version.
9 |
10 | This program is distributed in the hope that it will be useful,
11 | but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | GNU General Public License for more details.
14 |
15 | You should have received a copy of the GNU General Public License along
16 | with this program; if not, write to the Free Software Foundation, Inc.,
17 | 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
18 |
19 | - homepage : https://sites.google.com/site/powturbo/
20 | - github : https://github.com/powturbo
21 | - twitter : https://twitter.com/powturbo
22 | - email : powturbo [_AT_] gmail [_DOT_] com
23 | **/
24 | // transpose.h - Byte/Nibble transpose for further compressing with lz77 or other compressors
25 | #ifdef __cplusplus
26 | extern "C" {
27 | #endif
28 | // Syntax
29 | // in : Input buffer
30 | // n : Total number of bytes in input buffer
31 | // out : output buffer
32 | // esize : element size in bytes (ex. 2, 4, 8,... )
33 |
34 | //---------- High level functions with dynamic cpu detection and JIT scalar/sse/avx2 switching
35 | void tpenc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize); // transpose
36 | void tpdec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize); // reverse transpose
37 |
38 | void tp2denc(unsigned char *in, unsigned x, unsigned y, unsigned char *out, unsigned esize); //2D transpose
39 | void tp2ddec(unsigned char *in, unsigned x, unsigned y, unsigned char *out, unsigned esize);
40 | void tp3denc(unsigned char *in, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize); //3D transpose
41 | void tp3ddec(unsigned char *in, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize);
42 | void tp4denc(unsigned char *in, unsigned w, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize); //4D transpose
43 | void tp4ddec(unsigned char *in, unsigned w, unsigned x, unsigned y, unsigned z, unsigned char *out, unsigned esize);
44 |
45 | // Nibble transpose SIMD (SSE2,AVX2, ARM Neon)
46 | void tp4enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
47 | void tp4dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
48 |
49 | // bit transpose
50 | //void tp1enc( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
51 | //void tp1dec( unsigned char *in, unsigned n, unsigned char *out, unsigned esize);
52 |
53 | //---------- Low level functions ------------------------------------
54 | void tpenc2( unsigned char *in, unsigned n, unsigned char *out); // scalar
55 | void tpenc3( unsigned char *in, unsigned n, unsigned char *out);
56 | void tpenc4( unsigned char *in, unsigned n, unsigned char *out);
57 | void tpenc8( unsigned char *in, unsigned n, unsigned char *out);
58 | void tpenc16( unsigned char *in, unsigned n, unsigned char *out);
59 |
60 | void tpdec2( unsigned char *in, unsigned n, unsigned char *out);
61 | void tpdec3( unsigned char *in, unsigned n, unsigned char *out);
62 | void tpdec4( unsigned char *in, unsigned n, unsigned char *out);
63 | void tpdec8( unsigned char *in, unsigned n, unsigned char *out);
64 | void tpdec16( unsigned char *in, unsigned n, unsigned char *out);
65 |
66 | void tpenc128v2( unsigned char *in, unsigned n, unsigned char *out); // sse2
67 | void tpdec128v2( unsigned char *in, unsigned n, unsigned char *out);
68 | void tpenc128v4( unsigned char *in, unsigned n, unsigned char *out);
69 | void tpdec128v4( unsigned char *in, unsigned n, unsigned char *out);
70 | void tpenc128v8( unsigned char *in, unsigned n, unsigned char *out);
71 | void tpdec128v8( unsigned char *in, unsigned n, unsigned char *out);
72 |
73 | void tp4enc128v2( unsigned char *in, unsigned n, unsigned char *out);
74 | void tp4dec128v2( unsigned char *in, unsigned n, unsigned char *out);
75 | void tp4enc128v4( unsigned char *in, unsigned n, unsigned char *out);
76 | void tp4dec128v4( unsigned char *in, unsigned n, unsigned char *out);
77 | void tp4enc128v8( unsigned char *in, unsigned n, unsigned char *out);
78 | void tp4dec128v8( unsigned char *in, unsigned n, unsigned char *out);
79 |
80 | void tp1enc128v2( unsigned char *in, unsigned n, unsigned char *out);
81 | void tp1dec128v2( unsigned char *in, unsigned n, unsigned char *out);
82 | void tp1enc128v4( unsigned char *in, unsigned n, unsigned char *out);
83 | void tp1dec128v4( unsigned char *in, unsigned n, unsigned char *out);
84 | void tp1enc128v8( unsigned char *in, unsigned n, unsigned char *out);
85 | void tp1dec128v8( unsigned char *in, unsigned n, unsigned char *out);
86 |
87 | void tpenc256v2( unsigned char *in, unsigned n, unsigned char *out); // avx2
88 | void tpdec256v2( unsigned char *in, unsigned n, unsigned char *out);
89 | void tpenc256v4( unsigned char *in, unsigned n, unsigned char *out);
90 | void tpdec256v4( unsigned char *in, unsigned n, unsigned char *out);
91 | void tpenc256v8( unsigned char *in, unsigned n, unsigned char *out);
92 | void tpdec256v8( unsigned char *in, unsigned n, unsigned char *out);
93 |
94 | void tp4enc256v2( unsigned char *in, unsigned n, unsigned char *out);
95 | void tp4dec256v2( unsigned char *in, unsigned n, unsigned char *out);
96 | void tp4enc256v4( unsigned char *in, unsigned n, unsigned char *out);
97 | void tp4dec256v4( unsigned char *in, unsigned n, unsigned char *out);
98 | void tp4enc256v8( unsigned char *in, unsigned n, unsigned char *out);
99 | void tp4dec256v8( unsigned char *in, unsigned n, unsigned char *out);
100 |
101 | //------- CPU instruction set
102 | // cpuiset = 0: return current simd set,
103 | // cpuiset != 0: set simd set 0:scalar, 20:sse2, 52:avx2
104 | int cpuini(int cpuiset);
105 |
106 | // convert simd set to string: "sse2", "sse3", "sse4.1" or "avx2"
107 | // Ex.: printf("current cpu set=%s\n", cpustr(cpuini(0)) );
108 | char *cpustr(int cpuiset);
109 |
110 | #ifdef __cplusplus
111 | }
112 | #endif
113 |
--------------------------------------------------------------------------------
/vs/getopt.c:
--------------------------------------------------------------------------------
1 | /* $OpenBSD: getopt_long.c,v 1.23 2007/10/31 12:34:57 chl Exp $ */
2 | /* $NetBSD: getopt_long.c,v 1.15 2002/01/31 22:43:40 tv Exp $ */
3 |
4 | /*
5 | * Copyright (c) 2002 Todd C. Miller
6 | *
7 | * Permission to use, copy, modify, and distribute this software for any
8 | * purpose with or without fee is hereby granted, provided that the above
9 | * copyright notice and this permission notice appear in all copies.
10 | *
11 | * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
12 | * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
13 | * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
14 | * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
15 | * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
16 | * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
17 | * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
18 | *
19 | * Sponsored in part by the Defense Advanced Research Projects
20 | * Agency (DARPA) and Air Force Research Laboratory, Air Force
21 | * Materiel Command, USAF, under agreement number F39502-99-1-0512.
22 | */
23 | /*-
24 | * Copyright (c) 2000 The NetBSD Foundation, Inc.
25 | * All rights reserved.
26 | *
27 | * This code is derived from software contributed to The NetBSD Foundation
28 | * by Dieter Baron and Thomas Klausner.
29 | *
30 | * Redistribution and use in source and binary forms, with or without
31 | * modification, are permitted provided that the following conditions
32 | * are met:
33 | * 1. Redistributions of source code must retain the above copyright
34 | * notice, this list of conditions and the following disclaimer.
35 | * 2. Redistributions in binary form must reproduce the above copyright
36 | * notice, this list of conditions and the following disclaimer in the
37 | * documentation and/or other materials provided with the distribution.
38 | *
39 | * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
40 | * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
41 | * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
42 | * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
43 | * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
44 | * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
45 | * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
46 | * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
47 | * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
48 | * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
49 | * POSSIBILITY OF SUCH DAMAGE.
50 | */
51 |
52 | #include <errno.h>
53 | #include <stdlib.h>
54 | #include <string.h>
55 | #include "getopt.h"
56 | #include <stdarg.h>
57 | #include <stdio.h>
58 | #include <windows.h>
59 |
60 | #define REPLACE_GETOPT /* use this getopt as the system getopt(3) */
61 |
62 | #ifdef REPLACE_GETOPT
63 | int opterr = 1; /* if error message should be printed */
64 | int optind = 1; /* index into parent argv vector */
65 | int optopt = '?'; /* character checked for validity */
66 | #undef optreset /* see getopt.h */
67 | #define optreset __mingw_optreset
68 | int optreset; /* reset getopt */
69 | char *optarg; /* argument associated with option */
70 | #endif
71 |
72 | #define PRINT_ERROR ((opterr) && (*options != ':'))
73 |
74 | #define FLAG_PERMUTE 0x01 /* permute non-options to the end of argv */
75 | #define FLAG_ALLARGS 0x02 /* treat non-options as args to option "-1" */
76 | #define FLAG_LONGONLY 0x04 /* operate as getopt_long_only */
77 |
78 | /* return values */
79 | #define BADCH (int)'?'
80 | #define BADARG ((*options == ':') ? (int)':' : (int)'?')
81 | #define INORDER (int)1
82 |
83 | #ifndef __CYGWIN__
84 | #define __progname __argv[0]
85 | #else
86 | extern char __declspec(dllimport) *__progname;
87 | #endif
88 |
89 | #ifdef __CYGWIN__
90 | static char EMSG[] = "";
91 | #else
92 | #define EMSG ""
93 | #endif
94 |
95 | static int getopt_internal(int, char * const *, const char *,
96 | const struct option *, int *, int);
97 | static int parse_long_options(char * const *, const char *,
98 | const struct option *, int *, int);
99 | static int gcd(int, int);
100 | static void permute_args(int, int, int, char * const *);
101 |
102 | static char *place = EMSG; /* option letter processing */
103 |
104 | /* XXX: set optreset to 1 rather than these two */
105 | static int nonopt_start = -1; /* first non option argument (for permute) */
106 | static int nonopt_end = -1; /* first option after non options (for permute) */
107 |
108 | /* Error messages */
109 | static const char recargchar[] = "option requires an argument -- %c";
110 | static const char recargstring[] = "option requires an argument -- %s";
111 | static const char ambig[] = "ambiguous option -- %.*s";
112 | static const char noarg[] = "option doesn't take an argument -- %.*s";
113 | static const char illoptchar[] = "unknown option -- %c";
114 | static const char illoptstring[] = "unknown option -- %s";
115 |
116 | static void
117 | _vwarnx(const char *fmt,va_list ap)
118 | {
119 | (void)fprintf(stderr,"%s: ",__progname);
120 | if (fmt != NULL)
121 | (void)vfprintf(stderr,fmt,ap);
122 | (void)fprintf(stderr,"\n");
123 | }
124 |
125 | static void
126 | warnx(const char *fmt,...)
127 | {
128 | va_list ap;
129 | va_start(ap,fmt);
130 | _vwarnx(fmt,ap);
131 | va_end(ap);
132 | }
133 |
134 | /*
135 | * Compute the greatest common divisor of a and b.
136 | */
137 | static int
138 | gcd(int a, int b)
139 | {
140 | int c;
141 |
142 | c = a % b;
143 | while (c != 0) {
144 | a = b;
145 | b = c;
146 | c = a % b;
147 | }
148 |
149 | return (b);
150 | }
151 |
152 | /*
153 | * Exchange the block from nonopt_start to nonopt_end with the block
154 | * from nonopt_end to opt_end (keeping the same order of arguments
155 | * in each block).
156 | */
157 | static void
158 | permute_args(int panonopt_start, int panonopt_end, int opt_end,
159 | char * const *nargv)
160 | {
161 | int cstart, cyclelen, i, j, ncycle, nnonopts, nopts, pos;
162 | char *swap;
163 |
164 | /*
165 | * compute lengths of blocks and number and size of cycles
166 | */
167 | nnonopts = panonopt_end - panonopt_start;
168 | nopts = opt_end - panonopt_end;
169 | ncycle = gcd(nnonopts, nopts);
170 | cyclelen = (opt_end - panonopt_start) / ncycle;
171 |
172 | for (i = 0; i < ncycle; i++) {
173 | cstart = panonopt_end+i;
174 | pos = cstart;
175 | for (j = 0; j < cyclelen; j++) {
176 | if (pos >= panonopt_end)
177 | pos -= nnonopts;
178 | else
179 | pos += nopts;
180 | swap = nargv[pos];
181 | /* LINTED const cast */
182 | ((char **) nargv)[pos] = nargv[cstart];
183 | /* LINTED const cast */
184 | ((char **)nargv)[cstart] = swap;
185 | }
186 | }
187 | }
188 |
189 | /*
190 | * parse_long_options --
191 | * Parse long options in argc/argv argument vector.
192 | * Returns -1 if short_too is set and the option does not match long_options.
193 | */
194 | static int
195 | parse_long_options(char * const *nargv, const char *options,
196 | const struct option *long_options, int *idx, int short_too)
197 | {
198 | char *current_argv, *has_equal;
199 | size_t current_argv_len;
200 | int i, ambiguous, match;
201 |
202 | #define IDENTICAL_INTERPRETATION(_x, _y) \
203 | (long_options[(_x)].has_arg == long_options[(_y)].has_arg && \
204 | long_options[(_x)].flag == long_options[(_y)].flag && \
205 | long_options[(_x)].val == long_options[(_y)].val)
206 |
207 | current_argv = place;
208 | match = -1;
209 | ambiguous = 0;
210 |
211 | optind++;
212 |
213 | if ((has_equal = strchr(current_argv, '=')) != NULL) {
214 | /* argument found (--option=arg) */
215 | current_argv_len = has_equal - current_argv;
216 | has_equal++;
217 | } else
218 | current_argv_len = strlen(current_argv);
219 |
220 | for (i = 0; long_options[i].name; i++) {
221 | /* find matching long option */
222 | if (strncmp(current_argv, long_options[i].name,
223 | current_argv_len))
224 | continue;
225 |
226 | if (strlen(long_options[i].name) == current_argv_len) {
227 | /* exact match */
228 | match = i;
229 | ambiguous = 0;
230 | break;
231 | }
232 | /*
233 | * If this is a known short option, don't allow
234 | * a partial match of a single character.
235 | */
236 | if (short_too && current_argv_len == 1)
237 | continue;
238 |
239 | if (match == -1) /* partial match */
240 | match = i;
241 | else if (!IDENTICAL_INTERPRETATION(i, match))
242 | ambiguous = 1;
243 | }
244 | if (ambiguous) {
245 | /* ambiguous abbreviation */
246 | if (PRINT_ERROR)
247 | warnx(ambig, (int)current_argv_len,
248 | current_argv);
249 | optopt = 0;
250 | return (BADCH);
251 | }
252 | if (match != -1) { /* option found */
253 | if (long_options[match].has_arg == no_argument
254 | && has_equal) {
255 | if (PRINT_ERROR)
256 | warnx(noarg, (int)current_argv_len,
257 | current_argv);
258 | /*
259 | * XXX: GNU sets optopt to val regardless of flag
260 | */
261 | if (long_options[match].flag == NULL)
262 | optopt = long_options[match].val;
263 | else
264 | optopt = 0;
265 | return (BADARG);
266 | }
267 | if (long_options[match].has_arg == required_argument ||
268 | long_options[match].has_arg == optional_argument) {
269 | if (has_equal)
270 | optarg = has_equal;
271 | else if (long_options[match].has_arg ==
272 | required_argument) {
273 | /*
274 | * optional argument doesn't use next nargv
275 | */
276 | optarg = nargv[optind++];
277 | }
278 | }
279 | if ((long_options[match].has_arg == required_argument)
280 | && (optarg == NULL)) {
281 | /*
282 | * Missing argument; leading ':' indicates no error
283 | * should be generated.
284 | */
285 | if (PRINT_ERROR)
286 | warnx(recargstring,
287 | current_argv);
288 | /*
289 | * XXX: GNU sets optopt to val regardless of flag
290 | */
291 | if (long_options[match].flag == NULL)
292 | optopt = long_options[match].val;
293 | else
294 | optopt = 0;
295 | --optind;
296 | return (BADARG);
297 | }
298 | } else { /* unknown option */
299 | if (short_too) {
300 | --optind;
301 | return (-1);
302 | }
303 | if (PRINT_ERROR)
304 | warnx(illoptstring, current_argv);
305 | optopt = 0;
306 | return (BADCH);
307 | }
308 | if (idx)
309 | *idx = match;
310 | if (long_options[match].flag) {
311 | *long_options[match].flag = long_options[match].val;
312 | return (0);
313 | } else
314 | return (long_options[match].val);
315 | #undef IDENTICAL_INTERPRETATION
316 | }
317 |
318 | /*
319 | * getopt_internal --
320 | * Parse argc/argv argument vector. Called by user level routines.
321 | */
322 | static int
323 | getopt_internal(int nargc, char * const *nargv, const char *options,
324 | const struct option *long_options, int *idx, int flags)
325 | {
326 | char *oli; /* option letter list index */
327 | int optchar, short_too;
328 | static int posixly_correct = -1;
329 |
330 | if (options == NULL)
331 | return (-1);
332 |
333 | /*
334 | * XXX Some GNU programs (like cvs) set optind to 0 instead of
335 | * XXX using optreset. Work around this braindamage.
336 | */
337 | if (optind == 0)
338 | optind = optreset = 1;
339 |
340 | /*
341 | * Disable GNU extensions if POSIXLY_CORRECT is set or options
342 | * string begins with a '+'.
343 | *
344 | * CV, 2009-12-14: Check POSIXLY_CORRECT anew if optind == 0 or
345 | * optreset != 0 for GNU compatibility.
346 | */
347 | if (posixly_correct == -1 || optreset != 0)
348 | posixly_correct = (getenv("POSIXLY_CORRECT") != NULL);
349 | if (*options == '-')
350 | flags |= FLAG_ALLARGS;
351 | else if (posixly_correct || *options == '+')
352 | flags &= ~FLAG_PERMUTE;
353 | if (*options == '+' || *options == '-')
354 | options++;
355 |
356 | optarg = NULL;
357 | if (optreset)
358 | nonopt_start = nonopt_end = -1;
359 | start:
360 | if (optreset || !*place) { /* update scanning pointer */
361 | optreset = 0;
362 | if (optind >= nargc) { /* end of argument vector */
363 | place = EMSG;
364 | if (nonopt_end != -1) {
365 | /* do permutation, if we have to */
366 | permute_args(nonopt_start, nonopt_end,
367 | optind, nargv);
368 | optind -= nonopt_end - nonopt_start;
369 | }
370 | else if (nonopt_start != -1) {
371 | /*
372 | * If we skipped non-options, set optind
373 | * to the first of them.
374 | */
375 | optind = nonopt_start;
376 | }
377 | nonopt_start = nonopt_end = -1;
378 | return (-1);
379 | }
380 | if (*(place = nargv[optind]) != '-' ||
381 | (place[1] == '\0' && strchr(options, '-') == NULL)) {
382 | place = EMSG; /* found non-option */
383 | if (flags & FLAG_ALLARGS) {
384 | /*
385 | * GNU extension:
386 | * return non-option as argument to option 1
387 | */
388 | optarg = nargv[optind++];
389 | return (INORDER);
390 | }
391 | if (!(flags & FLAG_PERMUTE)) {
392 | /*
393 | * If no permutation wanted, stop parsing
394 | * at first non-option.
395 | */
396 | return (-1);
397 | }
398 | /* do permutation */
399 | if (nonopt_start == -1)
400 | nonopt_start = optind;
401 | else if (nonopt_end != -1) {
402 | permute_args(nonopt_start, nonopt_end,
403 | optind, nargv);
404 | nonopt_start = optind -
405 | (nonopt_end - nonopt_start);
406 | nonopt_end = -1;
407 | }
408 | optind++;
409 | /* process next argument */
410 | goto start;
411 | }
412 | if (nonopt_start != -1 && nonopt_end == -1)
413 | nonopt_end = optind;
414 |
415 | /*
416 | * If we have "-" do nothing, if "--" we are done.
417 | */
418 | if (place[1] != '\0' && *++place == '-' && place[1] == '\0') {
419 | optind++;
420 | place = EMSG;
421 | /*
422 | * We found an option (--), so if we skipped
423 | * non-options, we have to permute.
424 | */
425 | if (nonopt_end != -1) {
426 | permute_args(nonopt_start, nonopt_end,
427 | optind, nargv);
428 | optind -= nonopt_end - nonopt_start;
429 | }
430 | nonopt_start = nonopt_end = -1;
431 | return (-1);
432 | }
433 | }
434 |
435 | /*
436 | * Check long options if:
437 | * 1) we were passed some
438 | * 2) the arg is not just "-"
439 |  * 3) either the arg starts with -- or we are getopt_long_only()
440 | */
441 | if (long_options != NULL && place != nargv[optind] &&
442 | (*place == '-' || (flags & FLAG_LONGONLY))) {
443 | short_too = 0;
444 | if (*place == '-')
445 | place++; /* --foo long option */
446 | else if (*place != ':' && strchr(options, *place) != NULL)
447 | short_too = 1; /* could be short option too */
448 |
449 | optchar = parse_long_options(nargv, options, long_options,
450 | idx, short_too);
451 | if (optchar != -1) {
452 | place = EMSG;
453 | return (optchar);
454 | }
455 | }
456 |
457 | if ((optchar = (int)*place++) == (int)':' ||
458 | (optchar == (int)'-' && *place != '\0') ||
459 | (oli = strchr(options, optchar)) == NULL) {
460 | /*
461 | * If the user specified "-" and '-' isn't listed in
462 | * options, return -1 (non-option) as per POSIX.
463 | * Otherwise, it is an unknown option character (or ':').
464 | */
465 | if (optchar == (int)'-' && *place == '\0')
466 | return (-1);
467 | if (!*place)
468 | ++optind;
469 | if (PRINT_ERROR)
470 | warnx(illoptchar, optchar);
471 | optopt = optchar;
472 | return (BADCH);
473 | }
474 | if (long_options != NULL && optchar == 'W' && oli[1] == ';') {
475 | /* -W long-option */
476 | if (*place) /* no space */
477 | /* NOTHING */;
478 | else if (++optind >= nargc) { /* no arg */
479 | place = EMSG;
480 | if (PRINT_ERROR)
481 | warnx(recargchar, optchar);
482 | optopt = optchar;
483 | return (BADARG);
484 | } else /* white space */
485 | place = nargv[optind];
486 | optchar = parse_long_options(nargv, options, long_options,
487 | idx, 0);
488 | place = EMSG;
489 | return (optchar);
490 | }
491 | if (*++oli != ':') { /* doesn't take argument */
492 | if (!*place)
493 | ++optind;
494 | } else { /* takes (optional) argument */
495 | optarg = NULL;
496 | if (*place) /* no white space */
497 | optarg = place;
498 | else if (oli[1] != ':') { /* arg not optional */
499 | if (++optind >= nargc) { /* no arg */
500 | place = EMSG;
501 | if (PRINT_ERROR)
502 | warnx(recargchar, optchar);
503 | optopt = optchar;
504 | return (BADARG);
505 | } else
506 | optarg = nargv[optind];
507 | }
508 | place = EMSG;
509 | ++optind;
510 | }
511 | /* dump back option letter */
512 | return (optchar);
513 | }
514 |
515 | #ifdef REPLACE_GETOPT
516 | /*
517 | * getopt --
518 | * Parse argc/argv argument vector.
519 | *
520 | * [eventually this will replace the BSD getopt]
521 | */
522 | int
523 | getopt(int nargc, char * const *nargv, const char *options)
524 | {
525 |
526 | /*
527 | * We don't pass FLAG_PERMUTE to getopt_internal() since
528 | * the BSD getopt(3) (unlike GNU) has never done this.
529 | *
530 | * Furthermore, since many privileged programs call getopt()
531 | * before dropping privileges it makes sense to keep things
532 | * as simple (and bug-free) as possible.
533 | */
534 | return (getopt_internal(nargc, nargv, options, NULL, NULL, 0));
535 | }
536 | #endif /* REPLACE_GETOPT */
537 |
538 | /*
539 | * getopt_long --
540 | * Parse argc/argv argument vector.
541 | */
542 | int
543 | getopt_long(int nargc, char * const *nargv, const char *options,
544 | const struct option *long_options, int *idx)
545 | {
546 |
547 | return (getopt_internal(nargc, nargv, options, long_options, idx,
548 | FLAG_PERMUTE));
549 | }
550 |
551 | /*
552 | * getopt_long_only --
553 | * Parse argc/argv argument vector.
554 | */
555 | int
556 | getopt_long_only(int nargc, char * const *nargv, const char *options,
557 | const struct option *long_options, int *idx)
558 | {
559 |
560 | return (getopt_internal(nargc, nargv, options, long_options, idx,
561 | FLAG_PERMUTE|FLAG_LONGONLY));
562 | }
563 |
--------------------------------------------------------------------------------
/vs/getopt.h:
--------------------------------------------------------------------------------
1 | #ifndef __GETOPT_H__
2 | /**
3 | * DISCLAIMER
4 | * This file has no copyright assigned and is placed in the Public Domain.
5 | * This file is a part of the w64 mingw-runtime package.
6 | *
7 | * The w64 mingw-runtime package and its code is distributed in the hope that it
8 | * will be useful but WITHOUT ANY WARRANTY. ALL WARRANTIES, EXPRESSED OR
9 | * IMPLIED ARE HEREBY DISCLAIMED. This includes but is not limited to
10 | * warranties of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
11 | */
12 |
13 | #define __GETOPT_H__
14 |
15 | /* All the headers include this file. */
16 | #if _MSC_VER >= 1300
17 | #include <crtdefs.h>
18 | #endif
19 |
20 | #ifdef __cplusplus
21 | extern "C" {
22 | #endif
23 |
24 | extern int optind; /* index of first non-option in argv */
25 | extern int optopt; /* single option character, as parsed */
26 | extern int opterr; /* flag to enable built-in diagnostics... */
27 | /* (user may set to zero, to suppress) */
28 |
29 | extern char *optarg; /* pointer to argument of current option */
30 |
31 | extern int getopt(int nargc, char * const *nargv, const char *options);
32 |
33 | #ifdef _BSD_SOURCE
34 | /*
35 | * BSD adds the non-standard `optreset' feature, for reinitialisation
36 | * of `getopt' parsing. We support this feature, for applications which
37 | * proclaim their BSD heritage, before including this header; however,
38 | * to maintain portability, developers are advised to avoid it.
39 | */
40 | # define optreset __mingw_optreset
41 | extern int optreset;
42 | #endif
43 | #ifdef __cplusplus
44 | }
45 | #endif
46 | /*
47 | * POSIX requires the `getopt' API to be specified in `unistd.h';
48 | * thus, `unistd.h' includes this header. However, we do not want
49 | * to expose the `getopt_long' or `getopt_long_only' APIs, when
50 | * included in this manner. Thus, close the standard __GETOPT_H__
51 | * declarations block, and open an additional __GETOPT_LONG_H__
52 | * specific block, only when *not* __UNISTD_H_SOURCED__, in which
53 | * to declare the extended API.
54 | */
55 | #endif /* !defined(__GETOPT_H__) */
56 |
57 | #if !defined(__UNISTD_H_SOURCED__) && !defined(__GETOPT_LONG_H__)
58 | #define __GETOPT_LONG_H__
59 |
60 | #ifdef __cplusplus
61 | extern "C" {
62 | #endif
63 |
64 | struct option /* specification for a long form option... */
65 | {
66 | const char *name; /* option name, without leading hyphens */
67 | int has_arg; /* does it take an argument? */
68 | int *flag; /* where to save its status, or NULL */
69 | int val; /* its associated status value */
70 | };
71 |
72 | enum /* permitted values for its `has_arg' field... */
73 | {
74 | no_argument = 0, /* option never takes an argument */
75 | required_argument, /* option always requires an argument */
76 | optional_argument /* option may take an argument */
77 | };
78 |
79 | extern int getopt_long(int nargc, char * const *nargv, const char *options,
80 | const struct option *long_options, int *idx);
81 | extern int getopt_long_only(int nargc, char * const *nargv, const char *options,
82 | const struct option *long_options, int *idx);
83 | /*
84 | * Previous MinGW implementation had...
85 | */
86 | #ifndef HAVE_DECL_GETOPT
87 | /*
88 | * ...for the long form API only; keep this for compatibility.
89 | */
90 | # define HAVE_DECL_GETOPT 1
91 | #endif
92 |
93 | #ifdef __cplusplus
94 | }
95 | #endif
96 |
97 | #endif /* !defined(__UNISTD_H_SOURCED__) && !defined(__GETOPT_LONG_H__) */
98 |
--------------------------------------------------------------------------------
/vs/inttypes.h:
--------------------------------------------------------------------------------
1 | // ISO C9x compliant inttypes.h for Microsoft Visual Studio
2 | // Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
3 | //
4 | // Copyright (c) 2006-2013 Alexander Chemeris
5 | //
6 | // Redistribution and use in source and binary forms, with or without
7 | // modification, are permitted provided that the following conditions are met:
8 | //
9 | // 1. Redistributions of source code must retain the above copyright notice,
10 | // this list of conditions and the following disclaimer.
11 | //
12 | // 2. Redistributions in binary form must reproduce the above copyright
13 | // notice, this list of conditions and the following disclaimer in the
14 | // documentation and/or other materials provided with the distribution.
15 | //
16 | // 3. Neither the name of the product nor the names of its contributors may
17 | // be used to endorse or promote products derived from this software
18 | // without specific prior written permission.
19 | //
20 | // THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
21 | // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
22 | // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
23 | // EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24 | // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
25 | // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
26 | // OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
27 | // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
28 | // OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
29 | // ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | //
31 | ///////////////////////////////////////////////////////////////////////////////
32 |
33 | #ifndef _MSC_VER // [
34 | #error "Use this header only with Microsoft Visual C++ compilers!"
35 | #endif // _MSC_VER ]
36 |
37 | #ifndef _MSC_INTTYPES_H_ // [
38 | #define _MSC_INTTYPES_H_
39 |
40 | #if _MSC_VER > 1000
41 | #pragma once
42 | #endif
43 |
44 | #include "stdint.h"
45 |
46 | // 7.8 Format conversion of integer types
47 |
48 | typedef struct {
49 | intmax_t quot;
50 | intmax_t rem;
51 | } imaxdiv_t;
52 |
53 | // 7.8.1 Macros for format specifiers
54 |
55 | #if !defined(__cplusplus) || defined(__STDC_FORMAT_MACROS) // [ See footnote 185 at page 198
56 |
57 | // The fprintf macros for signed integers are:
58 | #define PRId8 "d"
59 | #define PRIi8 "i"
60 | #define PRIdLEAST8 "d"
61 | #define PRIiLEAST8 "i"
62 | #define PRIdFAST8 "d"
63 | #define PRIiFAST8 "i"
64 |
65 | #define PRId16 "hd"
66 | #define PRIi16 "hi"
67 | #define PRIdLEAST16 "hd"
68 | #define PRIiLEAST16 "hi"
69 | #define PRIdFAST16 "hd"
70 | #define PRIiFAST16 "hi"
71 |
72 | #define PRId32 "I32d"
73 | #define PRIi32 "I32i"
74 | #define PRIdLEAST32 "I32d"
75 | #define PRIiLEAST32 "I32i"
76 | #define PRIdFAST32 "I32d"
77 | #define PRIiFAST32 "I32i"
78 |
79 | #define PRId64 "I64d"
80 | #define PRIi64 "I64i"
81 | #define PRIdLEAST64 "I64d"
82 | #define PRIiLEAST64 "I64i"
83 | #define PRIdFAST64 "I64d"
84 | #define PRIiFAST64 "I64i"
85 |
86 | #define PRIdMAX "I64d"
87 | #define PRIiMAX "I64i"
88 |
89 | #define PRIdPTR "Id"
90 | #define PRIiPTR "Ii"
91 |
92 | // The fprintf macros for unsigned integers are:
93 | #define PRIo8 "o"
94 | #define PRIu8 "u"
95 | #define PRIx8 "x"
96 | #define PRIX8 "X"
97 | #define PRIoLEAST8 "o"
98 | #define PRIuLEAST8 "u"
99 | #define PRIxLEAST8 "x"
100 | #define PRIXLEAST8 "X"
101 | #define PRIoFAST8 "o"
102 | #define PRIuFAST8 "u"
103 | #define PRIxFAST8 "x"
104 | #define PRIXFAST8 "X"
105 |
106 | #define PRIo16 "ho"
107 | #define PRIu16 "hu"
108 | #define PRIx16 "hx"
109 | #define PRIX16 "hX"
110 | #define PRIoLEAST16 "ho"
111 | #define PRIuLEAST16 "hu"
112 | #define PRIxLEAST16 "hx"
113 | #define PRIXLEAST16 "hX"
114 | #define PRIoFAST16 "ho"
115 | #define PRIuFAST16 "hu"
116 | #define PRIxFAST16 "hx"
117 | #define PRIXFAST16 "hX"
118 |
119 | #define PRIo32 "I32o"
120 | #define PRIu32 "I32u"
121 | #define PRIx32 "I32x"
122 | #define PRIX32 "I32X"
123 | #define PRIoLEAST32 "I32o"
124 | #define PRIuLEAST32 "I32u"
125 | #define PRIxLEAST32 "I32x"
126 | #define PRIXLEAST32 "I32X"
127 | #define PRIoFAST32 "I32o"
128 | #define PRIuFAST32 "I32u"
129 | #define PRIxFAST32 "I32x"
130 | #define PRIXFAST32 "I32X"
131 |
132 | #define PRIo64 "I64o"
133 | #define PRIu64 "I64u"
134 | #define PRIx64 "I64x"
135 | #define PRIX64 "I64X"
136 | #define PRIoLEAST64 "I64o"
137 | #define PRIuLEAST64 "I64u"
138 | #define PRIxLEAST64 "I64x"
139 | #define PRIXLEAST64 "I64X"
140 | #define PRIoFAST64 "I64o"
141 | #define PRIuFAST64 "I64u"
142 | #define PRIxFAST64 "I64x"
143 | #define PRIXFAST64 "I64X"
144 |
145 | #define PRIoMAX "I64o"
146 | #define PRIuMAX "I64u"
147 | #define PRIxMAX "I64x"
148 | #define PRIXMAX "I64X"
149 |
150 | #define PRIoPTR "Io"
151 | #define PRIuPTR "Iu"
152 | #define PRIxPTR "Ix"
153 | #define PRIXPTR "IX"
154 |
155 | // The fscanf macros for signed integers are:
156 | #define SCNd8 "d"
157 | #define SCNi8 "i"
158 | #define SCNdLEAST8 "d"
159 | #define SCNiLEAST8 "i"
160 | #define SCNdFAST8 "d"
161 | #define SCNiFAST8 "i"
162 |
163 | #define SCNd16 "hd"
164 | #define SCNi16 "hi"
165 | #define SCNdLEAST16 "hd"
166 | #define SCNiLEAST16 "hi"
167 | #define SCNdFAST16 "hd"
168 | #define SCNiFAST16 "hi"
169 |
170 | #define SCNd32 "ld"
171 | #define SCNi32 "li"
172 | #define SCNdLEAST32 "ld"
173 | #define SCNiLEAST32 "li"
174 | #define SCNdFAST32 "ld"
175 | #define SCNiFAST32 "li"
176 |
177 | #define SCNd64 "I64d"
178 | #define SCNi64 "I64i"
179 | #define SCNdLEAST64 "I64d"
180 | #define SCNiLEAST64 "I64i"
181 | #define SCNdFAST64 "I64d"
182 | #define SCNiFAST64 "I64i"
183 |
184 | #define SCNdMAX "I64d"
185 | #define SCNiMAX "I64i"
186 |
187 | #ifdef _WIN64 // [
188 | # define SCNdPTR "I64d"
189 | # define SCNiPTR "I64i"
190 | #else // _WIN64 ][
191 | # define SCNdPTR "ld"
192 | # define SCNiPTR "li"
193 | #endif // _WIN64 ]
194 |
195 | // The fscanf macros for unsigned integers are:
196 | #define SCNo8 "o"
197 | #define SCNu8 "u"
198 | #define SCNx8 "x"
199 | #define SCNX8 "X"
200 | #define SCNoLEAST8 "o"
201 | #define SCNuLEAST8 "u"
202 | #define SCNxLEAST8 "x"
203 | #define SCNXLEAST8 "X"
204 | #define SCNoFAST8 "o"
205 | #define SCNuFAST8 "u"
206 | #define SCNxFAST8 "x"
207 | #define SCNXFAST8 "X"
208 |
209 | #define SCNo16 "ho"
210 | #define SCNu16 "hu"
211 | #define SCNx16 "hx"
212 | #define SCNX16 "hX"
213 | #define SCNoLEAST16 "ho"
214 | #define SCNuLEAST16 "hu"
215 | #define SCNxLEAST16 "hx"
216 | #define SCNXLEAST16 "hX"
217 | #define SCNoFAST16 "ho"
218 | #define SCNuFAST16 "hu"
219 | #define SCNxFAST16 "hx"
220 | #define SCNXFAST16 "hX"
221 |
222 | #define SCNo32 "lo"
223 | #define SCNu32 "lu"
224 | #define SCNx32 "lx"
225 | #define SCNX32 "lX"
226 | #define SCNoLEAST32 "lo"
227 | #define SCNuLEAST32 "lu"
228 | #define SCNxLEAST32 "lx"
229 | #define SCNXLEAST32 "lX"
230 | #define SCNoFAST32 "lo"
231 | #define SCNuFAST32 "lu"
232 | #define SCNxFAST32 "lx"
233 | #define SCNXFAST32 "lX"
234 |
235 | #define SCNo64 "I64o"
236 | #define SCNu64 "I64u"
237 | #define SCNx64 "I64x"
238 | #define SCNX64 "I64X"
239 | #define SCNoLEAST64 "I64o"
240 | #define SCNuLEAST64 "I64u"
241 | #define SCNxLEAST64 "I64x"
242 | #define SCNXLEAST64 "I64X"
243 | #define SCNoFAST64 "I64o"
244 | #define SCNuFAST64 "I64u"
245 | #define SCNxFAST64 "I64x"
246 | #define SCNXFAST64 "I64X"
247 |
248 | #define SCNoMAX "I64o"
249 | #define SCNuMAX "I64u"
250 | #define SCNxMAX "I64x"
251 | #define SCNXMAX "I64X"
252 |
253 | #ifdef _WIN64 // [
254 | # define SCNoPTR "I64o"
255 | # define SCNuPTR "I64u"
256 | # define SCNxPTR "I64x"
257 | # define SCNXPTR "I64X"
258 | #else // _WIN64 ][
259 | # define SCNoPTR "lo"
260 | # define SCNuPTR "lu"
261 | # define SCNxPTR "lx"
262 | # define SCNXPTR "lX"
263 | #endif // _WIN64 ]
264 |
265 | #endif // __STDC_FORMAT_MACROS ]
266 |
267 | // 7.8.2 Functions for greatest-width integer types
268 |
269 | // 7.8.2.1 The imaxabs function
270 | #define imaxabs _abs64
271 |
272 | // 7.8.2.2 The imaxdiv function
273 |
274 | // This is modified version of div() function from Microsoft's div.c found
275 | // in %MSVC.NET%\crt\src\div.c.
276 | #ifdef STATIC_IMAXDIV // [
277 | static
278 | #else // STATIC_IMAXDIV ][
279 | _inline
280 | #endif // STATIC_IMAXDIV ]
281 | imaxdiv_t __cdecl imaxdiv(intmax_t numer, intmax_t denom)
282 | {
283 | imaxdiv_t result;
284 |
285 | result.quot = numer / denom;
286 | result.rem = numer % denom;
287 |
288 | if (numer < 0 && result.rem > 0) {
289 | // did division wrong; must fix up
290 | ++result.quot;
291 | result.rem -= denom;
292 | }
293 |
294 | return result;
295 | }
296 |
297 | // 7.8.2.3 The strtoimax and strtoumax functions
298 | #define strtoimax _strtoi64
299 | #define strtoumax _strtoui64
300 |
301 | // 7.8.2.4 The wcstoimax and wcstoumax functions
302 | #define wcstoimax _wcstoi64
303 | #define wcstoumax _wcstoui64
304 |
305 |
306 | #endif // _MSC_INTTYPES_H_ ]
307 |
--------------------------------------------------------------------------------
/vs/stdint.h:
--------------------------------------------------------------------------------
1 | // ISO C9x compliant stdint.h for Microsoft Visual Studio
2 | // Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124
3 | //
4 | // Copyright (c) 2006-2013 Alexander Chemeris
5 | //
6 | // Redistribution and use in source and binary forms, with or without
7 | // modification, are permitted provided that the following conditions are met:
8 | //
9 | // 1. Redistributions of source code must retain the above copyright notice,
10 | // this list of conditions and the following disclaimer.
11 | //
12 | // 2. Redistributions in binary form must reproduce the above copyright
13 | // notice, this list of conditions and the following disclaimer in the
14 | // documentation and/or other materials provided with the distribution.
15 | //
16 | // 3. Neither the name of the product nor the names of its contributors may
17 | // be used to endorse or promote products derived from this software
18 | // without specific prior written permission.
19 | //
20 | // THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED
21 | // WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
22 | // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
23 | // EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
24 | // SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
25 | // PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
26 | // OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
27 | // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
28 | // OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
29 | // ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | //
31 | ///////////////////////////////////////////////////////////////////////////////
32 |
33 | #ifndef _MSC_VER // [
34 | #error "Use this header only with Microsoft Visual C++ compilers!"
35 | #endif // _MSC_VER ]
36 |
37 | #ifndef _MSC_STDINT_H_ // [
38 | #define _MSC_STDINT_H_
39 |
40 | #if _MSC_VER > 1000
41 | #pragma once
42 | #endif
43 |
44 | #if _MSC_VER >= 1600 // [
45 | #include <stdint.h>
46 | #else // ] _MSC_VER >= 1600 [
47 |
48 | #include <limits.h>
49 |
50 | // For Visual Studio 6 in C++ mode and for many Visual Studio versions when
51 | // compiling for ARM we should wrap <wchar.h> include with 'extern "C++" {}'
52 | // or compiler give many errors like this:
53 | // error C2733: second C linkage of overloaded function 'wmemchr' not allowed
54 | #ifdef __cplusplus
55 | extern "C" {
56 | #endif
57 | #  include <wchar.h>
58 | #ifdef __cplusplus
59 | }
60 | #endif
61 |
62 | // Define _W64 macros to mark types changing their size, like intptr_t.
63 | #ifndef _W64
64 | # if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300
65 | # define _W64 __w64
66 | # else
67 | # define _W64
68 | # endif
69 | #endif
70 |
71 |
72 | // 7.18.1 Integer types
73 |
74 | // 7.18.1.1 Exact-width integer types
75 |
76 | // Visual Studio 6 and Embedded Visual C++ 4 doesn't
77 | // realize that, e.g. char has the same size as __int8
78 | // so we give up on __intX for them.
79 | #if (_MSC_VER < 1300)
80 | typedef signed char int8_t;
81 | typedef signed short int16_t;
82 | typedef signed int int32_t;
83 | typedef unsigned char uint8_t;
84 | typedef unsigned short uint16_t;
85 | typedef unsigned int uint32_t;
86 | #else
87 | typedef signed __int8 int8_t;
88 | typedef signed __int16 int16_t;
89 | typedef signed __int32 int32_t;
90 | typedef unsigned __int8 uint8_t;
91 | typedef unsigned __int16 uint16_t;
92 | typedef unsigned __int32 uint32_t;
93 | #endif
94 | typedef signed __int64 int64_t;
95 | typedef unsigned __int64 uint64_t;
96 |
97 |
98 | // 7.18.1.2 Minimum-width integer types
99 | typedef int8_t int_least8_t;
100 | typedef int16_t int_least16_t;
101 | typedef int32_t int_least32_t;
102 | typedef int64_t int_least64_t;
103 | typedef uint8_t uint_least8_t;
104 | typedef uint16_t uint_least16_t;
105 | typedef uint32_t uint_least32_t;
106 | typedef uint64_t uint_least64_t;
107 |
108 | // 7.18.1.3 Fastest minimum-width integer types
109 | typedef int8_t int_fast8_t;
110 | typedef int16_t int_fast16_t;
111 | typedef int32_t int_fast32_t;
112 | typedef int64_t int_fast64_t;
113 | typedef uint8_t uint_fast8_t;
114 | typedef uint16_t uint_fast16_t;
115 | typedef uint32_t uint_fast32_t;
116 | typedef uint64_t uint_fast64_t;
117 |
118 | // 7.18.1.4 Integer types capable of holding object pointers
119 | #ifdef _WIN64 // [
120 | typedef signed __int64 intptr_t;
121 | typedef unsigned __int64 uintptr_t;
122 | #else // _WIN64 ][
123 | typedef _W64 signed int intptr_t;
124 | typedef _W64 unsigned int uintptr_t;
125 | #endif // _WIN64 ]
126 |
127 | // 7.18.1.5 Greatest-width integer types
128 | typedef int64_t intmax_t;
129 | typedef uint64_t uintmax_t;
130 |
131 |
132 | // 7.18.2 Limits of specified-width integer types
133 |
134 | #if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259
135 |
136 | // 7.18.2.1 Limits of exact-width integer types
137 | #define INT8_MIN ((int8_t)_I8_MIN)
138 | #define INT8_MAX _I8_MAX
139 | #define INT16_MIN ((int16_t)_I16_MIN)
140 | #define INT16_MAX _I16_MAX
141 | #define INT32_MIN ((int32_t)_I32_MIN)
142 | #define INT32_MAX _I32_MAX
143 | #define INT64_MIN ((int64_t)_I64_MIN)
144 | #define INT64_MAX _I64_MAX
145 | #define UINT8_MAX _UI8_MAX
146 | #define UINT16_MAX _UI16_MAX
147 | #define UINT32_MAX _UI32_MAX
148 | #define UINT64_MAX _UI64_MAX
149 |
150 | // 7.18.2.2 Limits of minimum-width integer types
151 | #define INT_LEAST8_MIN INT8_MIN
152 | #define INT_LEAST8_MAX INT8_MAX
153 | #define INT_LEAST16_MIN INT16_MIN
154 | #define INT_LEAST16_MAX INT16_MAX
155 | #define INT_LEAST32_MIN INT32_MIN
156 | #define INT_LEAST32_MAX INT32_MAX
157 | #define INT_LEAST64_MIN INT64_MIN
158 | #define INT_LEAST64_MAX INT64_MAX
159 | #define UINT_LEAST8_MAX UINT8_MAX
160 | #define UINT_LEAST16_MAX UINT16_MAX
161 | #define UINT_LEAST32_MAX UINT32_MAX
162 | #define UINT_LEAST64_MAX UINT64_MAX
163 |
164 | // 7.18.2.3 Limits of fastest minimum-width integer types
165 | #define INT_FAST8_MIN INT8_MIN
166 | #define INT_FAST8_MAX INT8_MAX
167 | #define INT_FAST16_MIN INT16_MIN
168 | #define INT_FAST16_MAX INT16_MAX
169 | #define INT_FAST32_MIN INT32_MIN
170 | #define INT_FAST32_MAX INT32_MAX
171 | #define INT_FAST64_MIN INT64_MIN
172 | #define INT_FAST64_MAX INT64_MAX
173 | #define UINT_FAST8_MAX UINT8_MAX
174 | #define UINT_FAST16_MAX UINT16_MAX
175 | #define UINT_FAST32_MAX UINT32_MAX
176 | #define UINT_FAST64_MAX UINT64_MAX
177 |
178 | // 7.18.2.4 Limits of integer types capable of holding object pointers
179 | #ifdef _WIN64 // [
180 | # define INTPTR_MIN INT64_MIN
181 | # define INTPTR_MAX INT64_MAX
182 | # define UINTPTR_MAX UINT64_MAX
183 | #else // _WIN64 ][
184 | # define INTPTR_MIN INT32_MIN
185 | # define INTPTR_MAX INT32_MAX
186 | # define UINTPTR_MAX UINT32_MAX
187 | #endif // _WIN64 ]
188 |
189 | // 7.18.2.5 Limits of greatest-width integer types
190 | #define INTMAX_MIN INT64_MIN
191 | #define INTMAX_MAX INT64_MAX
192 | #define UINTMAX_MAX UINT64_MAX
193 |
194 | // 7.18.3 Limits of other integer types
195 |
196 | #ifdef _WIN64 // [
197 | # define PTRDIFF_MIN _I64_MIN
198 | # define PTRDIFF_MAX _I64_MAX
199 | #else // _WIN64 ][
200 | # define PTRDIFF_MIN _I32_MIN
201 | # define PTRDIFF_MAX _I32_MAX
202 | #endif // _WIN64 ]
203 |
204 | #define SIG_ATOMIC_MIN INT_MIN
205 | #define SIG_ATOMIC_MAX INT_MAX
206 |
207 | #ifndef SIZE_MAX // [
208 | # ifdef _WIN64 // [
209 | # define SIZE_MAX _UI64_MAX
210 | # else // _WIN64 ][
211 | # define SIZE_MAX _UI32_MAX
212 | # endif // _WIN64 ]
213 | #endif // SIZE_MAX ]
214 |
215 | // WCHAR_MIN and WCHAR_MAX are also defined in <wchar.h>
216 | #ifndef WCHAR_MIN // [
217 | # define WCHAR_MIN 0
218 | #endif // WCHAR_MIN ]
219 | #ifndef WCHAR_MAX // [
220 | # define WCHAR_MAX _UI16_MAX
221 | #endif // WCHAR_MAX ]
222 |
223 | #define WINT_MIN 0
224 | #define WINT_MAX _UI16_MAX
225 |
226 | #endif // __STDC_LIMIT_MACROS ]
227 |
228 |
229 | // 7.18.4 Limits of other integer types
230 |
231 | #if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260
232 |
233 | // 7.18.4.1 Macros for minimum-width integer constants
234 |
235 | #define INT8_C(val) val##i8
236 | #define INT16_C(val) val##i16
237 | #define INT32_C(val) val##i32
238 | #define INT64_C(val) val##i64
239 |
240 | #define UINT8_C(val) val##ui8
241 | #define UINT16_C(val) val##ui16
242 | #define UINT32_C(val) val##ui32
243 | #define UINT64_C(val) val##ui64
244 |
245 | // 7.18.4.2 Macros for greatest-width integer constants
246 | // These #ifndef's are needed to prevent collisions with <stdint.h>.
247 | // Check out Issue 9 for the details.
248 | #ifndef INTMAX_C // [
249 | # define INTMAX_C INT64_C
250 | #endif // INTMAX_C ]
251 | #ifndef UINTMAX_C // [
252 | # define UINTMAX_C UINT64_C
253 | #endif // UINTMAX_C ]
254 |
255 | #endif // __STDC_CONSTANT_MACROS ]
256 |
257 | #endif // _MSC_VER >= 1600 ]
258 |
259 | #endif // _MSC_STDINT_H_ ]
260 |
--------------------------------------------------------------------------------