├── AUTHORS
├── LICENSE
├── Makefile
├── README.md
├── asm
    ├── countavx2_32.S
    ├── countavx2_64.S
    ├── countavx512_32.S
    ├── countavx512_64.S
    ├── countneon_32.S
    ├── countneon_64.S
    ├── kernelavx2.S
    ├── kernelavx512.S
    └── kernelneon.S
├── benchmark
    ├── golike
    │   └── benchmark.c
    └── linux
        ├── aligned_alloc.h
        ├── instrumented_benchmark.cpp
        ├── linux-perf-events.h
        ├── memalloc.h
        └── stream_benchmark.cpp
├── goscript.sh
├── include
    ├── pospopcnt.h
    └── pospopcnt_avx512bw.h
├── results.txt
├── script.sh
├── smallresults.txt
└── verylargeresults.txt


/AUTHORS:
--------------------------------------------------------------------------------
1 | Marcus D. R. Klarqvist 
2 | Wojciech Muła
3 | Daniel Lemire
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright 2019, Marcus D. R. Klarqvist, Wojciech Muła and Daniel Lemire
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
 1 | OPTFLAGS  := -O3 -march=native
 2 | CFLAGS     = -std=c99 -DFLAGSIZE=$(FLAGSIZE) $(OPTFLAGS) $(DEBUG_FLAGS)
 3 | FLAGSIZE   = 32
 4 | #FLAGSIZE  = 64
 5 | CPPFLAGS   = -std=c++0x -DFLAGSIZE=$(FLAGSIZE) $(OPTFLAGS) $(DEBUG_FLAGS)
 6 | ARCH      != uname -m
 7 | 
 8 | # Default target
 9 | all: instrumented_benchmark golike_benchmark
10 | 
11 | # Generic rules
12 | itest: instrumented_benchmark
13 | 	$(CXX) --version
14 | 	./instrumented_benchmark
15 | 
16 | DEPS=benchmark/linux/linux-perf-events.h \
17 |     include/pospopcnt_avx512bw.h  \
18 |     include/pospopcnt.h
19 | 
20 | KERNELS= $(KERNELS_$(ARCH))
21 | KERNELS_amd64= $(KERNELS_x86_64)
22 | KERNELS_x86_64= asm/countavx512_$(FLAGSIZE).S \
23 |     asm/countavx2_$(FLAGSIZE).S
24 | KERNELS_arm64= $(KERNELS_aarch64)
25 | KERNELS_aarch64= asm/countneon_$(FLAGSIZE).S
26 | 
27 | stream_benchmark: $(DEPS)  benchmark/linux/stream_benchmark.cpp
28 | 	$(CXX) $(CPPFLAGS) benchmark/linux/stream_benchmark.cpp -Iinclude -Ibenchmark/linux -o $@
29 | 
30 | instrumented_benchmark: $(DEPS) benchmark/linux/instrumented_benchmark.cpp
31 | 	$(CXX) $(CPPFLAGS) benchmark/linux/instrumented_benchmark.cpp $(KERNELS) -Iinclude -Ibenchmark/linux -o $@
32 | 
33 | instrumented_benchmark_align64: $(DEPS) benchmark/linux/instrumented_benchmark.cpp
34 | 	$(CXX) $(CPPFLAGS) benchmark/linux/instrumented_benchmark.cpp $(KERNELS) -DALIGN -Iinclude -Ibenchmark/linux -o $@
35 | 
36 | golike_benchmark: $(DEPS) benchmark/golike/benchmark.c
37 | 	$(CC) $(CFLAGS) -Iinclude -o $@ benchmark/golike/benchmark.c $(KERNELS)
38 | 
39 | golike_benchmark_align64: $(DEPS) benchmark/golike/benchmark.c
40 | 	$(CC) $(CFLAGS) -Iinclude -DALIGN -o $@ benchmark/golike/benchmark.c $(KERNELS)
41 | 
42 | clean:
43 | 	rm -f bench example instrumented_benchmark instrumented_benchmark_align64 golike_benchmark golike_benchmark_align64 stream_benchmark
44 | 
45 | .PHONY: all clean itest
46 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | 
 2 | ## Position population-count benchmarks
 3 | 
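For orientation: a positional population count tallies, for each of the 16 bit positions, how many input words have that bit set. The following is a minimal scalar sketch in C, not the `pospopcnt_u16_scalar` routine benchmarked here. The function name is hypothetical, and the 32-bit counter width is an assumption based on the Makefile defaulting `FLAGSIZE` to 32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (illustration only, not the repository's code).
 * After the call, counts[bit] has grown by the number of words in data[0..len)
 * whose bit number `bit` is set. The caller is expected to zero counts first.
 * The uint32_t counter type is an assumption. */
void pospopcnt_u16_reference(const uint16_t *data, size_t len, uint32_t counts[16]) {
    assert(data != NULL || len == 0);
    assert(counts != NULL);
    for (size_t i = 0; i < len; i++) {
        for (int bit = 0; bit < 16; bit++) {
            /* Isolate one bit of the word and add it to that position's counter. */
            counts[bit] += (data[i] >> bit) & 1u;
        }
    }
}
```

With the inputs `0x0001`, `0x0003` and `0x8001`, this leaves `counts[0] == 3`, `counts[1] == 1` and `counts[15] == 1`, with every other counter at zero.
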
 4 | ### Requirements
 5 | 
 6 | - Linux, with bare-metal access (you may need root)
 7 | - Make sure your processor is in performance mode (not powersaving)
 8 | - x64 processor supporting the AVX512BW extension
 9 | 
10 | ### Instructions
11 | 
12 | ```
13 | make
14 | ./instrumented_benchmark -v
15 | ```
16 | ### Sample output
17 | 
18 | On a Cannon Lake processor:
19 | 
20 | ```
21 | $ ./instrumented_benchmark -v
22 | n = 10000000 m = 1 
23 | iterations = 100 
24 | array size: 19.073 MB
25 | nothing                                         instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000 
26 | min:       65 cycles,       13 instructions,           1 branch mis.,        0 cache ref.,        0 cache mis.
27 | avg:     69.4 cycles,     13.0 instructions,         1.0 branch mis.,      0.1 cache ref.,      0.1 cache mis.
28 | 
29 | 
30 | pospopcnt_u16_scalar                            alignments: 16 
31 | instructions per cycle 3.75, cycles per 16-bit word:  17.325, instructions per 16-bit word 65.000 
32 | min: 173251245 cycles, 650000159 instructions,         3 branch mis.,   407959 cache ref.,   283659 cache mis.
33 | avg: 173473725.4 cycles, 650000160.2 instructions,           8.4 branch mis., 409996.0 cache ref., 295584.1 cache mis.
34 |  0.367 GB/s 
35 | estimated clock in range 3.102 GHz to 3.183 GHz
36 | 
37 | 
38 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
39 | instructions per cycle 0.46, cycles per 16-bit word:  0.497, instructions per 16-bit word 0.227 
40 | min:  4966068 cycles,  2271648 instructions,         114 branch mis.,   547590 cache ref.,   262188 cache mis.
41 | avg: 5296987.3 cycles, 2271648.7 instructions,     137.9 branch mis., 552796.2 cache ref., 275536.8 cache mis.
42 |  12.407 GB/s 
43 | estimated clock in range 2.916 GHz to 3.140 GHz
44 | 
45 | 
46 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
47 | instructions per cycle 0.60, cycles per 16-bit word:  0.538, instructions per 16-bit word 0.325 
48 | min:  5382323 cycles,  3245605 instructions,          82 branch mis.,   487390 cache ref.,   268386 cache mis.
49 | avg: 5697539.5 cycles, 3245605.8 instructions,     125.8 branch mis., 498338.1 cache ref., 279114.2 cache mis.
50 |  11.658 GB/s 
51 | estimated clock in range 2.985 GHz to 3.142 GHz
52 | 
53 | 
54 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
55 | instructions per cycle 0.84, cycles per 16-bit word:  0.618, instructions per 16-bit word 0.518 
56 | min:  6184692 cycles,  5181931 instructions,         114 branch mis.,   441161 cache ref.,   267157 cache mis.
57 | avg: 6661233.2 cycles, 5181931.2 instructions,     124.0 branch mis., 446684.9 cache ref., 280685.8 cache mis.
58 |  10.162 GB/s 
59 | estimated clock in range 2.956 GHz to 3.148 GHz
60 | ```
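
The per-word figures in the sample output are consistent with dividing the `min:` counters by `n`. For `pospopcnt_u16_scalar`, 173251245 cycles over 10000000 words gives about 17.325 cycles per 16-bit word, and 650000159 instructions over the same cycle count gives about 3.75 instructions per cycle. The sketch below reproduces that arithmetic. It is an assumption about how the numbers relate, not the benchmark's own reporting code, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: ratios derived from one "min:" line of the output. */
typedef struct {
    double cycles_per_word;        /* cycles / n */
    double instructions_per_word;  /* instructions / n */
    double instructions_per_cycle; /* instructions / cycles */
} derived_metrics;

/* Derive the per-word and per-cycle ratios from raw counter values.
 * `words` corresponds to n in the sample output (the number of 16-bit inputs). */
derived_metrics derive_metrics(uint64_t cycles, uint64_t instructions, uint64_t words) {
    assert(cycles > 0 && words > 0);
    derived_metrics m;
    m.cycles_per_word = (double)cycles / (double)words;
    m.instructions_per_word = (double)instructions / (double)words;
    m.instructions_per_cycle = (double)instructions / (double)cycles;
    return m;
}
```

Calling `derive_metrics(4966068, 2271648, 10000000)` with the `pospopcnt_u16_avx512bw_harvey_seal_1KB` counters gives roughly 0.497 cycles and 0.227 instructions per word, matching the values printed for that kernel.
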
61 | 
62 | ### Reference
63 | 
64 | - Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire, [Efficient computation of positional population counts using SIMD instructions](https://arxiv.org/abs/1911.02696), Concurrency and Computation: Practice and Experience 33 (17), 2021.
65 | 
66 | 


--------------------------------------------------------------------------------
/asm/countavx2_32.S:
--------------------------------------------------------------------------------
  1 | #include "kernelavx2.S"
  2 | 
  3 | # not yet ported to 32 bit flag size
  4 | 	.balign	16
  5 | accum8:
  6 | 	.type   accum8, @function
  7 | 	vpunpcklwd ymm12, ymm8, ymm7
  8 | 	vpunpckhwd ymm8, ymm8, ymm7
  9 | 	vpunpcklwd ymm14, ymm9, ymm7
 10 | 	vpunpckhwd ymm9, ymm9, ymm7
 11 | 	vpunpcklwd ymm4, ymm10, ymm7
 12 | 	vpunpckhwd ymm10, ymm10, ymm7
 13 | 	vpunpcklwd ymm5, ymm11, ymm7
 14 | 	vpunpckhwd ymm11, ymm11, ymm7
 15 | 	vpaddd  ymm12, ymm4, ymm12
 16 | 	vpaddd  ymm8, ymm10, ymm8
 17 | 	vpaddd  ymm14, ymm5, ymm14
 18 | 	vpaddd  ymm9, ymm11, ymm9
 19 | 	vpaddd  ymm12, ymm12, ymm14
 20 | 	vpaddd  ymm8, ymm8, ymm9
 21 | 	vperm2i128 ymm14, ymm12, ymm8, 0x20
 22 | 	vperm2i128 ymm4, ymm12, ymm8, 0x31
 23 | 	vpaddd  ymm12, ymm14, ymm4
 24 | 	vpermq  ymm12, ymm12, 0xD8
 25 | 	vpunpckhdq ymm14, ymm12, ymm7
 26 | 	vpunpckldq ymm12, ymm12, ymm7
 27 | 	vpaddq  ymm12, ymm12, [rdi]
 28 | 	vpaddq  ymm14, ymm14, [rdi+0x20]
 29 | 	vmovdqu [rdi], ymm12
 30 | 	vmovdqu [rdi+0x20], ymm14
 31 | 	ret
 32 | 
 33 | 	.balign	16
 34 | accum16:
 35 | 	.type   accum16, @function
 36 | 	vpunpcklwd ymm12, ymm8, ymm7
 37 | 	vpunpckhwd ymm8, ymm8, ymm7
 38 | 	vpunpcklwd ymm14, ymm9, ymm7
 39 | 	vpunpckhwd ymm9, ymm9, ymm7
 40 | 	vpunpcklwd ymm4, ymm10, ymm7
 41 | 	vpunpckhwd ymm10, ymm10, ymm7
 42 | 	vpunpcklwd ymm5, ymm11, ymm7
 43 | 	vpunpckhwd ymm11, ymm11, ymm7
 44 | 	vpaddd  ymm12, ymm4, ymm12	#  0- 3, 16-19
 45 | 	vpaddd  ymm8, ymm10, ymm8	#  4- 7, 20-23
 46 | 	vpaddd  ymm14, ymm5, ymm14	#  8-11, 24-27
 47 | 	vpaddd  ymm9, ymm11, ymm9	# 12-15, 28-31
 48 | 	vperm2i128 ymm4, ymm12, ymm8, 0x20
 49 | 	vperm2i128 ymm10, ymm12, ymm8, 0x31
 50 | 	vpaddd  ymm12, ymm10, ymm4
 51 | 	vperm2i128 ymm5, ymm14, ymm9, 0x20
 52 | 	vperm2i128 ymm11, ymm14, ymm9, 0x31
 53 | 	vpaddd  ymm4, ymm11, ymm5
 54 | 
 55 | 	vpaddd ymm12, ymm12, [rdi+0x00]
 56 | 	vpaddd ymm4, ymm4, [rdi+0x20]
 57 | 	vmovdqu [rdi+0x00], ymm12
 58 | 	vmovdqu [rdi+0x20], ymm4
 59 | 	ret
 60 | 
 61 | # not yet ported to 32 bit flag size
 62 | 	.balign	16
 63 | accum32:
 64 | 	.type   accum32, @function
 65 | 	vpunpcklwd ymm12, ymm8, ymm7
 66 | 	vpunpckhwd ymm8, ymm8, ymm7
 67 | 	vpunpcklwd ymm14, ymm9, ymm7
 68 | 	vpunpckhwd ymm9, ymm9, ymm7
 69 | 	vpunpcklwd ymm4, ymm10, ymm7
 70 | 	vpunpckhwd ymm10, ymm10, ymm7
 71 | 	vpunpcklwd ymm5, ymm11, ymm7
 72 | 	vpunpckhwd ymm11, ymm11, ymm7
 73 | 	vpaddd  ymm12, ymm4, ymm12
 74 | 	vpaddd  ymm8, ymm10, ymm8
 75 | 	vpaddd  ymm14, ymm5, ymm14
 76 | 	vpaddd  ymm9, ymm11, ymm9
 77 | 	vpermq  ymm12, ymm12, 0xD8
 78 | 	vpunpckhdq ymm4, ymm12, ymm7
 79 | 	vpunpckldq ymm12, ymm12, ymm7
 80 | 	vpaddq  ymm12, ymm12, [rdi]
 81 | 	vpaddq  ymm4, ymm4, [rdi+0x80]
 82 | 	vmovdqu [rdi], ymm12
 83 | 	vmovdqu [rdi+0x80], ymm4
 84 | 	vpermq  ymm8, ymm8, 0xD8
 85 | 	vpunpckhdq ymm4, ymm8, ymm7
 86 | 	vpunpckldq ymm8, ymm8, ymm7
 87 | 	vpaddq  ymm8, ymm8, [rdi+0x20]
 88 | 	vpaddq  ymm4, ymm4, [rdi+0xA0]
 89 | 	vmovdqu [rdi+0x20], ymm8
 90 | 	vmovdqu [rdi+0xA0], ymm4
 91 | 	vpermq  ymm14, ymm14, 0xD8
 92 | 	vpunpckhdq ymm4, ymm14, ymm7
 93 | 	vpunpckldq ymm14, ymm14, ymm7
 94 | 	vpaddq  ymm14, ymm14, [rdi+0x40]
 95 | 	vpaddq  ymm4, ymm4, [rdi+0xC0]
 96 | 	vmovdqu [rdi+0x40], ymm14
 97 | 	vmovdqu [rdi+0xC0], ymm4
 98 | 	vpermq  ymm9, ymm9, 0xD8
 99 | 	vpunpckhdq ymm4, ymm9, ymm7
100 | 	vpunpckldq ymm9, ymm9, ymm7
101 | 	vpaddq  ymm9, ymm9, [rdi+0x60]
102 | 	vpaddq  ymm4, ymm4, [rdi+0xE0]
103 | 	vmovdqu [rdi+0x60], ymm9
104 | 	vmovdqu [rdi+0xE0], ymm4
105 | 	ret
106 | 
107 | # not yet ported to 32 bit flag size
108 | 	.balign	16
109 | accum64:
110 | 	.type   accum64, @function
111 | 	vpunpcklwd ymm12, ymm8, ymm7
112 | 	vpunpckhwd ymm14, ymm8, ymm7
113 | 	vpermq  ymm12, ymm12, 0xD8
114 | 	vpunpckhdq ymm4, ymm12, ymm7
115 | 	vpunpckldq ymm12, ymm12, ymm7
116 | 	vpaddq  ymm12, ymm12, [rdi]
117 | 	vpaddq  ymm4, ymm4, [rdi+0x80]
118 | 	vmovdqu [rdi], ymm12
119 | 	vmovdqu [rdi+0x80], ymm4
120 | 	vpermq  ymm14, ymm14, 0xD8
121 | 	vpunpckhdq ymm4, ymm14, ymm7
122 | 	vpunpckldq ymm14, ymm14, ymm7
123 | 	vpaddq  ymm14, ymm14, [rdi+0x20]
124 | 	vpaddq  ymm4, ymm4, [rdi+0xA0]
125 | 	vmovdqu [rdi+0x20], ymm14
126 | 	vmovdqu [rdi+0xA0], ymm4
127 | 	vpunpcklwd ymm12, ymm9, ymm7
128 | 	vpunpckhwd ymm14, ymm9, ymm7
129 | 	vpermq  ymm12, ymm12, 0xD8
130 | 	vpunpckhdq ymm4, ymm12, ymm7
131 | 	vpunpckldq ymm12, ymm12, ymm7
132 | 	vpaddq  ymm12, ymm12, [rdi+0x40]
133 | 	vpaddq  ymm4, ymm4, [rdi+0xC0]
134 | 	vmovdqu [rdi+0x40], ymm12
135 | 	vmovdqu [rdi+0xC0], ymm4
136 | 	vpermq  ymm14, ymm14, 0xD8
137 | 	vpunpckhdq ymm4, ymm14, ymm7
138 | 	vpunpckldq ymm14, ymm14, ymm7
139 | 	vpaddq  ymm14, ymm14, [rdi+0x60]
140 | 	vpaddq  ymm4, ymm4, [rdi+0xE0]
141 | 	vmovdqu [rdi+0x60], ymm14
142 | 	vmovdqu [rdi+0xE0], ymm4
143 | 	vpunpcklwd ymm12, ymm10, ymm7
144 | 	vpunpckhwd ymm14, ymm10, ymm7
145 | 	vpermq  ymm12, ymm12, 0xD8
146 | 	vpunpckhdq ymm4, ymm12, ymm7
147 | 	vpunpckldq ymm12, ymm12, ymm7
148 | 	vpaddq  ymm12, ymm12, [rdi+0x100]
149 | 	vpaddq  ymm4, ymm4, [rdi+0x180]
150 | 	vmovdqu [rdi+0x100], ymm12
151 | 	vmovdqu [rdi+0x180], ymm4
152 | 	vpermq  ymm14, ymm14, 0xD8
153 | 	vpunpckhdq ymm4, ymm14, ymm7
154 | 	vpunpckldq ymm14, ymm14, ymm7
155 | 	vpaddq  ymm14, ymm14, [rdi+0x120]
156 | 	vpaddq  ymm4, ymm4, [rdi+0x1A0]
157 | 	vmovdqu [rdi+0x120], ymm14
158 | 	vmovdqu [rdi+0x1A0], ymm4
159 | 	vpunpcklwd ymm12, ymm11, ymm7
160 | 	vpunpckhwd ymm14, ymm11, ymm7
161 | 	vpermq  ymm12, ymm12, 0xD8
162 | 	vpunpckhdq ymm4, ymm12, ymm7
163 | 	vpunpckldq ymm12, ymm12, ymm7
164 | 	vpaddq  ymm12, ymm12, [rdi+0x140]
165 | 	vpaddq  ymm4, ymm4, [rdi+0x1C0]
166 | 	vmovdqu [rdi+0x140], ymm12
167 | 	vmovdqu [rdi+0x1C0], ymm4
168 | 	vpermq  ymm14, ymm14, 0xD8
169 | 	vpunpckhdq ymm4, ymm14, ymm7
170 | 	vpunpckldq ymm14, ymm14, ymm7
171 | 	vpaddq  ymm14, ymm14, [rdi+0x160]
172 | 	vpaddq  ymm4, ymm4, [rdi+0x1E0]
173 | 	vmovdqu [rdi+0x160], ymm14
174 | 	vmovdqu [rdi+0x1E0], ymm4
175 | 	ret
176 | 


--------------------------------------------------------------------------------
/asm/countavx2_64.S:
--------------------------------------------------------------------------------
  1 | #include "kernelavx2.S"
  2 | 
  3 | 	.balign	16
  4 | accum8:
  5 | 	.type   accum8, @function
  6 | 	vpunpcklwd ymm12, ymm8, ymm7
  7 | 	vpunpckhwd ymm8, ymm8, ymm7
  8 | 	vpunpcklwd ymm14, ymm9, ymm7
  9 | 	vpunpckhwd ymm9, ymm9, ymm7
 10 | 	vpunpcklwd ymm4, ymm10, ymm7
 11 | 	vpunpckhwd ymm10, ymm10, ymm7
 12 | 	vpunpcklwd ymm5, ymm11, ymm7
 13 | 	vpunpckhwd ymm11, ymm11, ymm7
 14 | 	vpaddd  ymm12, ymm4, ymm12
 15 | 	vpaddd  ymm8, ymm10, ymm8
 16 | 	vpaddd  ymm14, ymm5, ymm14
 17 | 	vpaddd  ymm9, ymm11, ymm9
 18 | 	vpaddd  ymm12, ymm12, ymm14
 19 | 	vpaddd  ymm8, ymm8, ymm9
 20 | 	vperm2i128 ymm14, ymm12, ymm8, 0x20
 21 | 	vperm2i128 ymm4, ymm12, ymm8, 0x31
 22 | 	vpaddd  ymm12, ymm14, ymm4
 23 | 	vpermq  ymm12, ymm12, 0xD8
 24 | 	vpunpckhdq ymm14, ymm12, ymm7
 25 | 	vpunpckldq ymm12, ymm12, ymm7
 26 | 	vpaddq  ymm12, ymm12, [rdi]
 27 | 	vpaddq  ymm14, ymm14, [rdi+0x20]
 28 | 	vmovdqu [rdi], ymm12
 29 | 	vmovdqu [rdi+0x20], ymm14
 30 | 	ret
 31 | 
 32 | 	.balign	16
 33 | accum16:
 34 | 	.type   accum16, @function
 35 | 	vpunpcklwd ymm12, ymm8, ymm7
 36 | 	vpunpckhwd ymm8, ymm8, ymm7
 37 | 	vpunpcklwd ymm14, ymm9, ymm7
 38 | 	vpunpckhwd ymm9, ymm9, ymm7
 39 | 	vpunpcklwd ymm4, ymm10, ymm7
 40 | 	vpunpckhwd ymm10, ymm10, ymm7
 41 | 	vpunpcklwd ymm5, ymm11, ymm7
 42 | 	vpunpckhwd ymm11, ymm11, ymm7
 43 | 	vpaddd  ymm12, ymm4, ymm12
 44 | 	vpaddd  ymm8, ymm10, ymm8
 45 | 	vpaddd  ymm14, ymm5, ymm14
 46 | 	vpaddd  ymm9, ymm11, ymm9
 47 | 	vperm2i128 ymm4, ymm12, ymm8, 0x20
 48 | 	vperm2i128 ymm10, ymm12, ymm8, 0x31
 49 | 	vpaddd  ymm12, ymm10, ymm4
 50 | 	vperm2i128 ymm5, ymm14, ymm9, 0x20
 51 | 	vperm2i128 ymm11, ymm14, ymm9, 0x31
 52 | 	vpaddd  ymm4, ymm11, ymm5
 53 | 	vpermq  ymm12, ymm12, 0xD8
 54 | 	vpunpckhdq ymm14, ymm12, ymm7
 55 | 	vpunpckldq ymm12, ymm12, ymm7
 56 | 	vpaddq  ymm12, ymm12, [rdi]
 57 | 	vpaddq  ymm14, ymm14, [rdi+0x20]
 58 | 	vmovdqu [rdi], ymm12
 59 | 	vmovdqu [rdi+0x20], ymm14
 60 | 	vpermq  ymm4, ymm4, 0xD8
 61 | 	vpunpckhdq ymm5, ymm4, ymm7
 62 | 	vpunpckldq ymm4, ymm4, ymm7
 63 | 	vpaddq  ymm4, ymm4, [rdi+0x40]
 64 | 	vpaddq  ymm5, ymm5, [rdi+0x60]
 65 | 	vmovdqu [rdi+0x40], ymm4
 66 | 	vmovdqu [rdi+0x60], ymm5
 67 | 	ret
 68 | 
 69 | 	.balign	16
 70 | accum32:
 71 | 	.type   accum32, @function
 72 | 	vpunpcklwd ymm12, ymm8, ymm7
 73 | 	vpunpckhwd ymm8, ymm8, ymm7
 74 | 	vpunpcklwd ymm14, ymm9, ymm7
 75 | 	vpunpckhwd ymm9, ymm9, ymm7
 76 | 	vpunpcklwd ymm4, ymm10, ymm7
 77 | 	vpunpckhwd ymm10, ymm10, ymm7
 78 | 	vpunpcklwd ymm5, ymm11, ymm7
 79 | 	vpunpckhwd ymm11, ymm11, ymm7
 80 | 	vpaddd  ymm12, ymm4, ymm12
 81 | 	vpaddd  ymm8, ymm10, ymm8
 82 | 	vpaddd  ymm14, ymm5, ymm14
 83 | 	vpaddd  ymm9, ymm11, ymm9
 84 | 	vpermq  ymm12, ymm12, 0xD8
 85 | 	vpunpckhdq ymm4, ymm12, ymm7
 86 | 	vpunpckldq ymm12, ymm12, ymm7
 87 | 	vpaddq  ymm12, ymm12, [rdi]
 88 | 	vpaddq  ymm4, ymm4, [rdi+0x80]
 89 | 	vmovdqu [rdi], ymm12
 90 | 	vmovdqu [rdi+0x80], ymm4
 91 | 	vpermq  ymm8, ymm8, 0xD8
 92 | 	vpunpckhdq ymm4, ymm8, ymm7
 93 | 	vpunpckldq ymm8, ymm8, ymm7
 94 | 	vpaddq  ymm8, ymm8, [rdi+0x20]
 95 | 	vpaddq  ymm4, ymm4, [rdi+0xA0]
 96 | 	vmovdqu [rdi+0x20], ymm8
 97 | 	vmovdqu [rdi+0xA0], ymm4
 98 | 	vpermq  ymm14, ymm14, 0xD8
 99 | 	vpunpckhdq ymm4, ymm14, ymm7
100 | 	vpunpckldq ymm14, ymm14, ymm7
101 | 	vpaddq  ymm14, ymm14, [rdi+0x40]
102 | 	vpaddq  ymm4, ymm4, [rdi+0xC0]
103 | 	vmovdqu [rdi+0x40], ymm14
104 | 	vmovdqu [rdi+0xC0], ymm4
105 | 	vpermq  ymm9, ymm9, 0xD8
106 | 	vpunpckhdq ymm4, ymm9, ymm7
107 | 	vpunpckldq ymm9, ymm9, ymm7
108 | 	vpaddq  ymm9, ymm9, [rdi+0x60]
109 | 	vpaddq  ymm4, ymm4, [rdi+0xE0]
110 | 	vmovdqu [rdi+0x60], ymm9
111 | 	vmovdqu [rdi+0xE0], ymm4
112 | 	ret
113 | 
114 | 	.balign	16
115 | accum64:
116 | 	.type   accum64, @function
117 | 	vpunpcklwd ymm12, ymm8, ymm7
118 | 	vpunpckhwd ymm14, ymm8, ymm7
119 | 	vpermq  ymm12, ymm12, 0xD8
120 | 	vpunpckhdq ymm4, ymm12, ymm7
121 | 	vpunpckldq ymm12, ymm12, ymm7
122 | 	vpaddq  ymm12, ymm12, [rdi]
123 | 	vpaddq  ymm4, ymm4, [rdi+0x80]
124 | 	vmovdqu [rdi], ymm12
125 | 	vmovdqu [rdi+0x80], ymm4
126 | 	vpermq  ymm14, ymm14, 0xD8
127 | 	vpunpckhdq ymm4, ymm14, ymm7
128 | 	vpunpckldq ymm14, ymm14, ymm7
129 | 	vpaddq  ymm14, ymm14, [rdi+0x20]
130 | 	vpaddq  ymm4, ymm4, [rdi+0xA0]
131 | 	vmovdqu [rdi+0x20], ymm14
132 | 	vmovdqu [rdi+0xA0], ymm4
133 | 	vpunpcklwd ymm12, ymm9, ymm7
134 | 	vpunpckhwd ymm14, ymm9, ymm7
135 | 	vpermq  ymm12, ymm12, 0xD8
136 | 	vpunpckhdq ymm4, ymm12, ymm7
137 | 	vpunpckldq ymm12, ymm12, ymm7
138 | 	vpaddq  ymm12, ymm12, [rdi+0x40]
139 | 	vpaddq  ymm4, ymm4, [rdi+0xC0]
140 | 	vmovdqu [rdi+0x40], ymm12
141 | 	vmovdqu [rdi+0xC0], ymm4
142 | 	vpermq  ymm14, ymm14, 0xD8
143 | 	vpunpckhdq ymm4, ymm14, ymm7
144 | 	vpunpckldq ymm14, ymm14, ymm7
145 | 	vpaddq  ymm14, ymm14, [rdi+0x60]
146 | 	vpaddq  ymm4, ymm4, [rdi+0xE0]
147 | 	vmovdqu [rdi+0x60], ymm14
148 | 	vmovdqu [rdi+0xE0], ymm4
149 | 	vpunpcklwd ymm12, ymm10, ymm7
150 | 	vpunpckhwd ymm14, ymm10, ymm7
151 | 	vpermq  ymm12, ymm12, 0xD8
152 | 	vpunpckhdq ymm4, ymm12, ymm7
153 | 	vpunpckldq ymm12, ymm12, ymm7
154 | 	vpaddq  ymm12, ymm12, [rdi+0x100]
155 | 	vpaddq  ymm4, ymm4, [rdi+0x180]
156 | 	vmovdqu [rdi+0x100], ymm12
157 | 	vmovdqu [rdi+0x180], ymm4
158 | 	vpermq  ymm14, ymm14, 0xD8
159 | 	vpunpckhdq ymm4, ymm14, ymm7
160 | 	vpunpckldq ymm14, ymm14, ymm7
161 | 	vpaddq  ymm14, ymm14, [rdi+0x120]
162 | 	vpaddq  ymm4, ymm4, [rdi+0x1A0]
163 | 	vmovdqu [rdi+0x120], ymm14
164 | 	vmovdqu [rdi+0x1A0], ymm4
165 | 	vpunpcklwd ymm12, ymm11, ymm7
166 | 	vpunpckhwd ymm14, ymm11, ymm7
167 | 	vpermq  ymm12, ymm12, 0xD8
168 | 	vpunpckhdq ymm4, ymm12, ymm7
169 | 	vpunpckldq ymm12, ymm12, ymm7
170 | 	vpaddq  ymm12, ymm12, [rdi+0x140]
171 | 	vpaddq  ymm4, ymm4, [rdi+0x1C0]
172 | 	vmovdqu [rdi+0x140], ymm12
173 | 	vmovdqu [rdi+0x1C0], ymm4
174 | 	vpermq  ymm14, ymm14, 0xD8
175 | 	vpunpckhdq ymm4, ymm14, ymm7
176 | 	vpunpckldq ymm14, ymm14, ymm7
177 | 	vpaddq  ymm14, ymm14, [rdi+0x160]
178 | 	vpaddq  ymm4, ymm4, [rdi+0x1E0]
179 | 	vmovdqu [rdi+0x160], ymm14
180 | 	vmovdqu [rdi+0x1E0], ymm4
181 | 	ret
182 | 


--------------------------------------------------------------------------------
/asm/countavx512_32.S:
--------------------------------------------------------------------------------
  1 | #include "kernelavx512.S"
  2 | 
  3 | # not ported to uint32_t flags yet
  4 | 	.balign 16
  5 | accum8:
  6 | 	.type   accum8, @function
  7 | 	vpmovzxwq zmm0, xmm8
  8 | 	vextracti64x2 xmm1, zmm8, 0x01
  9 | 	vpmovzxwq zmm1, xmm1
 10 | 	vextracti64x2 xmm2, zmm8, 0x02
 11 | 	vpmovzxwq zmm2, xmm2
 12 | 	vextracti64x2 xmm3, zmm8, 0x03
 13 | 	vpmovzxwq zmm3, xmm3
 14 | 	vpmovzxwq zmm4, xmm9
 15 | 	vextracti64x2 xmm5, zmm9, 0x01
 16 | 	vpmovzxwq zmm5, xmm5
 17 | 	vextracti64x2 xmm6, zmm9, 0x02
 18 | 	vpmovzxwq zmm6, xmm6
 19 | 	vextracti64x2 xmm7, zmm9, 0x03
 20 | 	vpmovzxwq zmm7, xmm7
 21 | 	vpaddq  zmm0, zmm0, zmm2
 22 | 	vpaddq  zmm1, zmm1, zmm3
 23 | 	vpaddq  zmm4, zmm4, zmm6
 24 | 	vpaddq  zmm5, zmm5, zmm7
 25 | 	vpaddq  zmm0, zmm0, zmm1
 26 | 	vpaddq  zmm4, zmm4, zmm5
 27 | 	vpaddq  zmm0, zmm0, zmm4
 28 | 	vpaddq  zmm0, zmm0, [rdi]
 29 | 	vmovdqu64 [rdi], zmm0
 30 | 	ret
 31 | 
 32 | 	.balign 16
 33 | accum16:
 34 | 	.type   accum16, @function
 35 | 	vpmovzxwd zmm10, ymm8
 36 | 	vextracti64x4 ymm12, zmm8, 0x01
 37 | 	vpmovzxwd zmm12, ymm12
 38 | 	vpmovzxwd zmm14, ymm9
 39 | 	vextracti64x4 ymm16, zmm9, 0x01
 40 | 	vpmovzxwd zmm16, ymm16
 41 | 	vpaddd  zmm10, zmm10, zmm12
 42 | 	vpaddd  zmm14, zmm14, zmm16
 43 | 	vshufi64x2 zmm11, zmm10, zmm14, 0x44
 44 | 	vshufi64x2 zmm15, zmm10, zmm14, 0xEE
 45 | 	vpaddd  zmm11, zmm11, zmm15
 46 | 	vpaddd  zmm11, zmm11, [rdi]
 47 | 	vmovdqu64 [rdi], zmm11
 48 | 	ret
 49 | 
 50 | # accum32 is not ported to uint32_t flags yet; it still accumulates into 64-bit counters
 51 | 	.balign 16
 52 | accum32:
 53 | 	.type   accum32, @function
 54 | 	vextracti64x2 xmm2, zmm8, 0x02
 55 | 	vextracti64x2 xmm3, zmm9, 0x02
 56 | 	vpmovzxwq zmm0, xmm8
 57 | 	vpmovzxwq zmm1, xmm9
 58 | 	vpmovzxwq zmm2, xmm2
 59 | 	vpmovzxwq zmm3, xmm3
 60 | 	vpaddq  zmm0, zmm0, zmm2
 61 | 	vpaddq  zmm1, zmm1, zmm3
 62 | 	vpaddq  zmm0, zmm0, [rdi]
 63 | 	vpaddq  zmm1, zmm1, [rdi+0x1*0x40]
 64 | 	vmovdqu64 [rdi], zmm0
 65 | 	vmovdqu64 [rdi+0x1*0x40], zmm1
 66 | 	vextracti64x2 xmm0, zmm8, 0x01
 67 | 	vextracti64x2 xmm1, zmm9, 0x01
 68 | 	vextracti64x2 xmm2, zmm8, 0x03
 69 | 	vextracti64x2 xmm3, zmm9, 0x03
 70 | 	vpmovzxwq zmm0, xmm0
 71 | 	vpmovzxwq zmm1, xmm1
 72 | 	vpmovzxwq zmm2, xmm2
 73 | 	vpmovzxwq zmm3, xmm3
 74 | 	vpaddq  zmm0, zmm0, zmm2
 75 | 	vpaddq  zmm1, zmm1, zmm3
 76 | 	vpaddq  zmm0, zmm0, [rdi+0x2*0x40]
 77 | 	vpaddq  zmm1, zmm1, [rdi+0x3*0x40]
 78 | 	vmovdqu64 [rdi+0x2*0x40], zmm0
 79 | 	vmovdqu64 [rdi+0x3*0x40], zmm1
 80 | 	ret
 81 | 
 82 | # accum64 is not ported to uint32_t flags yet; it still accumulates into 64-bit counters
 83 | 	.balign 16
 84 | accum64:
 85 | 	.type   accum64, @function
 86 | 	vpmovzxwq zmm3, xmm8
 87 | 	vpmovzxwq zmm4, xmm9
 88 | 	vpaddq  zmm3, zmm3, [rdi]
 89 | 	vpaddq  zmm4, zmm4, [rdi+0x1*0x40]
 90 | 	vmovdqu64 [rdi], zmm3
 91 | 	vmovdqu64 [rdi+0x1*0x40], zmm4
 92 | 	vextracti64x2 xmm3, zmm8, 0x01
 93 | 	vextracti64x2 xmm4, zmm9, 0x01
 94 | 	vpmovzxwq zmm3, xmm3
 95 | 	vpmovzxwq zmm4, xmm4
 96 | 	vpaddq  zmm3, zmm3, [rdi+0x2*0x40]
 97 | 	vpaddq  zmm4, zmm4, [rdi+0x3*0x40]
 98 | 	vmovdqu64 [rdi+0x2*0x40], zmm3
 99 | 	vmovdqu64 [rdi+0x3*0x40], zmm4
100 | 	vextracti64x2 xmm3, zmm8, 0x02
101 | 	vextracti64x2 xmm4, zmm9, 0x02
102 | 	vpmovzxwq zmm3, xmm3
103 | 	vpmovzxwq zmm4, xmm4
104 | 	vpaddq  zmm3, zmm3, [rdi+0x4*0x40]
105 | 	vpaddq  zmm4, zmm4, [rdi+0x5*0x40]
106 | 	vmovdqu64 [rdi+0x4*0x40], zmm3
107 | 	vmovdqu64 [rdi+0x5*0x40], zmm4
108 | 	vextracti64x2 xmm3, zmm8, 0x03
109 | 	vextracti64x2 xmm4, zmm9, 0x03
110 | 	vpmovzxwq zmm3, xmm3
111 | 	vpmovzxwq zmm4, xmm4
112 | 	vpaddq  zmm3, zmm3, [rdi+0x6*0x40]
113 | 	vpaddq  zmm4, zmm4, [rdi+0x7*0x40]
114 | 	vmovdqu64 [rdi+0x6*0x40], zmm3
115 | 	vmovdqu64 [rdi+0x7*0x40], zmm4
116 | 	ret
117 | 
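
For orientation, the quantity every kernel in this directory computes can be written as a plain scalar loop. The sketch below is a hypothetical reference, not part of the repository: the name `pospopcnt16_ref` and its signature are assumptions. It covers 16-bit input words with `uint32_t` counters, the one combination the `accum16` routine above already handles with 32-bit adds (`vpaddd`). It may be useful as an oracle when testing the assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference (not part of the repository):
 * flags[i] is incremented once for every input word that has bit i set. */
static void pospopcnt16_ref(const uint16_t *data, size_t len, uint32_t flags[16])
{
    for (size_t n = 0; n < len; n++)
        for (int i = 0; i < 16; i++)
            flags[i] += (data[n] >> i) & 1u;
}
```

The caller is expected to zero `flags` before the first call; like the accumulators above, the sketch adds to whatever the array already holds.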


--------------------------------------------------------------------------------
/asm/countavx512_64.S:
--------------------------------------------------------------------------------
  1 | #include "kernelavx512.S"
  2 | 
  3 | 	.balign 16
  4 | accum8:
  5 | 	.type   accum8, @function
  6 | 	vpmovzxwq zmm0, xmm8
  7 | 	vextracti64x2 xmm1, zmm8, 0x01
  8 | 	vpmovzxwq zmm1, xmm1
  9 | 	vextracti64x2 xmm2, zmm8, 0x02
 10 | 	vpmovzxwq zmm2, xmm2
 11 | 	vextracti64x2 xmm3, zmm8, 0x03
 12 | 	vpmovzxwq zmm3, xmm3
 13 | 	vpmovzxwq zmm4, xmm9
 14 | 	vextracti64x2 xmm5, zmm9, 0x01
 15 | 	vpmovzxwq zmm5, xmm5
 16 | 	vextracti64x2 xmm6, zmm9, 0x02
 17 | 	vpmovzxwq zmm6, xmm6
 18 | 	vextracti64x2 xmm7, zmm9, 0x03
 19 | 	vpmovzxwq zmm7, xmm7
 20 | 	vpaddq  zmm0, zmm0, zmm2
 21 | 	vpaddq  zmm1, zmm1, zmm3
 22 | 	vpaddq  zmm4, zmm4, zmm6
 23 | 	vpaddq  zmm5, zmm5, zmm7
 24 | 	vpaddq  zmm0, zmm0, zmm1
 25 | 	vpaddq  zmm4, zmm4, zmm5
 26 | 	vpaddq  zmm0, zmm0, zmm4
 27 | 	vpaddq  zmm0, zmm0, [rdi]
 28 | 	vmovdqu64 [rdi], zmm0
 29 | 	ret
 30 | 
 31 | 	.balign 16
 32 | accum16:
 33 | 	.type   accum16, @function
 34 | 	vpmovzxwq zmm10, xmm8
 35 | 	vextracti64x2 xmm11, zmm8, 0x01
 36 | 	vpmovzxwq zmm11, xmm11
 37 | 	vextracti64x2 xmm12, zmm8, 0x02
 38 | 	vpmovzxwq zmm12, xmm12
 39 | 	vextracti64x2 xmm13, zmm8, 0x03
 40 | 	vpmovzxwq zmm13, xmm13
 41 | 	vpmovzxwq zmm14, xmm9
 42 | 	vextracti64x2 xmm15, zmm9, 0x01
 43 | 	vpmovzxwq zmm15, xmm15
 44 | 	vextracti64x2 xmm16, zmm9, 0x02
 45 | 	vpmovzxwq zmm16, xmm16
 46 | 	vextracti64x2 xmm17, zmm9, 0x03
 47 | 	vpmovzxwq zmm17, xmm17
 48 | 	vpaddq  zmm10, zmm10, zmm12
 49 | 	vpaddq  zmm11, zmm11, zmm13
 50 | 	vpaddq  zmm14, zmm14, zmm16
 51 | 	vpaddq  zmm15, zmm15, zmm17
 52 | 	vpaddq  zmm10, zmm10, zmm11
 53 | 	vpaddq  zmm14, zmm14, zmm15
 54 | 	vpaddq  zmm10, zmm10, [rdi]
 55 | 	vpaddq  zmm14, zmm14, [rdi+0x1*0x40]
 56 | 	vmovdqu64 [rdi], zmm10
 57 | 	vmovdqu64 [rdi+0x1*0x40], zmm14
 58 | 	ret
 59 | 
 60 | 	.balign 16
 61 | accum32:
 62 | 	.type   accum32, @function
 63 | 	vextracti64x2 xmm2, zmm8, 0x02
 64 | 	vextracti64x2 xmm3, zmm9, 0x02
 65 | 	vpmovzxwq zmm0, xmm8
 66 | 	vpmovzxwq zmm1, xmm9
 67 | 	vpmovzxwq zmm2, xmm2
 68 | 	vpmovzxwq zmm3, xmm3
 69 | 	vpaddq  zmm0, zmm0, zmm2
 70 | 	vpaddq  zmm1, zmm1, zmm3
 71 | 	vpaddq  zmm0, zmm0, [rdi]
 72 | 	vpaddq  zmm1, zmm1, [rdi+0x1*0x40]
 73 | 	vmovdqu64 [rdi], zmm0
 74 | 	vmovdqu64 [rdi+0x1*0x40], zmm1
 75 | 	vextracti64x2 xmm0, zmm8, 0x01
 76 | 	vextracti64x2 xmm1, zmm9, 0x01
 77 | 	vextracti64x2 xmm2, zmm8, 0x03
 78 | 	vextracti64x2 xmm3, zmm9, 0x03
 79 | 	vpmovzxwq zmm0, xmm0
 80 | 	vpmovzxwq zmm1, xmm1
 81 | 	vpmovzxwq zmm2, xmm2
 82 | 	vpmovzxwq zmm3, xmm3
 83 | 	vpaddq  zmm0, zmm0, zmm2
 84 | 	vpaddq  zmm1, zmm1, zmm3
 85 | 	vpaddq  zmm0, zmm0, [rdi+0x2*0x40]
 86 | 	vpaddq  zmm1, zmm1, [rdi+0x3*0x40]
 87 | 	vmovdqu64 [rdi+0x2*0x40], zmm0
 88 | 	vmovdqu64 [rdi+0x3*0x40], zmm1
 89 | 	ret
 90 | 
 91 | 	.balign 16
 92 | accum64:
 93 | 	.type   accum64, @function
 94 | 	vpmovzxwq zmm3, xmm8
 95 | 	vpmovzxwq zmm4, xmm9
 96 | 	vpaddq  zmm3, zmm3, [rdi]
 97 | 	vpaddq  zmm4, zmm4, [rdi+0x1*0x40]
 98 | 	vmovdqu64 [rdi], zmm3
 99 | 	vmovdqu64 [rdi+0x1*0x40], zmm4
100 | 	vextracti64x2 xmm3, zmm8, 0x01
101 | 	vextracti64x2 xmm4, zmm9, 0x01
102 | 	vpmovzxwq zmm3, xmm3
103 | 	vpmovzxwq zmm4, xmm4
104 | 	vpaddq  zmm3, zmm3, [rdi+0x2*0x40]
105 | 	vpaddq  zmm4, zmm4, [rdi+0x3*0x40]
106 | 	vmovdqu64 [rdi+0x2*0x40], zmm3
107 | 	vmovdqu64 [rdi+0x3*0x40], zmm4
108 | 	vextracti64x2 xmm3, zmm8, 0x02
109 | 	vextracti64x2 xmm4, zmm9, 0x02
110 | 	vpmovzxwq zmm3, xmm3
111 | 	vpmovzxwq zmm4, xmm4
112 | 	vpaddq  zmm3, zmm3, [rdi+0x4*0x40]
113 | 	vpaddq  zmm4, zmm4, [rdi+0x5*0x40]
114 | 	vmovdqu64 [rdi+0x4*0x40], zmm3
115 | 	vmovdqu64 [rdi+0x5*0x40], zmm4
116 | 	vextracti64x2 xmm3, zmm8, 0x03
117 | 	vextracti64x2 xmm4, zmm9, 0x03
118 | 	vpmovzxwq zmm3, xmm3
119 | 	vpmovzxwq zmm4, xmm4
120 | 	vpaddq  zmm3, zmm3, [rdi+0x6*0x40]
121 | 	vpaddq  zmm4, zmm4, [rdi+0x7*0x40]
122 | 	vmovdqu64 [rdi+0x6*0x40], zmm3
123 | 	vmovdqu64 [rdi+0x7*0x40], zmm4
124 | 	ret
125 | 
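
The accumulators above widen the kernel's 16-bit partial counters and add them to the caller's counter array. Below is a minimal C sketch of `accum64` and `accum8` as read from the listing. The function names and the `lo`/`hi` arrays, which stand in for the 32 words held in `zmm8` and `zmm9`, are illustrative assumptions.

```c
#include <stdint.h>

/* Sketch of accum64: lo[] and hi[] model the 32 16-bit words of zmm8 and zmm9.
 * 128-bit lane k of zmm8 feeds counts[16k .. 16k+7],
 * 128-bit lane k of zmm9 feeds counts[16k+8 .. 16k+15]. */
static void accum64_ref(uint64_t counts[64],
                        const uint16_t lo[32], const uint16_t hi[32])
{
    for (int k = 0; k < 4; k++)
        for (int i = 0; i < 8; i++) {
            counts[16 * k + i]     += lo[8 * k + i];
            counts[16 * k + 8 + i] += hi[8 * k + i];
        }
}

/* Sketch of accum8: all 64 partial counters fold into 8 totals,
 * selected by the word index modulo 8. */
static void accum8_ref(uint64_t counts[8],
                       const uint16_t lo[32], const uint16_t hi[32])
{
    for (int j = 0; j < 32; j++) {
        counts[j % 8] += lo[j];
        counts[j % 8] += hi[j];
    }
}
```

The widening matters because the partial counters are only 16 bits wide. The kernels appear to track the remaining headroom in `eax` (initialised to 65535) and call the accumulator through `rbx` before the counters can overflow.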


--------------------------------------------------------------------------------
/asm/countneon_32.S:
--------------------------------------------------------------------------------
  1 | #include "kernelneon.S"
  2 | 
  3 | // accum8 is not ported to 32 bit flags yet; it still accumulates into 64-bit (.2d) counters
  4 | accum8:
  5 | 	ld1	{v4.2d-v7.2d}, [x2]
  6 | 	uaddl	v16.4s, v10.4h, v8.4h
  7 | 	uaddl2	v17.4s, v10.8h, v8.8h
  8 | 	uaddl	v18.4s, v11.4h, v9.4h
  9 | 	uaddl2	v19.4s, v11.8h, v9.8h
 10 | 	uaddl	v20.4s, v14.4h, v12.4h
 11 | 	uaddl2	v21.4s, v14.8h, v12.8h
 12 | 	uaddl	v22.4s, v15.4h, v13.4h
 13 | 	uaddl2	v23.4s, v15.8h, v13.8h
 14 | 	add	v16.4s, v16.4s, v18.4s
 15 | 	add	v17.4s, v17.4s, v19.4s
 16 | 	add	v20.4s, v20.4s, v22.4s
 17 | 	add	v21.4s, v21.4s, v23.4s
 18 | 	add	v16.4s, v16.4s, v20.4s
 19 | 	add	v17.4s, v17.4s, v21.4s
 20 | 	uaddw	v4.2d, v4.2d, v16.2s
 21 | 	uaddw2	v5.2d, v5.2d, v16.4s
 22 | 	uaddw	v6.2d, v6.2d, v17.2s
 23 | 	uaddw2	v7.2d, v7.2d, v17.4s
 24 | 	st1	{v4.2d-v7.2d}, [x2]
 25 | 	ret
 26 | 
 27 | accum16:
 28 | 	ld1	{v4.4s-v7.4s}, [x2]
 29 | 	uaddl	v16.4s, v10.4h, v8.4h
 30 | 	uaddl2	v17.4s, v10.8h, v8.8h
 31 | 	uaddl	v18.4s, v11.4h, v9.4h
 32 | 	uaddl2	v19.4s, v11.8h, v9.8h
 33 | 	uaddl	v20.4s, v14.4h, v12.4h
 34 | 	uaddl2	v21.4s, v14.8h, v12.8h
 35 | 	uaddl	v22.4s, v15.4h, v13.4h
 36 | 	uaddl2	v23.4s, v15.8h, v13.8h
 37 | 	add	v16.4s, v16.4s, v20.4s
 38 | 	add	v17.4s, v17.4s, v21.4s
 39 | 	add	v18.4s, v18.4s, v22.4s
 40 | 	add	v19.4s, v19.4s, v23.4s
 41 | 	add	v4.4s, v4.4s, v16.4s
 42 | 	add	v5.4s, v5.4s, v17.4s
 43 | 	add	v6.4s, v6.4s, v18.4s
 44 | 	add	v7.4s, v7.4s, v19.4s
 45 | 	st1	{v4.4s-v7.4s}, [x2]
 46 | 	ret
 47 | 
 48 | // accum32 is not ported to 32 bit flags yet; it still accumulates into 64-bit (.2d) counters
 49 | accum32:
 50 | 	mov	x7, x2
 51 | 	mov	x8, x2
 52 | 	mov	x9, #2
 53 | 
 54 | 0:	ld1	{v20.2d-v23.2d}, [x7], #64
 55 | 	ld1	{v4.2d-v7.2d}, [x7], #64
 56 | 	sub	x9, x9, #0x1
 57 | 	uaddl	v16.4s, v12.4h, v8.4h
 58 | 	uaddl2	v17.4s, v12.8h, v8.8h
 59 | 	uaddl	v18.4s, v13.4h, v9.4h
 60 | 	uaddl2	v19.4s, v13.8h, v9.8h
 61 | 	mov	v8.16b, v10.16b
 62 | 	mov	v9.16b, v11.16b
 63 | 	mov	v12.16b, v14.16b
 64 | 	mov	v13.16b, v15.16b
 65 | 	uaddw	v20.2d, v20.2d, v16.2s
 66 | 	uaddw2	v21.2d, v21.2d, v16.4s
 67 | 	uaddw	v22.2d, v22.2d, v17.2s
 68 | 	uaddw2	v23.2d, v23.2d, v17.4s
 69 | 	uaddw	v4.2d, v4.2d, v18.2s
 70 | 	uaddw2	v5.2d, v5.2d, v18.4s
 71 | 	uaddw	v6.2d, v6.2d, v19.2s
 72 | 	uaddw2	v7.2d, v7.2d, v19.4s
 73 | 	st1	{v20.2d-v23.2d}, [x8], #64
 74 | 	st1	{v4.2d-v7.2d}, [x8], #64
 75 | 	cbnz	x9, 0b
 76 | 	ret
 77 | 
 78 | // accum64 is not ported to 32 bit flags yet; it still accumulates into 64-bit (.2d) counters
 79 | accum64:
 80 | 	mov	x7, x2
 81 | 	mov	x8, x2
 82 | 	mov	x9, #4
 83 | 
 84 | 0:	ld1	{v20.2d-v23.2d}, [x7], #64
 85 | 	ld1	{v4.2d-v7.2d}, [x7], #64
 86 | 	sub	x9, x9, #0x1
 87 | 	uxtl	v16.4s, v8.4h
 88 | 	uxtl2	v17.4s, v8.8h
 89 | 	uxtl	v18.4s, v9.4h
 90 | 	uxtl2	v19.4s, v9.8h
 91 | 	mov	v8.16b, v10.16b
 92 | 	mov	v9.16b, v11.16b
 93 | 	mov	v10.16b, v12.16b
 94 | 	mov	v11.16b, v13.16b
 95 | 	mov	v12.16b, v14.16b
 96 | 	mov	v13.16b, v15.16b
 97 | 	uaddw	v20.2d, v20.2d, v16.2s
 98 | 	uaddw2	v21.2d, v21.2d, v16.4s
 99 | 	uaddw	v22.2d, v22.2d, v17.2s
100 | 	uaddw2	v23.2d, v23.2d, v17.4s
101 | 	uaddw	v4.2d, v4.2d, v18.2s
102 | 	uaddw2	v5.2d, v5.2d, v18.4s
103 | 	uaddw	v6.2d, v6.2d, v19.2s
104 | 	uaddw2	v7.2d, v7.2d, v19.4s
105 | 	st1	{v20.2d-v23.2d}, [x8], #64
106 | 	st1	{v4.2d-v7.2d}, [x8], #64
107 | 	cbnz	x9, 0b
108 | 	ret
109 | 


--------------------------------------------------------------------------------
/asm/countneon_64.S:
--------------------------------------------------------------------------------
  1 | #include "kernelneon.S"
  2 | 
  3 | accum8:
  4 | 	ld1	{v4.2d-v7.2d}, [x2]
  5 | 	uaddl	v16.4s, v10.4h, v8.4h
  6 | 	uaddl2	v17.4s, v10.8h, v8.8h
  7 | 	uaddl	v18.4s, v11.4h, v9.4h
  8 | 	uaddl2	v19.4s, v11.8h, v9.8h
  9 | 	uaddl	v20.4s, v14.4h, v12.4h
 10 | 	uaddl2	v21.4s, v14.8h, v12.8h
 11 | 	uaddl	v22.4s, v15.4h, v13.4h
 12 | 	uaddl2	v23.4s, v15.8h, v13.8h
 13 | 	add	v16.4s, v16.4s, v18.4s
 14 | 	add	v17.4s, v17.4s, v19.4s
 15 | 	add	v20.4s, v20.4s, v22.4s
 16 | 	add	v21.4s, v21.4s, v23.4s
 17 | 	add	v16.4s, v16.4s, v20.4s
 18 | 	add	v17.4s, v17.4s, v21.4s
 19 | 	uaddw	v4.2d, v4.2d, v16.2s
 20 | 	uaddw2	v5.2d, v5.2d, v16.4s
 21 | 	uaddw	v6.2d, v6.2d, v17.2s
 22 | 	uaddw2	v7.2d, v7.2d, v17.4s
 23 | 	st1	{v4.2d-v7.2d}, [x2]
 24 | 	ret
 25 | 
 26 | accum16:
 27 | 	ld1	{v4.2d-v7.2d}, [x2], #64
 28 | 	uaddl	v16.4s, v10.4h, v8.4h
 29 | 	uaddl2	v17.4s, v10.8h, v8.8h
 30 | 	uaddl	v18.4s, v11.4h, v9.4h
 31 | 	uaddl2	v19.4s, v11.8h, v9.8h
 32 | 	uaddl	v20.4s, v14.4h, v12.4h
 33 | 	uaddl2	v21.4s, v14.8h, v12.8h
 34 | 	uaddl	v22.4s, v15.4h, v13.4h
 35 | 	uaddl2	v23.4s, v15.8h, v13.8h
 36 | 	add	v16.4s, v16.4s, v20.4s
 37 | 	add	v17.4s, v17.4s, v21.4s
 38 | 	add	v18.4s, v18.4s, v22.4s
 39 | 	add	v19.4s, v19.4s, v23.4s
 40 | 	ld1	{v20.2d-v23.2d}, [x2]
 41 | 	sub	x2, x2, #0x40
 42 | 	uaddw	v4.2d, v4.2d, v16.2s
 43 | 	uaddw2	v5.2d, v5.2d, v16.4s
 44 | 	uaddw	v6.2d, v6.2d, v17.2s
 45 | 	uaddw2	v7.2d, v7.2d, v17.4s
 46 | 	uaddw	v20.2d, v20.2d, v18.2s
 47 | 	uaddw2	v21.2d, v21.2d, v18.4s
 48 | 	uaddw	v22.2d, v22.2d, v19.2s
 49 | 	uaddw2	v23.2d, v23.2d, v19.4s
 50 | 	st1	{v4.2d-v7.2d}, [x2], #64
 51 | 	st1	{v20.2d-v23.2d}, [x2]
 52 | 	sub	x2, x2, #0x40
 53 | 	ret
 54 | 
 55 | accum32:
 56 | 	mov	x7, x2
 57 | 	mov	x8, x2
 58 | 	mov	x9, #2
 59 | 
 60 | 0:	ld1	{v20.2d-v23.2d}, [x7], #64
 61 | 	ld1	{v4.2d-v7.2d}, [x7], #64
 62 | 	sub	x9, x9, #0x1
 63 | 	uaddl	v16.4s, v12.4h, v8.4h
 64 | 	uaddl2	v17.4s, v12.8h, v8.8h
 65 | 	uaddl	v18.4s, v13.4h, v9.4h
 66 | 	uaddl2	v19.4s, v13.8h, v9.8h
 67 | 	mov	v8.16b, v10.16b
 68 | 	mov	v9.16b, v11.16b
 69 | 	mov	v12.16b, v14.16b
 70 | 	mov	v13.16b, v15.16b
 71 | 	uaddw	v20.2d, v20.2d, v16.2s
 72 | 	uaddw2	v21.2d, v21.2d, v16.4s
 73 | 	uaddw	v22.2d, v22.2d, v17.2s
 74 | 	uaddw2	v23.2d, v23.2d, v17.4s
 75 | 	uaddw	v4.2d, v4.2d, v18.2s
 76 | 	uaddw2	v5.2d, v5.2d, v18.4s
 77 | 	uaddw	v6.2d, v6.2d, v19.2s
 78 | 	uaddw2	v7.2d, v7.2d, v19.4s
 79 | 	st1	{v20.2d-v23.2d}, [x8], #64
 80 | 	st1	{v4.2d-v7.2d}, [x8], #64
 81 | 	cbnz	x9, 0b
 82 | 	ret
 83 | 
 84 | accum64:
 85 | 	mov	x7, x2
 86 | 	mov	x8, x2
 87 | 	mov	x9, #4
 88 | 
 89 | 0:	ld1	{v20.2d-v23.2d}, [x7], #64
 90 | 	ld1	{v4.2d-v7.2d}, [x7], #64
 91 | 	sub	x9, x9, #0x1
 92 | 	uxtl	v16.4s, v8.4h
 93 | 	uxtl2	v17.4s, v8.8h
 94 | 	uxtl	v18.4s, v9.4h
 95 | 	uxtl2	v19.4s, v9.8h
 96 | 	mov	v8.16b, v10.16b
 97 | 	mov	v9.16b, v11.16b
 98 | 	mov	v10.16b, v12.16b
 99 | 	mov	v11.16b, v13.16b
100 | 	mov	v12.16b, v14.16b
101 | 	mov	v13.16b, v15.16b
102 | 	uaddw	v20.2d, v20.2d, v16.2s
103 | 	uaddw2	v21.2d, v21.2d, v16.4s
104 | 	uaddw	v22.2d, v22.2d, v17.2s
105 | 	uaddw2	v23.2d, v23.2d, v17.4s
106 | 	uaddw	v4.2d, v4.2d, v18.2s
107 | 	uaddw2	v5.2d, v5.2d, v18.4s
108 | 	uaddw	v6.2d, v6.2d, v19.2s
109 | 	uaddw2	v7.2d, v7.2d, v19.4s
110 | 	st1	{v20.2d-v23.2d}, [x8], #64
111 | 	st1	{v4.2d-v7.2d}, [x8], #64
112 | 	cbnz	x9, 0b
113 | 	ret
114 | 


--------------------------------------------------------------------------------
/asm/kernelavx2.S:
--------------------------------------------------------------------------------
  1 | # AVX2 positional popcount with 15-fold CSA reduction
  2 | # by Robert Clausecker <fuz@fuz.su>
  3 | # from github.com/clausecker/pospop@v1.3.5
  4 | # with slight copy-editing
  5 | 
  6 | 	.intel_syntax noprefix
  7 | 	.section .rodata
  8 | 	.balign 32
  9 | magic:	.quad	0x0000000000000000
 10 | 	.quad	0x0101010101010101
 11 | 	.quad	0x0202020202020202
 12 | 	.quad	0x0303030303030303
 13 | 	.quad	0x0404040404040404
 14 | 	.quad	0x0505050505050505
 15 | 	.quad	0x0606060606060606
 16 | 	.quad	0x0707070707070707
 17 | 	.quad	0x8040201008040201
 18 | 	.long	0x55555555
 19 | 	.long	0x33333333
 20 | 	.long	0x0f0f0f0f
 21 | 	.long	0x00ff00ff
 22 | 	.size	magic, .-magic
 23 | 
 24 | window:	.quad	0, 0, 0, 0, -1, -1, -1, -1
 25 | 	.size	window, .-window
 26 | 
 27 | 	.section .text
 28 | 	.balign	16
 29 | countavx2:
 30 | 	.type   countavx2, @function
 31 | 	push    rbp
 32 | 	mov     rbp, rsp
 33 | 	cmp     rcx, 480
 34 | 	jl      .L22056
 35 | 	mov     edx, esi
 36 | 	and     edx, 0x1F
 37 | 	mov     eax, 32
 38 | 	sub     eax, edx
 39 | 	sub     rsi, rdx
 40 | 	vmovdqa ymm0, [rsi]
 41 | 	add     rcx, rdx
 42 | 	lea     rdx, [window+rip]
 43 | 	vpand   ymm0, ymm0, [rdx+rax]
 44 | 	vmovdqa ymm1, [rsi+0x20]
 45 | 	vmovdqa ymm4, [rsi+0x40]
 46 | 	vmovdqa ymm2, [rsi+0x60]
 47 | 	vmovdqa ymm3, [rsi+0x80]
 48 | 	vmovdqa ymm5, [rsi+0xA0]
 49 | 	vmovdqa ymm6, [rsi+0xC0]
 50 | 	vpand   ymm7, ymm1, ymm0
 51 | 	vpxor   ymm0, ymm1, ymm0
 52 | 	vpand   ymm1, ymm4, ymm0
 53 | 	vpxor   ymm0, ymm4, ymm0
 54 | 	vpor    ymm1, ymm7, ymm1
 55 | 	vmovdqa ymm4, [rsi+0xE0]
 56 | 	vpand   ymm7, ymm2, ymm3
 57 | 	vpxor   ymm3, ymm2, ymm3
 58 | 	vpand   ymm2, ymm5, ymm3
 59 | 	vpxor   ymm3, ymm5, ymm3
 60 | 	vpor    ymm2, ymm7, ymm2
 61 | 	vmovdqa ymm5, [rsi+0x100]
 62 | 	vpand   ymm7, ymm3, ymm0
 63 | 	vpxor   ymm0, ymm3, ymm0
 64 | 	vpand   ymm3, ymm6, ymm0
 65 | 	vpxor   ymm0, ymm6, ymm0
 66 | 	vpor    ymm3, ymm7, ymm3
 67 | 	vmovdqa ymm6, [rsi+0x120]
 68 | 	vpand   ymm7, ymm2, ymm1
 69 | 	vpxor   ymm1, ymm2, ymm1
 70 | 	vpand   ymm2, ymm3, ymm1
 71 | 	vpxor   ymm1, ymm3, ymm1
 72 | 	vpor    ymm2, ymm7, ymm2
 73 | 	vmovdqa ymm3, [rsi+0x140]
 74 | 	vpand   ymm7, ymm4, ymm0
 75 | 	vpxor   ymm0, ymm4, ymm0
 76 | 	vpand   ymm4, ymm5, ymm0
 77 | 	vpxor   ymm0, ymm5, ymm0
 78 | 	vpor    ymm4, ymm7, ymm4
 79 | 	vmovdqa ymm5, [rsi+0x160]
 80 | 	vpand   ymm7, ymm3, ymm0
 81 | 	vpxor   ymm0, ymm3, ymm0
 82 | 	vpand   ymm3, ymm6, ymm0
 83 | 	vpxor   ymm0, ymm6, ymm0
 84 | 	vpor    ymm3, ymm7, ymm3
 85 | 	vmovdqa ymm6, [rsi+0x180]
 86 | 	vpand   ymm7, ymm3, ymm1
 87 | 	vpxor   ymm1, ymm3, ymm1
 88 | 	vpand   ymm3, ymm4, ymm1
 89 | 	vpxor   ymm1, ymm4, ymm1
 90 | 	vpor    ymm3, ymm7, ymm3
 91 | 	vmovdqa ymm4, [rsi+0x1A0]
 92 | 	vpand   ymm7, ymm5, ymm0
 93 | 	vpxor   ymm0, ymm5, ymm0
 94 | 	vpand   ymm5, ymm6, ymm0
 95 | 	vpxor   ymm0, ymm6, ymm0
 96 | 	vpor    ymm5, ymm7, ymm5
 97 | 	vmovdqa ymm6, [rsi+0x1C0]
 98 | 	vpbroadcastd ymm15, [magic+72+rip]
 99 | 	vpbroadcastd ymm13, [magic+76+rip]
100 | 	vpand   ymm7, ymm4, ymm0
101 | 	vpxor   ymm0, ymm4, ymm0
102 | 	vpand   ymm4, ymm6, ymm0
103 | 	vpxor   ymm0, ymm6, ymm0
104 | 	vpor    ymm4, ymm7, ymm4
105 | 	vpxor   ymm8, ymm8, ymm8
106 | 	vpxor   ymm9, ymm9, ymm9
107 | 	vpand   ymm7, ymm4, ymm1
108 | 	vpxor   ymm1, ymm4, ymm1
109 | 	vpand   ymm4, ymm5, ymm1
110 | 	vpxor   ymm1, ymm5, ymm1
111 | 	vpor    ymm4, ymm7, ymm4
112 | 	vpxor   ymm10, ymm10, ymm10
113 | 	vpxor   ymm11, ymm11, ymm11
114 | 	vpand   ymm7, ymm3, ymm2
115 | 	vpxor   ymm2, ymm3, ymm2
116 | 	vpand   ymm3, ymm4, ymm2
117 | 	vpxor   ymm2, ymm4, ymm2
118 | 	vpor    ymm3, ymm7, ymm3
119 | 	add     rsi, 480
120 | 	sub     rcx, 992
121 | 	jl      .L22052
122 | 	mov     eax, 65535
123 | .L22050:vmovdqa ymm4, [rsi]
124 | 	vmovdqa ymm5, [rsi+0x20]
125 | 	vmovdqa ymm6, [rsi+0x40]
126 | 	vmovdqa ymm12, [rsi+0x60]
127 | 	vmovdqa ymm14, [rsi+0x80]
128 | 	vpand   ymm7, ymm4, ymm0
129 | 	vpxor   ymm0, ymm4, ymm0
130 | 	vpand   ymm4, ymm5, ymm0
131 | 	vpxor   ymm0, ymm5, ymm0
132 | 	vpor    ymm4, ymm7, ymm4
133 | 	vmovdqa ymm5, [rsi+0xA0]
134 | 	vpand   ymm7, ymm12, ymm6
135 | 	vpxor   ymm6, ymm12, ymm6
136 | 	vpand   ymm12, ymm14, ymm6
137 | 	vpxor   ymm6, ymm14, ymm6
138 | 	vpor    ymm12, ymm7, ymm12
139 | 	vmovdqa ymm14, [rsi+0xC0]
140 | 	vpand   ymm7, ymm4, ymm1
141 | 	vpxor   ymm1, ymm4, ymm1
142 | 	vpand   ymm4, ymm12, ymm1
143 | 	vpxor   ymm1, ymm12, ymm1
144 | 	vpor    ymm4, ymm7, ymm4
145 | 	vmovdqa ymm12, [rsi+0xE0]
146 | 	vpand   ymm7, ymm5, ymm0
147 | 	vpxor   ymm0, ymm5, ymm0
148 | 	vpand   ymm5, ymm6, ymm0
149 | 	vpxor   ymm0, ymm6, ymm0
150 | 	vpor    ymm5, ymm7, ymm5
151 | 	vmovdqa ymm6, [rsi+0x100]
152 | 	vpand   ymm7, ymm12, ymm6
153 | 	vpxor   ymm6, ymm12, ymm6
154 | 	vpand   ymm12, ymm14, ymm6
155 | 	vpxor   ymm6, ymm14, ymm6
156 | 	vpor    ymm12, ymm7, ymm12
157 | 	vmovdqa ymm14, [rsi+0x120]
158 | 	vpand   ymm7, ymm5, ymm1
159 | 	vpxor   ymm1, ymm5, ymm1
160 | 	vpand   ymm5, ymm12, ymm1
161 | 	vpxor   ymm1, ymm12, ymm1
162 | 	vpor    ymm5, ymm7, ymm5
163 | 	vmovdqa ymm12, [rsi+0x140]
164 | 	vpand   ymm7, ymm12, ymm0
165 | 	vpxor   ymm0, ymm12, ymm0
166 | 	vpand   ymm12, ymm14, ymm0
167 | 	vpxor   ymm0, ymm14, ymm0
168 | 	vpor    ymm12, ymm7, ymm12
169 | 	vmovdqa ymm14, [rsi+0x160]
170 | 	vpand   ymm7, ymm4, ymm2
171 | 	vpxor   ymm2, ymm4, ymm2
172 | 	vpand   ymm4, ymm5, ymm2
173 | 	vpxor   ymm2, ymm5, ymm2
174 | 	vpor    ymm4, ymm7, ymm4
175 | 	vmovdqa ymm5, [rsi+0x180]
176 | 	vpand   ymm7, ymm6, ymm0
177 | 	vpxor   ymm0, ymm6, ymm0
178 | 	vpand   ymm6, ymm14, ymm0
179 | 	vpxor   ymm0, ymm14, ymm0
180 | 	vpor    ymm6, ymm7, ymm6
181 | 	vmovdqa ymm14, [rsi+0x1A0]
182 | 	vpand   ymm7, ymm6, ymm1
183 | 	vpxor   ymm1, ymm6, ymm1
184 | 	vpand   ymm6, ymm12, ymm1
185 | 	vpxor   ymm1, ymm12, ymm1
186 | 	vpor    ymm6, ymm7, ymm6
187 | 	vmovdqa ymm12, [rsi+0x1C0]
188 | 	vpand   ymm7, ymm12, ymm5
189 | 	vpxor   ymm5, ymm12, ymm5
190 | 	vpand   ymm12, ymm14, ymm5
191 | 	vpxor   ymm5, ymm14, ymm5
192 | 	vpor    ymm12, ymm7, ymm12
193 | 	vmovdqa ymm14, [rsi+0x1E0]
194 | 	vpand   ymm7, ymm5, ymm0
195 | 	vpxor   ymm0, ymm5, ymm0
196 | 	vpand   ymm5, ymm14, ymm0
197 | 	vpxor   ymm0, ymm14, ymm0
198 | 	vpor    ymm5, ymm7, ymm5
199 | 	add     rsi, 512
200 | 	prefetcht0 [rsi]
201 | 	prefetcht0 [rsi+0x20]
202 | 	vpand   ymm7, ymm5, ymm1
203 | 	vpxor   ymm1, ymm5, ymm1
204 | 	vpand   ymm5, ymm12, ymm1
205 | 	vpxor   ymm1, ymm12, ymm1
206 | 	vpor    ymm5, ymm7, ymm5
207 | 	vpand   ymm7, ymm5, ymm2
208 | 	vpxor   ymm2, ymm5, ymm2
209 | 	vpand   ymm5, ymm6, ymm2
210 | 	vpxor   ymm2, ymm6, ymm2
211 | 	vpor    ymm5, ymm7, ymm5
212 | 	vpand   ymm7, ymm4, ymm3
213 | 	vpxor   ymm3, ymm4, ymm3
214 | 	vpand   ymm4, ymm5, ymm3
215 | 	vpxor   ymm3, ymm5, ymm3
216 | 	vpor    ymm4, ymm7, ymm4
217 | 	vpbroadcastd ymm12, [magic+84+rip]
218 | 	vpbroadcastd ymm14, [magic+80+rip]
219 | 	vpand   ymm5, ymm15, ymm4
220 | 	vpandn  ymm6, ymm15, ymm4
221 | 	vpsrld  ymm6, ymm6, 1
222 | 	vperm2i128 ymm4, ymm5, ymm6, 0x20
223 | 	vperm2i128 ymm5, ymm5, ymm6, 0x31
224 | 	vpaddd  ymm4, ymm4, ymm5
225 | 	vpand   ymm5, ymm13, ymm4
226 | 	vpandn  ymm6, ymm13, ymm4
227 | 	vpsrld  ymm6, ymm6, 2
228 | 	vpunpcklqdq ymm4, ymm5, ymm6
229 | 	vpunpckhqdq ymm5, ymm5, ymm6
230 | 	vpaddd  ymm4, ymm4, ymm5
231 | 	vpand   ymm5, ymm14, ymm4
232 | 	vpandn  ymm6, ymm14, ymm4
233 | 	vpslld  ymm5, ymm5, 4
234 | 	vperm2i128 ymm4, ymm5, ymm6, 0x20
235 | 	vperm2i128 ymm5, ymm5, ymm6, 0x31
236 | 	vpunpcklwd ymm6, ymm4, ymm5
237 | 	vpunpckhwd ymm7, ymm4, ymm5
238 | 	vpunpckldq ymm4, ymm6, ymm7
239 | 	vpunpckhdq ymm5, ymm6, ymm7
240 | 	vpermq  ymm4, ymm4, 0xD8
241 | 	vpermq  ymm5, ymm5, 0xD8
242 | 	vpand   ymm6, ymm12, ymm4
243 | 	vpand   ymm7, ymm12, ymm5
244 | 	vpaddw  ymm8, ymm8, ymm6
245 | 	vpaddw  ymm10, ymm10, ymm7
246 | 	vpsrlw  ymm4, ymm4, 8
247 | 	vpsrlw  ymm5, ymm5, 8
248 | 	vpaddw  ymm9, ymm9, ymm4
249 | 	vpaddw  ymm11, ymm11, ymm5
250 | 	sub     eax, 64
251 | 	cmp     eax, 184
252 | 	jge     .L22051
253 | 	vpxor   ymm7, ymm7, ymm7
254 | 	call    rbx
255 | 	vpxor   ymm8, ymm8, ymm8
256 | 	vpxor   ymm9, ymm9, ymm9
257 | 	vpxor   ymm10, ymm10, ymm10
258 | 	vpxor   ymm11, ymm11, ymm11
259 | 	mov     eax, 65535
260 | .L22051:sub     rcx, 512
261 | 	jge     .L22050
262 | .L22052:vpbroadcastd ymm14, [magic+80+rip]
263 | 	vpand   ymm5, ymm15, ymm1
264 | 	vpaddd  ymm5, ymm5, ymm5
265 | 	vpand   ymm7, ymm15, ymm3
266 | 	vpaddd  ymm7, ymm7, ymm7
267 | 	vpand   ymm4, ymm15, ymm0
268 | 	vpand   ymm6, ymm15, ymm2
269 | 	vpor    ymm4, ymm5, ymm4
270 | 	vpor    ymm5, ymm7, ymm6
271 | 	vpandn  ymm0, ymm15, ymm0
272 | 	vpsrld  ymm0, ymm0, 1
273 | 	vpandn  ymm2, ymm15, ymm2
274 | 	vpsrld  ymm2, ymm2, 1
275 | 	vpandn  ymm1, ymm15, ymm1
276 | 	vpandn  ymm3, ymm15, ymm3
277 | 	vpor    ymm6, ymm1, ymm0
278 | 	vpor    ymm7, ymm3, ymm2
279 | 	vpand   ymm1, ymm13, ymm5
280 | 	vpslld  ymm1, ymm1, 2
281 | 	vpand   ymm3, ymm13, ymm7
282 | 	vpslld  ymm3, ymm3, 2
283 | 	vpand   ymm0, ymm13, ymm4
284 | 	vpand   ymm2, ymm13, ymm6
285 | 	vpor    ymm0, ymm1, ymm0
286 | 	vpor    ymm1, ymm3, ymm2
287 | 	vpandn  ymm4, ymm13, ymm4
288 | 	vpsrld  ymm4, ymm4, 2
289 | 	vpandn  ymm6, ymm13, ymm6
290 | 	vpsrld  ymm6, ymm6, 2
291 | 	vpandn  ymm5, ymm13, ymm5
292 | 	vpandn  ymm7, ymm13, ymm7
293 | 	vpor    ymm2, ymm5, ymm4
294 | 	vpor    ymm3, ymm7, ymm6
295 | 	vpunpcklbw ymm5, ymm0, ymm1
296 | 	vpunpckhbw ymm0, ymm0, ymm1
297 | 	vpunpcklbw ymm6, ymm2, ymm3
298 | 	vpunpckhbw ymm1, ymm2, ymm3
299 | 	vpunpcklwd ymm4, ymm5, ymm6
300 | 	vpunpckhwd ymm5, ymm5, ymm6
301 | 	vpunpcklwd ymm6, ymm0, ymm1
302 | 	vpunpckhwd ymm7, ymm0, ymm1
303 | 	vpand   ymm0, ymm14, ymm4
304 | 	vpsrld  ymm4, ymm4, 4
305 | 	vpand   ymm4, ymm14, ymm4
306 | 	vpand   ymm1, ymm14, ymm5
307 | 	vpsrld  ymm5, ymm5, 4
308 | 	vpand   ymm5, ymm14, ymm5
309 | 	vpand   ymm2, ymm14, ymm6
310 | 	vpsrld  ymm6, ymm6, 4
311 | 	vpand   ymm6, ymm14, ymm6
312 | 	vpand   ymm3, ymm14, ymm7
313 | 	vpsrld  ymm7, ymm7, 4
314 | 	vpand   ymm7, ymm14, ymm7
315 | 	vpaddb  ymm0, ymm0, ymm2
316 | 	vpaddb  ymm1, ymm1, ymm3
317 | 	vpaddb  ymm2, ymm4, ymm6
318 | 	vpaddb  ymm3, ymm5, ymm7
319 | 	vpunpckldq ymm4, ymm0, ymm2
320 | 	vpunpckhdq ymm5, ymm0, ymm2
321 | 	vpunpckldq ymm6, ymm1, ymm3
322 | 	vpunpckhdq ymm7, ymm1, ymm3
323 | 	vperm2i128 ymm0, ymm4, ymm5, 0x20
324 | 	vperm2i128 ymm2, ymm4, ymm5, 0x31
325 | 	vperm2i128 ymm1, ymm6, ymm7, 0x20
326 | 	vperm2i128 ymm3, ymm6, ymm7, 0x31
327 | 	vpaddb  ymm0, ymm0, ymm2
328 | 	vpaddb  ymm1, ymm1, ymm3
329 | 	vpxor   ymm7, ymm7, ymm7
330 | 	vpunpcklbw ymm4, ymm0, ymm7
331 | 	vpunpckhbw ymm5, ymm0, ymm7
332 | 	vpunpcklbw ymm6, ymm1, ymm7
333 | 	vpunpckhbw ymm1, ymm1, ymm7
334 | 	vpaddw  ymm8, ymm8, ymm4
335 | 	vpaddw  ymm9, ymm9, ymm5
336 | 	vpaddw  ymm10, ymm10, ymm6
337 | 	vpaddw  ymm11, ymm11, ymm1
338 | 	cmp     ecx, -512
339 | 	je      .L22055
340 | 	vpbroadcastq ymm2, [magic+64+rip]
341 | 	vmovdqu ymm3, [magic+rip]
342 | 	vmovdqu ymm7, [magic+32+rip]
343 | 	vpxor   ymm0, ymm0, ymm0
344 | 	vpxor   ymm1, ymm1, ymm1
345 | 	sub     ecx, -504
346 | 	jle     .L22054
347 | .L22053:vpbroadcastq ymm4, [rsi]
348 | 	vpshufb ymm5, ymm4, ymm7
349 | 	vpshufb ymm4, ymm4, ymm3
350 | 	vpand   ymm5, ymm5, ymm2
351 | 	vpand   ymm4, ymm4, ymm2
352 | 	vpcmpeqb ymm5, ymm5, ymm2
353 | 	vpcmpeqb ymm4, ymm4, ymm2
354 | 	vpsubb  ymm1, ymm1, ymm5
355 | 	vpsubb  ymm0, ymm0, ymm4
356 | 	add     rsi, 8
357 | 	sub     ecx, 8
358 | 	jg      .L22053
359 | .L22054:lea     ecx, [rcx*8+0x40]
360 | 	bzhi    rax, [rsi], rcx
361 | 	vmovq   xmm6, rax
362 | 	vpbroadcastq ymm4, xmm6
363 | 	vpshufb ymm5, ymm4, ymm7
364 | 	vpshufb ymm4, ymm4, ymm3
365 | 	vpand   ymm5, ymm5, ymm2
366 | 	vpand   ymm4, ymm4, ymm2
367 | 	vpcmpeqb ymm5, ymm5, ymm2
368 | 	vpcmpeqb ymm4, ymm4, ymm2
369 | 	vpsubb  ymm1, ymm1, ymm5
370 | 	vpsubb  ymm0, ymm0, ymm4
371 | 	vpxor   ymm7, ymm7, ymm7
372 | 	vpunpcklbw ymm4, ymm0, ymm7
373 | 	vpunpckhbw ymm5, ymm0, ymm7
374 | 	vpunpcklbw ymm6, ymm1, ymm7
375 | 	vpunpckhbw ymm7, ymm1, ymm7
376 | 	vpaddw  ymm8, ymm8, ymm4
377 | 	vpaddw  ymm9, ymm9, ymm5
378 | 	vpaddw  ymm10, ymm10, ymm6
379 | 	vpaddw  ymm11, ymm11, ymm7
380 | .L22055:vpxor   ymm7, ymm7, ymm7
381 | 	call    rbx
382 | 	vzeroupper
383 | 	pop     rbp
384 | 	ret
385 | 
386 | .L22056:
387 | 	vpbroadcastq ymm2, [magic+64+rip]
388 | 	vmovdqu ymm3, [magic+rip]
389 | 	vmovdqu ymm7, [magic+32+rip]
390 | 	vpxor   ymm0, ymm0, ymm0
391 | 	vpxor   ymm1, ymm1, ymm1
392 | 	sub     ecx, 8
393 | 	jl      .L22058
394 | .L22057:vpbroadcastq ymm4, [rsi]
395 | 	vpshufb ymm5, ymm4, ymm7
396 | 	vpshufb ymm4, ymm4, ymm3
397 | 	vpand   ymm5, ymm5, ymm2
398 | 	vpand   ymm4, ymm4, ymm2
399 | 	vpcmpeqb ymm5, ymm5, ymm2
400 | 	vpcmpeqb ymm4, ymm4, ymm2
401 | 	vpsubb  ymm1, ymm1, ymm5
402 | 	vpsubb  ymm0, ymm0, ymm4
403 | 	add     rsi, 8
404 | 	sub     ecx, 8
405 | 	jge     .L22057
406 | .L22058:cmp     ecx, -8
407 | 	jle     .L22061
408 | 	lea     edx, [rsi+rcx+0x7]
409 | 	xor     edx, esi
410 | 	lea     ecx, [rcx*8+0x40]
411 | 	test    edx, 0x8
412 | 	jnz     .L22059
413 | 	lea     eax, [rsi*8]
414 | 	and     rsi, 0xFFFFFFFFFFFFFFF8
415 | 	mov     r8, [rsi]
416 | 	shrx    r8, r8, rax
417 | 	bzhi    r8, r8, rcx
418 | 	jmp     .L22060
419 | 
420 | .L22059:bzhi    r8, [rsi], rcx
421 | .L22060:vmovq   xmm6, r8
422 | 	vpbroadcastq ymm4, xmm6
423 | 	vpshufb ymm5, ymm4, ymm7
424 | 	vpshufb ymm4, ymm4, ymm3
425 | 	vpand   ymm5, ymm5, ymm2
426 | 	vpand   ymm4, ymm4, ymm2
427 | 	vpcmpeqb ymm5, ymm5, ymm2
428 | 	vpcmpeqb ymm4, ymm4, ymm2
429 | 	vpsubb  ymm1, ymm1, ymm5
430 | 	vpsubb  ymm0, ymm0, ymm4
431 | .L22061:vpxor   ymm7, ymm7, ymm7
432 | 	vpunpcklbw ymm8, ymm0, ymm7
433 | 	vpunpckhbw ymm9, ymm0, ymm7
434 | 	vpunpcklbw ymm10, ymm1, ymm7
435 | 	vpunpckhbw ymm11, ymm1, ymm7
436 | 	call    rbx
437 | 	vzeroupper
438 | 	pop     rbp
439 | 	ret
440 | 	.size	countavx2, .-countavx2
441 | 
442 | 	.balign 16
443 | 	.globl count8avx2
444 | 	.type   count8avx2, @function
445 | count8avx2:
446 | 	push	rbp
447 | 	mov	rbp, rsp
448 | 	push	rbx
449 | 	mov	rcx, rdx
450 | 	lea     rbx, [accum8+rip]
451 | 	call    countavx2
452 | 	pop	rbx
453 | 	pop	rbp
454 | 	ret
455 | 	.size   count8avx2, .-count8avx2
456 | 
457 | 	.balign 16
458 | 	.globl count16avx2
459 | 	.type   count16avx2, @function
460 | count16avx2:
461 | 	push	rbp
462 | 	mov	rbp, rsp
463 | 	push	rbx
464 | 	mov	rcx, rdx
465 | 	lea     rbx, [accum16+rip]
466 | 	shl     rcx, 1
467 | 	call    countavx2
468 | 	pop	rbx
469 | 	pop	rbp
470 | 	ret
471 | 	.size   count16avx2, .-count16avx2
472 | 
473 | 	.balign 16
474 | 	.globl count32avx2
475 | 	.type   count32avx2, @function
476 | count32avx2:
477 | 	push	rbp
478 | 	mov	rbp, rsp
479 | 	push	rbx
480 | 	mov	rcx, rdx
481 | 	lea     rbx, [accum32+rip]
482 | 	shl	rcx, 2
483 | 	call    countavx2
484 | 	pop	rbx
485 | 	pop	rbp
486 | 	ret
487 | 	.size   count32avx2, .-count32avx2
488 | 
489 | 	.balign 16
490 | 	.globl count64avx2
491 | 	.type   count64avx2, @function
492 | count64avx2:
493 | 	push	rbp
494 | 	mov	rbp, rsp
495 | 	push	rbx
496 | 	mov	rcx, rdx
497 | 	lea     rbx, [accum64+rip]
498 | 	shl	rcx, 3
499 | 	call    countavx2
500 | 	pop	rbx
501 | 	pop	rbp
502 | 	ret
503 | 	.size   count64avx2, .-count64avx2
504 | 
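
The repeated five-instruction `vpand` / `vpxor` / `vpand` / `vpxor` / `vpor` groups in `countavx2` are carry-save adder (CSA) steps: a full adder applied to every bit position in parallel. The C sketch below does the same on a single 64-bit word. The helper name `csa64` is an assumption made for illustration and does not appear in the repository.

```c
#include <stdint.h>

/* One CSA step on 64 independent bit lanes (illustrative helper).
 * For every bit position: a + b + c == sum + 2 * carry. */
static void csa64(uint64_t a, uint64_t b, uint64_t c,
                  uint64_t *sum, uint64_t *carry)
{
    uint64_t both = a & b;        /* vpand: carry out of a + b      */
    uint64_t half = a ^ b;        /* vpxor: sum bit of a + b        */
    *carry = both | (c & half);   /* vpand, vpor: majority(a, b, c) */
    *sum   = half ^ c;            /* vpxor: a ^ b ^ c               */
}
```

Feeding the three truth-table inputs `0xF0`, `0xCC` and `0xAA` gives a sum of `0x96` and a carry of `0xE8`. These are the `vpternlogd` immediates that kernelavx512.S uses for the same step, where one ternary-logic instruction replaces each XOR pair and each majority computation.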


--------------------------------------------------------------------------------
/asm/kernelavx512.S:
--------------------------------------------------------------------------------
  1 | # AVX-512 positional popcount with 15-fold CSA reduction
  2 | # by Robert Clausecker <fuz@fuz.su>
  3 | # from github.com/clausecker/pospop@1.3.0
  4 | # with slight copy-editing
  5 | 
  6 | 	.intel_syntax noprefix
  7 | 	.section .rodata
  8 | 	.balign	32
  9 | magic:	.long 0x55555555
 10 | 	.long 0x33333333
 11 | 	.long 0x0f0f0f0f
 12 | 	.long 0x00ff00ff
 13 | 	.quad 0x1c1814100c080400
 14 | 	.quad 0x1d1915110d090501
 15 | 	.quad 0x1e1a16120e0a0602
 16 | 	.quad 0x1f1b17130f0b0703
 17 | 	.size magic, .-magic
 18 | 
 19 | 	.section .text
 20 | 	.type countavx512, @function
 21 | 	.balign	16
 22 | countavx512:
 23 | 	push    rbp
 24 | 	mov     rbp, rsp
 25 | 	vpternlogd zmm30, zmm30, zmm30, 0xFF
 26 | 	vpxord  ymm25, ymm25, ymm25
 27 | 	cmp     rcx, 960
 28 | 	jl      .L22072
 29 | 	mov     rax, -1
 30 | 	shlx    rax, rax, rsi
 31 | 	kmovq   k1, rax
 32 | 	add     rcx, rsi
 33 | 	and     rsi, 0xFFFFFFFFFFFFFFC0
 34 | 	sub     rcx, rsi
 35 | 	vmovdqu8 zmm0 {k1}{z}, [rsi]
 36 | 	vmovdqa64 zmm1, [rsi+0x1*0x40]
 37 | 	vmovdqa64 zmm4, [rsi+0x2*0x40]
 38 | 	vpxor   ymm8, ymm8, ymm8
 39 | 	vpxor   ymm9, ymm9, ymm9
 40 | 	vmovdqa64 zmm2, [rsi+0x3*0x40]
 41 | 	vmovdqa64 zmm3, [rsi+0x4*0x40]
 42 | 	vmovdqa64 zmm5, [rsi+0x5*0x40]
 43 | 	vmovdqa64 zmm22, zmm0
 44 | 	vpternlogd zmm0, zmm1, zmm4, 0x96
 45 | 	vpternlogd zmm1, zmm22, zmm4, 0xE8
 46 | 	vmovdqa64 zmm6, [rsi+0x6*0x40]
 47 | 	vmovdqa64 zmm7, [rsi+0x7*0x40]
 48 | 	vmovdqa64 zmm10, [rsi+0x8*0x40]
 49 | 	vmovdqa64 zmm22, zmm2
 50 | 	vpternlogd zmm2, zmm3, zmm5, 0x96
 51 | 	vpternlogd zmm3, zmm22, zmm5, 0xE8
 52 | 	vmovdqa64 zmm11, [rsi+0x9*0x40]
 53 | 	vmovdqa64 zmm12, [rsi+0xA*0x40]
 54 | 	vmovdqa64 zmm13, [rsi+0xB*0x40]
 55 | 	vmovdqa64 zmm22, zmm6
 56 | 	vpternlogd zmm6, zmm7, zmm10, 0x96
 57 | 	vpternlogd zmm7, zmm22, zmm10, 0xE8
 58 | 	vmovdqa64 zmm4, [rsi+0xC*0x40]
 59 | 	vmovdqa64 zmm5, [rsi+0xD*0x40]
 60 | 	vmovdqa64 zmm10, [rsi+0xE*0x40]
 61 | 	vmovdqa64 zmm22, zmm11
 62 | 	vpternlogd zmm11, zmm12, zmm13, 0x96
 63 | 	vpternlogd zmm12, zmm22, zmm13, 0xE8
 64 | 	vpbroadcastd zmm28, [magic+rip]
 65 | 	vpbroadcastd zmm27, [magic+4+rip]
 66 | 	vpbroadcastd zmm26, [magic+8+rip]
 67 | 	vmovdqa64 zmm22, zmm4
 68 | 	vpternlogd zmm4, zmm5, zmm10, 0x96
 69 | 	vpternlogd zmm5, zmm22, zmm10, 0xE8
 70 | 	vmovdqa64 zmm22, zmm0
 71 | 	vpternlogd zmm0, zmm2, zmm6, 0x96
 72 | 	vpternlogd zmm2, zmm22, zmm6, 0xE8
 73 | 	vmovdqa64 zmm22, zmm1
 74 | 	vpternlogd zmm1, zmm3, zmm7, 0x96
 75 | 	vpternlogd zmm3, zmm22, zmm7, 0xE8
 76 | 	vmovdqa64 zmm22, zmm0
 77 | 	vpternlogd zmm0, zmm11, zmm4, 0x96
 78 | 	vpternlogd zmm11, zmm22, zmm4, 0xE8
 79 | 	vmovdqa64 zmm22, zmm2
 80 | 	vpternlogd zmm2, zmm12, zmm5, 0x96
 81 | 	vpternlogd zmm12, zmm22, zmm5, 0xE8
 82 | 	vmovdqa64 zmm22, zmm1
 83 | 	vpternlogd zmm1, zmm2, zmm11, 0x96
 84 | 	vpternlogd zmm2, zmm22, zmm11, 0xE8
 85 | 	vmovdqa64 zmm22, zmm2
 86 | 	vpternlogd zmm2, zmm3, zmm12, 0x96
 87 | 	vpternlogd zmm3, zmm22, zmm12, 0xE8
 88 | 	add     rsi, 960
 89 | 	sub     rcx, 1984
 90 | 	jl      .L22068
 91 | 	vpbroadcastd zmm24, [magic+12+rip]
 92 | 
 93 | 	vpmovzxbw zmm23, [magic+16+rip]
 94 | 	mov     eax, 65535
 95 | .L22066:vmovdqa64	zmm4, [rsi]
 96 | 	vmovdqa64 zmm5, [rsi+0x1*0x40]
 97 | 	vmovdqa64 zmm6, [rsi+0x2*0x40]
 98 | 	vmovdqa64 zmm7, [rsi+0x3*0x40]
 99 | 	vmovdqa64 zmm10, [rsi+0x4*0x40]
100 | 	vmovdqa64 zmm22, zmm0
101 | 	vpternlogd zmm0, zmm4, zmm5, 0x96
102 | 	vpternlogd zmm4, zmm22, zmm5, 0xE8
103 | 	vmovdqa64 zmm5, [rsi+0x5*0x40]
104 | 	vmovdqa64 zmm11, [rsi+0x6*0x40]
105 | 	vmovdqa64 zmm12, [rsi+0x7*0x40]
106 | 	vmovdqa64 zmm22, zmm6
107 | 	vpternlogd zmm6, zmm7, zmm10, 0x96
108 | 	vpternlogd zmm7, zmm22, zmm10, 0xE8
109 | 	vmovdqa64 zmm10, [rsi+0x8*0x40]
110 | 	vmovdqa64 zmm13, [rsi+0x9*0x40]
111 | 	vmovdqa64 zmm14, [rsi+0xA*0x40]
112 | 	vmovdqa64 zmm22, zmm5
113 | 	vpternlogd zmm5, zmm11, zmm12, 0x96
114 | 	vpternlogd zmm11, zmm22, zmm12, 0xE8
115 | 	vmovdqa64 zmm12, [rsi+0xB*0x40]
116 | 	vmovdqa64 zmm15, [rsi+0xC*0x40]
117 | 	vmovdqa64 zmm16, [rsi+0xD*0x40]
118 | 	vmovdqa64 zmm22, zmm10
119 | 	vpternlogd zmm10, zmm13, zmm14, 0x96
120 | 	vpternlogd zmm13, zmm22, zmm14, 0xE8
121 | 	vmovdqa64 zmm14, [rsi+0xE*0x40]
122 | 	vmovdqa64 zmm17, [rsi+0xF*0x40]
123 | 	vmovdqa64 zmm22, zmm12
124 | 	vpternlogd zmm12, zmm15, zmm16, 0x96
125 | 	vpternlogd zmm15, zmm22, zmm16, 0xE8
126 | 	add     rsi, 1024
127 | 	prefetcht0 [rsi]
128 | 	vmovdqa64 zmm22, zmm0
129 | 	vpternlogd zmm0, zmm5, zmm6, 0x96
130 | 	vpternlogd zmm5, zmm22, zmm6, 0xE8
131 | 	prefetcht0 [rsi+0x40]
132 | 	vmovdqa64 zmm22, zmm1
133 | 	vpternlogd zmm1, zmm4, zmm7, 0x96
134 | 	vpternlogd zmm4, zmm22, zmm7, 0xE8
135 | 	vmovdqa64 zmm22, zmm10
136 | 	vpternlogd zmm10, zmm12, zmm14, 0x96
137 | 	vpternlogd zmm12, zmm22, zmm14, 0xE8
138 | 	vmovdqa64 zmm22, zmm11
139 | 	vpternlogd zmm11, zmm13, zmm15, 0x96
140 | 	vpternlogd zmm13, zmm22, zmm15, 0xE8
141 | 	vmovdqa64 zmm22, zmm0
142 | 	vpternlogd zmm0, zmm10, zmm17, 0x96
143 | 	vpternlogd zmm10, zmm22, zmm17, 0xE8
144 | 	vmovdqa64 zmm22, zmm1
145 | 	vpternlogd zmm1, zmm5, zmm11, 0x96
146 | 	vpternlogd zmm5, zmm22, zmm11, 0xE8
147 | 	vmovdqa64 zmm22, zmm2
148 | 	vpternlogd zmm2, zmm4, zmm13, 0x96
149 | 	vpternlogd zmm4, zmm22, zmm13, 0xE8
150 | 	vmovdqa64 zmm22, zmm1
151 | 	vpternlogd zmm1, zmm10, zmm12, 0x96
152 | 	vpternlogd zmm10, zmm22, zmm12, 0xE8
153 | 	vmovdqa64 zmm22, zmm2
154 | 	vpternlogd zmm2, zmm5, zmm10, 0x96
155 | 	vpternlogd zmm5, zmm22, zmm10, 0xE8
156 | 	vmovdqa64 zmm22, zmm3
157 | 	vpternlogd zmm3, zmm4, zmm5, 0x96
158 | 	vpternlogd zmm4, zmm22, zmm5, 0xE8
159 | 	vpandd  zmm5, zmm28, zmm4
160 | 	vpandnd zmm6, zmm28, zmm4
161 | 	vpsrld  zmm6, zmm6, 1
162 | 	vshufi64x2 zmm10, zmm5, zmm6, 0x44
163 | 	vshufi64x2 zmm11, zmm5, zmm6, 0xEE
164 | 	vpaddd  zmm4, zmm11, zmm10
165 | 	vpandd  zmm5, zmm27, zmm4
166 | 	vpandnd zmm6, zmm27, zmm4
167 | 	vpsrld  zmm6, zmm6, 2
168 | 	vshufi64x2 zmm10, zmm5, zmm6, 0x88
169 | 	vshufi64x2 zmm11, zmm5, zmm6, 0xDD
170 | 	vpaddd  zmm4, zmm11, zmm10
171 | 	vpandd  zmm5, zmm26, zmm4
172 | 	vpandnd zmm6, zmm26, zmm4
173 | 	vpslld  zmm5, zmm5, 4
174 | 	vpermq  zmm5, zmm5, 0xD8
175 | 	vpermq  zmm6, zmm6, 0xD8
176 | 	vshufi64x2 zmm10, zmm5, zmm6, 0x88
177 | 	vshufi64x2 zmm11, zmm5, zmm6, 0xDD
178 | 	vpaddd  zmm4, zmm11, zmm10
179 | 	vpsrlw  zmm6, zmm4, 8
180 | 	vpandd  zmm5, zmm24, zmm4
181 | 	vpaddw  zmm8, zmm8, zmm5
182 | 	vpaddw  zmm9, zmm9, zmm6
183 | 	sub     eax, 128
184 | 	cmp     eax, 368
185 | 	jge     .L22067
186 | 	vpermw  zmm8, zmm23, zmm8
187 | 	vpermw  zmm9, zmm23, zmm9
188 | 	call    rbx
189 | 	vpxor   ymm8, ymm8, ymm8
190 | 	vpxor   ymm9, ymm9, ymm9
191 | 	mov     eax, 65535
192 | .L22067:sub	rcx, 1024
193 | 	jge     .L22066
194 | 	vpermw  zmm8, zmm23, zmm8
195 | 	vpermw  zmm9, zmm23, zmm9
196 | .L22068:vpsrld	zmm4, zmm0, 1
197 | 	vpaddd  zmm5, zmm1, zmm1
198 | 	vpsrld  zmm6, zmm2, 1
199 | 	vpaddd  zmm7, zmm3, zmm3
200 | 	vpternlogd zmm0, zmm5, zmm28, 0xE4
201 | 	vpternlogd zmm1, zmm4, zmm28, 0xD8
202 | 	vpternlogd zmm2, zmm7, zmm28, 0xE4
203 | 	vpternlogd zmm3, zmm6, zmm28, 0xD8
204 | 	vpsrld  zmm4, zmm0, 2
205 | 	vpsrld  zmm6, zmm1, 2
206 | 	vpslld  zmm5, zmm2, 2
207 | 	vpslld  zmm7, zmm3, 2
208 | 	vpternlogd zmm2, zmm4, zmm27, 0xD8
209 | 	vpternlogd zmm3, zmm6, zmm27, 0xD8
210 | 	vpternlogd zmm0, zmm5, zmm27, 0xE4
211 | 	vpternlogd zmm1, zmm7, zmm27, 0xE4
212 | 	vpunpcklbw zmm6, zmm2, zmm3
213 | 	vpunpckhbw zmm3, zmm2, zmm3
214 | 	vpunpcklbw zmm5, zmm0, zmm1
215 | 	vpunpckhbw zmm2, zmm0, zmm1
216 | 	vpunpcklwd zmm4, zmm5, zmm6
217 | 	vpunpckhwd zmm5, zmm5, zmm6
218 | 	vpunpcklwd zmm6, zmm2, zmm3
219 | 	vpunpckhwd zmm7, zmm2, zmm3
220 | 	vpandd  zmm0, zmm4, zmm26
221 | 	vpsrld  zmm4, zmm4, 4
222 | 	vpandd  zmm4, zmm4, zmm26
223 | 	vpandd  zmm1, zmm5, zmm26
224 | 	vpsrld  zmm5, zmm5, 4
225 | 	vpandd  zmm5, zmm5, zmm26
226 | 	vpandd  zmm2, zmm6, zmm26
227 | 	vpsrld  zmm6, zmm6, 4
228 | 	vpandd  zmm6, zmm6, zmm26
229 | 	vpandd  zmm3, zmm7, zmm26
230 | 	vpsrld  zmm7, zmm7, 4
231 | 	vpandd  zmm7, zmm7, zmm26
232 | 	vpaddb  zmm0, zmm0, zmm2
233 | 	vpaddb  zmm1, zmm1, zmm3
234 | 	vpaddb  zmm2, zmm4, zmm6
235 | 	vpaddb  zmm3, zmm5, zmm7
236 | 	vpunpckldq zmm4, zmm0, zmm2
237 | 	vpunpckhdq zmm5, zmm0, zmm2
238 | 	vpunpckldq zmm6, zmm1, zmm3
239 | 	vpunpckhdq zmm7, zmm1, zmm3
240 | 	vshufi64x2 zmm0, zmm4, zmm5, 0x44
241 | 	vshufi64x2 zmm1, zmm4, zmm5, 0xEE
242 | 	vshufi64x2 zmm2, zmm6, zmm7, 0x44
243 | 	vshufi64x2 zmm3, zmm6, zmm7, 0xEE
244 | 	vpaddb  zmm0, zmm0, zmm1
245 | 	vpaddb  zmm2, zmm2, zmm3
246 | 	vshufi64x2 zmm1, zmm0, zmm2, 0x88
247 | 	vshufi64x2 zmm0, zmm0, zmm2, 0xDD
248 | 	vpaddb  zmm0, zmm0, zmm1
249 | 	vpunpcklbw zmm1, zmm0, zmm25
250 | 	vpunpckhbw zmm2, zmm0, zmm25
251 | 	vpaddw  zmm8, zmm8, zmm1
252 | 	vpaddw  zmm9, zmm9, zmm2
253 | 	vpxor   ymm0, ymm0, ymm0
254 | 	cmp     ecx, -1024
255 | 	jz      .L22071
256 | 	sub     ecx, -1016
257 | 	jle     .L22070
258 | .L22069:kmovq	k1, [rsi]
259 | 	add     rsi, 8
260 | 	vpsubb  zmm0 {k1}, zmm0, zmm30
261 | 	sub     ecx, 8
262 | 	jg      .L22069
263 | .L22070:lea	ecx, [rcx*8+0x40]
264 | 	bzhi    rax, [rsi], rcx
265 | 	kmovq   k1, rax
266 | 	vpsubb  zmm0 {k1}, zmm0, zmm30
267 | 	vpunpcklbw zmm1, zmm0, zmm25
268 | 	vpunpckhbw zmm2, zmm0, zmm25
269 | 	vpaddw  zmm8, zmm8, zmm1
270 | 	vpaddw  zmm9, zmm9, zmm2
271 | .L22071:call	rbx
272 | 	vzeroupper
273 | 	pop     rbp
274 | 	ret
275 | 
276 | .L22072:vpxor   ymm0, ymm0, ymm0
277 | 	sub     ecx, 8
278 | 	jle     .L22074
279 | .L22073:kmovq   k1, [rsi]
280 | 	add     rsi, 8
281 | 	vpsubb  zmm0 {k1}, zmm0, zmm30
282 | 	sub     rcx, 8
283 | 	jg      .L22073
284 | 	lea     edx, [rcx*8]
285 | 	neg     edx
286 | 	shrx    rax, [rsi+rcx], rdx
287 | 	kmovq   k1, rax
288 | 	vpsubb  zmm0 {k1}, zmm0, zmm30
289 | 	vpunpcklbw zmm8, zmm0, zmm25
290 | 	vpunpckhbw zmm9, zmm0, zmm25
291 | 	call    rbx
292 | 	vzeroupper
293 | 	pop     rbp
294 | 	ret
295 | 
296 | .L22074:.type   .L22074, @function
297 | 	add     ecx, 8
298 | 	xor     eax, eax
299 | 	bts     eax, ecx
300 | 	dec     eax
301 | 	kmovd   k1, eax
302 | 	vmovdqu8 xmm4 {k1}{z}, [rsi]
303 | 	vmovq   rax, xmm4
304 | 	kmovq   k1, rax
305 | 	vpsubb  zmm0 {k1}, zmm0, zmm30
306 | 	vpunpcklbw zmm8, zmm0, zmm25
307 | 	vpunpckhbw zmm9, zmm0, zmm25
308 | 	call    rbx
309 | 	vzeroupper
310 | 	pop     rbp
311 | 	ret
312 | 
313 | 	.balign 16
314 | 	.globl count8avx512
315 | 	.type   count8avx512, @function
316 | count8avx512:
317 | 	push	rbp
318 | 	mov	rbp, rsp
319 | 	push	rbx
320 | 	mov	rcx, rdx
321 | 	lea     rbx, [accum8+rip]
322 | 	call    countavx512
323 | 	pop	rbx
324 | 	pop	rbp
325 | 	ret
326 | 	.size   count8avx512, .-count8avx512
327 | 
328 | 	.balign 16
329 | 	.globl count16avx512
330 | 	.type   count16avx512, @function
331 | count16avx512:
332 | 	push	rbp
333 | 	mov	rbp, rsp
334 | 	push	rbx
335 | 	mov	rcx, rdx
336 | 	lea     rbx, [accum16+rip]
337 | 	shl     rcx, 1
338 | 	call    countavx512
339 | 	pop	rbx
340 | 	pop	rbp
341 | 	ret
342 | 	.size   count16avx512, .-count16avx512
343 | 
344 | 	.balign 16
345 | 	.globl count32avx512
346 | 	.type   count32avx512, @function
347 | count32avx512:
348 | 	push	rbp
349 | 	mov	rbp, rsp
350 | 	push	rbx
351 | 	mov	rcx, rdx
352 | 	lea     rbx, [accum32+rip]
353 | 	shl	rcx, 2
354 | 	call    countavx512
355 | 	pop	rbx
356 | 	pop	rbp
357 | 	ret
358 | 	.size   count32avx512, .-count32avx512
359 | 
360 | 	.balign 16
361 | 	.globl count64avx512
362 | 	.type   count64avx512, @function
363 | count64avx512:
364 | 	push	rbp
365 | 	mov	rbp, rsp
366 | 	push	rbx
367 | 	mov	rcx, rdx
368 | 	lea     rbx, [accum64+rip]
369 | 	shl	rcx, 3
370 | 	call    countavx512
371 | 	pop	rbx
372 | 	pop	rbp
373 | 	ret
374 | 	.size   count64avx512, .-count64avx512
375 | 


--------------------------------------------------------------------------------
/asm/kernelneon.S:
--------------------------------------------------------------------------------
  1 | 	.section .rodata
  2 | 	.balign	16
  3 | magic:	.quad	0x8040201008040201
  4 | 	.quad	0, 0, -1, -1
  5 | 
  6 | 	.section .text
  7 | 	// X0: accumulation function
  8 | 	// X1: input buffer
  9 | 	// X2: counters
 10 | 	// X3: remaining length
 11 | countneon:
 12 | 	str	x30, [sp, #-16]!
 13 | 	adrp	x4, magic
 14 | 	add	x4, x4, #:lo12:magic
 15 | 	ld1r	{v28.2d}, [x4], #8
 16 | 	movi	v30.8b, #0x1
 17 | 	movi	v29.16b, #0x2
 18 | 	add	v29.16b, v29.16b, v30.16b
 19 | 	movi	v8.16b, #0x0
 20 | 	movi	v10.16b, #0x0
 21 | 	movi	v12.16b, #0x0
 22 | 	movi	v14.16b, #0x0
 23 | 	cmp	x3, #0xf0
 24 | 	blt	.Lrunt
 25 | 	and	x6, x1, #0xf
 26 | 	and	x1, x1, #0xfffffffffffffff0
 27 | 	sub	x5, x6, #0x10
 28 | 	add	x3, x3, x6
 29 | 	neg	x5, x5
 30 | 	ld1	{v3.16b}, [x1], #16
 31 | 	ldr	q5, [x4, x5]
 32 | 	and	v0.16b, v3.16b, v5.16b
 33 | 	ld1	{v1.16b, v2.16b}, [x1], #32
 34 | 	ld1	{v3.16b-v6.16b}, [x1], #64
 35 | 	ld1	{v16.16b-v19.16b}, [x1], #64
 36 | 	eor	v31.16b, v0.16b, v1.16b
 37 | 	eor	v0.16b, v31.16b, v2.16b
 38 | 	bit	v1.16b, v2.16b, v31.16b
 39 | 	movi	v27.16b, #0x55
 40 | 	eor	v2.16b, v3.16b, v0.16b
 41 | 	eor	v0.16b, v4.16b, v2.16b
 42 | 	bsl	v2.16b, v4.16b, v3.16b
 43 | 	movi	v26.16b, #0x33
 44 | 	eor	v3.16b, v5.16b, v0.16b
 45 | 	eor	v0.16b, v6.16b, v3.16b
 46 | 	bsl	v3.16b, v6.16b, v5.16b
 47 | 	ld1	{v4.16b-v7.16b}, [x1], #64
 48 | 	eor	v31.16b, v1.16b, v2.16b
 49 | 	eor	v1.16b, v31.16b, v3.16b
 50 | 	bit	v2.16b, v3.16b, v31.16b
 51 | 	eor	v31.16b, v0.16b, v16.16b
 52 | 	eor	v0.16b, v31.16b, v17.16b
 53 | 	bit	v16.16b, v17.16b, v31.16b
 54 | 	movi	v25.16b, #0xf
 55 | 	eor	v31.16b, v0.16b, v18.16b
 56 | 	eor	v0.16b, v31.16b, v19.16b
 57 | 	bit	v18.16b, v19.16b, v31.16b
 58 | 	mov	x6, #65535
 59 | 	eor	v3.16b, v16.16b, v1.16b
 60 | 	eor	v1.16b, v18.16b, v3.16b
 61 | 	bsl	v3.16b, v18.16b, v16.16b
 62 | 	movi	v9.16b, #0x0
 63 | 	eor	v31.16b, v0.16b, v4.16b
 64 | 	eor	v0.16b, v31.16b, v5.16b
 65 | 	bit	v4.16b, v5.16b, v31.16b
 66 | 	movi	v11.16b, #0x0
 67 | 	eor	v31.16b, v0.16b, v6.16b
 68 | 	eor	v0.16b, v31.16b, v7.16b
 69 | 	bit	v6.16b, v7.16b, v31.16b
 70 | 	movi	v13.16b, #0x0
 71 | 	eor	v31.16b, v1.16b, v4.16b
 72 | 	eor	v1.16b, v31.16b, v6.16b
 73 | 	bit	v4.16b, v6.16b, v31.16b
 74 | 	movi	v15.16b, #0x0
 75 | 	eor	v31.16b, v2.16b, v3.16b
 76 | 	eor	v2.16b, v31.16b, v4.16b
 77 | 	bit	v3.16b, v4.16b, v31.16b
 78 | 	subs	x3, x3, #0x1f0
 79 | 	blt	.Lpost
 80 | 
 81 | .Lvec:	ld1	{v4.16b-v7.16b}, [x1], #64
 82 | 	ld1	{v16.16b-v19.16b}, [x1], #64
 83 | 	ld1	{v20.16b-v23.16b}, [x1], #64
 84 | 	eor	v31.16b, v4.16b, v5.16b
 85 | 	eor	v4.16b, v31.16b, v6.16b
 86 | 	bit	v5.16b, v6.16b, v31.16b
 87 | 	eor	v31.16b, v0.16b, v17.16b
 88 | 	eor	v0.16b, v31.16b, v19.16b
 89 | 	bit	v17.16b, v19.16b, v31.16b
 90 | 	eor	v31.16b, v7.16b, v16.16b
 91 | 	eor	v7.16b, v31.16b, v18.16b
 92 | 	bit	v16.16b, v18.16b, v31.16b
 93 | 	eor	v31.16b, v21.16b, v22.16b
 94 | 	eor	v21.16b, v31.16b, v20.16b
 95 | 	bit	v22.16b, v20.16b, v31.16b
 96 | 	eor	v31.16b, v1.16b, v5.16b
 97 | 	eor	v1.16b, v31.16b, v17.16b
 98 | 	bit	v5.16b, v17.16b, v31.16b
 99 | 	ld1	{v17.16b-v20.16b}, [x1], #64
100 | 	eor	v31.16b, v0.16b, v4.16b
101 | 	eor	v0.16b, v31.16b, v7.16b
102 | 	bit	v4.16b, v7.16b, v31.16b
103 | 	eor	v31.16b, v17.16b, v18.16b
104 | 	eor	v17.16b, v31.16b, v23.16b
105 | 	bit	v18.16b, v23.16b, v31.16b
106 | 	eor	v31.16b, v19.16b, v20.16b
107 | 	eor	v19.16b, v31.16b, v21.16b
108 | 	bit	v20.16b, v21.16b, v31.16b
109 | 	eor	v31.16b, v16.16b, v18.16b
110 | 	eor	v16.16b, v31.16b, v22.16b
111 | 	bit	v18.16b, v22.16b, v31.16b
112 | 	eor	v31.16b, v1.16b, v4.16b
113 | 	eor	v1.16b, v31.16b, v20.16b
114 | 	bit	v4.16b, v20.16b, v31.16b
115 | 	eor	v31.16b, v0.16b, v17.16b
116 | 	eor	v0.16b, v31.16b, v19.16b
117 | 	bit	v17.16b, v19.16b, v31.16b
118 | 	eor	v31.16b, v2.16b, v5.16b
119 | 	eor	v2.16b, v31.16b, v18.16b
120 | 	bit	v5.16b, v18.16b, v31.16b
121 | 	eor	v31.16b, v1.16b, v16.16b
122 | 	eor	v1.16b, v31.16b, v17.16b
123 | 	bit	v16.16b, v17.16b, v31.16b
124 | 	eor	v31.16b, v2.16b, v4.16b
125 | 	eor	v2.16b, v31.16b, v16.16b
126 | 	bit	v4.16b, v16.16b, v31.16b
127 | 	eor	v31.16b, v3.16b, v4.16b
128 | 	eor	v3.16b, v31.16b, v5.16b
129 | 	bit	v4.16b, v5.16b, v31.16b
130 | 	and	v5.16b, v4.16b, v27.16b
131 | 	bic	v6.16b, v4.16b, v27.16b
132 | 	ushr	v6.16b, v6.16b, #1
133 | 	zip1	v4.2d, v5.2d, v6.2d
134 | 	zip2	v5.2d, v5.2d, v6.2d
135 | 	add	v4.16b, v4.16b, v5.16b
136 | 	and	v5.16b, v4.16b, v26.16b
137 | 	bic	v6.16b, v4.16b, v26.16b
138 | 	ushr	v6.16b, v6.16b, #2
139 | 	and	v4.16b, v5.16b, v25.16b
140 | 	bic	v5.16b, v5.16b, v25.16b
141 | 	bic	v7.16b, v6.16b, v25.16b
142 | 	and	v6.16b, v6.16b, v25.16b
143 | 	shl	v4.16b, v4.16b, #4
144 | 	shl	v6.16b, v6.16b, #4
145 | 	zip1	v16.16b, v4.16b, v6.16b
146 | 	zip2	v17.16b, v4.16b, v6.16b
147 | 	zip1	v18.16b, v5.16b, v7.16b
148 | 	zip2	v19.16b, v5.16b, v7.16b
149 | 	zip1	v4.16b, v16.16b, v17.16b
150 | 	zip2	v5.16b, v16.16b, v17.16b
151 | 	zip1	v6.16b, v18.16b, v19.16b
152 | 	zip2	v7.16b, v18.16b, v19.16b
153 | 	zip1	v16.4s, v4.4s, v6.4s
154 | 	zip2	v17.4s, v4.4s, v6.4s
155 | 	zip1	v18.4s, v5.4s, v7.4s
156 | 	zip2	v19.4s, v5.4s, v7.4s
157 | 	uaddw	v8.8h, v8.8h, v16.8b
158 | 	uaddw2	v9.8h, v9.8h, v16.16b
159 | 	uaddw	v10.8h, v10.8h, v17.8b
160 | 	uaddw2	v11.8h, v11.8h, v17.16b
161 | 	uaddw	v12.8h, v12.8h, v18.8b
162 | 	uaddw2	v13.8h, v13.8h, v18.16b
163 | 	uaddw	v14.8h, v14.8h, v19.8b
164 | 	uaddw2	v15.8h, v15.8h, v19.16b
165 | 	sub	x6, x6, #0x1e
166 | 	cmp	x6, #0x5c
167 | 	bge	.Lhave_space
168 | 
169 | 	blr	x0
170 | 	movi	v8.16b, #0x0
171 | 	movi	v9.16b, #0x0
172 | 	movi	v10.16b, #0x0
173 | 	movi	v11.16b, #0x0
174 | 	movi	v12.16b, #0x0
175 | 	movi	v13.16b, #0x0
176 | 	movi	v14.16b, #0x0
177 | 	movi	v15.16b, #0x0
178 | 	mov	x6, #65535
179 | 
180 | .Lhave_space:
181 | 	subs	x3, x3, #0x100
182 | 	bge	.Lvec
183 | 
184 | .Lpost:	ushr	v4.16b, v0.16b, #1
185 | 	add	v5.16b, v1.16b, v1.16b
186 | 	ushr	v6.16b, v2.16b, #1
187 | 	add	v7.16b, v3.16b, v3.16b
188 | 	bif	v0.16b, v5.16b, v27.16b
189 | 	bit	v1.16b, v4.16b, v27.16b
190 | 	bif	v2.16b, v7.16b, v27.16b
191 | 	bit	v3.16b, v6.16b, v27.16b
192 | 	ushr	v4.16b, v0.16b, #2
193 | 	ushr	v6.16b, v1.16b, #2
194 | 	shl	v5.16b, v2.16b, #2
195 | 	shl	v7.16b, v3.16b, #2
196 | 	bit	v2.16b, v4.16b, v26.16b
197 | 	bit	v3.16b, v6.16b, v26.16b
198 | 	bif	v0.16b, v5.16b, v26.16b
199 | 	bif	v1.16b, v7.16b, v26.16b
200 | 	zip1	v6.16b, v2.16b, v3.16b
201 | 	zip2	v3.16b, v2.16b, v3.16b
202 | 	zip1	v5.16b, v0.16b, v1.16b
203 | 	zip2	v2.16b, v0.16b, v1.16b
204 | 	zip1	v4.8h, v5.8h, v6.8h
205 | 	zip2	v5.8h, v5.8h, v6.8h
206 | 	zip1	v6.8h, v2.8h, v3.8h
207 | 	zip2	v7.8h, v2.8h, v3.8h
208 | 	and	v0.16b, v25.16b, v4.16b
209 | 	ushr	v4.16b, v4.16b, #4
210 | 	and	v1.16b, v25.16b, v5.16b
211 | 	ushr	v5.16b, v5.16b, #4
212 | 	and	v2.16b, v25.16b, v6.16b
213 | 	add	v0.16b, v2.16b, v0.16b
214 | 	usra	v4.16b, v6.16b, #4
215 | 	and	v3.16b, v25.16b, v7.16b
216 | 	add	v1.16b, v3.16b, v1.16b
217 | 	usra	v5.16b, v7.16b, #4
218 | 	zip1	v2.4s, v0.4s, v4.4s
219 | 	zip2	v3.4s, v0.4s, v4.4s
220 | 	zip1	v6.4s, v1.4s, v5.4s
221 | 	zip2	v7.4s, v1.4s, v5.4s
222 | 	uaddw	v8.8h, v8.8h, v2.8b
223 | 	uaddw2	v9.8h, v9.8h, v2.16b
224 | 	uaddw	v10.8h, v10.8h, v3.8b
225 | 	uaddw2	v11.8h, v11.8h, v3.16b
226 | 	uaddw	v12.8h, v12.8h, v6.8b
227 | 	uaddw2	v13.8h, v13.8h, v6.16b
228 | 	uaddw	v14.8h, v14.8h, v7.8b
229 | 	uaddw2	v15.8h, v15.8h, v7.16b
230 | 
231 | .Lendvec:
232 | 	movi	v0.16b, #0x0
233 | 	movi	v1.16b, #0x0
234 | 	movi	v2.16b, #0x0
235 | 	movi	v3.16b, #0x0
236 | 	adds	x3, x3, #0xf8
237 | 	blt	.Ltail1
238 | 
239 | .Ltail8:
240 | 	subs	x3, x3, #0x8
241 | 	ldr	s6, [x1], #4
242 | 	ldr	s7, [x1], #4
243 | 	tbl	v4.16b, {v6.16b}, v30.16b
244 | 	tbl	v5.16b, {v6.16b}, v29.16b
245 | 	cmtst	v4.16b, v4.16b, v28.16b
246 | 	cmtst	v5.16b, v5.16b, v28.16b
247 | 	sub	v0.16b, v0.16b, v4.16b
248 | 	sub	v1.16b, v1.16b, v5.16b
249 | 	tbl	v4.16b, {v7.16b}, v30.16b
250 | 	tbl	v5.16b, {v7.16b}, v29.16b
251 | 	cmtst	v4.16b, v4.16b, v28.16b
252 | 	cmtst	v5.16b, v5.16b, v28.16b
253 | 	sub	v2.16b, v2.16b, v4.16b
254 | 	sub	v3.16b, v3.16b, v5.16b
255 | 	bge	.Ltail8
256 | 
257 | .Ltail1:
258 | 	adds	x3, x3, #0x8
259 | 	ble	.Lend
260 | 
261 | 	ldr	d6, [x1]
262 | 	sub	x6, x4, x3
263 | 	ldr	q5, [x6, #16]
264 | 	bic	v6.16b, v6.16b, v5.16b
265 | 	ext	v7.16b, v6.16b, v6.16b, #4
266 | 	tbl	v4.16b, {v6.16b}, v30.16b
267 | 	tbl	v5.16b, {v6.16b}, v29.16b
268 | 	cmtst	v4.16b, v4.16b, v28.16b
269 | 	cmtst	v5.16b, v5.16b, v28.16b
270 | 	sub	v0.16b, v0.16b, v4.16b
271 | 	sub	v1.16b, v1.16b, v5.16b
272 | 	tbl	v4.16b, {v7.16b}, v30.16b
273 | 	tbl	v5.16b, {v7.16b}, v29.16b
274 | 	cmtst	v4.16b, v4.16b, v28.16b
275 | 	cmtst	v5.16b, v5.16b, v28.16b
276 | 	sub	v2.16b, v2.16b, v4.16b
277 | 	sub	v3.16b, v3.16b, v5.16b
278 | 
279 | .Lend:	uaddw	v9.8h, v9.8h, v0.8b
280 | 	uaddw2	v8.8h, v8.8h, v0.16b
281 | 	uaddw	v11.8h, v11.8h, v1.8b
282 | 	uaddw2	v10.8h, v10.8h, v1.16b
283 | 	uaddw	v13.8h, v13.8h, v2.8b
284 | 	uaddw2	v12.8h, v12.8h, v2.16b
285 | 	uaddw	v15.8h, v15.8h, v3.8b
286 | 	uaddw2	v14.8h, v14.8h, v3.16b
287 | 	blr	x0
288 | 	ldr	x30, [sp], #16
289 | 	ret
290 | 
291 | .Lrunt:
292 | 	subs	x3, x3, #0x8
293 | 	blt	.Lrunt1
294 | 
295 | .Lrunt8:
296 | 	subs	x3, x3, #0x8
297 | 	ldr	s6, [x1], #4
298 | 	ldr	s7, [x1], #4
299 | 	tbl	v4.16b, {v6.16b}, v30.16b
300 | 	tbl	v5.16b, {v6.16b}, v29.16b
301 | 	cmtst	v4.16b, v4.16b, v28.16b
302 | 	cmtst	v5.16b, v5.16b, v28.16b
303 | 	sub	v8.16b, v8.16b, v4.16b
304 | 	sub	v10.16b, v10.16b, v5.16b
305 | 	tbl	v4.16b, {v7.16b}, v30.16b
306 | 	tbl	v5.16b, {v7.16b}, v29.16b
307 | 	cmtst	v4.16b, v4.16b, v28.16b
308 | 	cmtst	v5.16b, v5.16b, v28.16b
309 | 	sub	v12.16b, v12.16b, v4.16b
310 | 	sub	v14.16b, v14.16b, v5.16b
311 | 	bge	.Lrunt8
312 | 
313 | .Lrunt1:
314 | 	adds	x3, x3, #0x8
315 | 	ble	.Lrunt_accum
316 | 
317 | 	and	x5, x1, #0x7
318 | 	add	x8, x3, x5
319 | 	lsl	x3, x3, #3
320 | 	mov	x7, #0xffffffffffffffff    	// #-1
321 | 	lsl	x7, x7, x3
322 | 	cmp	x8, #0x8
323 | 	bgt	.Lcrossrunt1
324 | 
325 | 	and	x1, x1, #0xfffffffffffffff8
326 | 	ldr	x6, [x1]
327 | 	lsl	x5, x5, #3
328 | 	lsr	x6, x6, x5
329 | 	b	.Ldorunt1
330 | 
331 | .Lcrossrunt1:
332 | 	ldr	x6, [x1]
333 | 
334 | .Ldorunt1:
335 | 	bic	x6, x6, x7
336 | 	fmov	d6, x6
337 | 	ext	v7.16b, v6.16b, v6.16b, #4
338 | 	tbl	v4.16b, {v6.16b}, v30.16b
339 | 	tbl	v5.16b, {v6.16b}, v29.16b
340 | 	cmtst	v4.16b, v4.16b, v28.16b
341 | 	cmtst	v5.16b, v5.16b, v28.16b
342 | 	sub	v8.16b, v8.16b, v4.16b
343 | 	sub	v10.16b, v10.16b, v5.16b
344 | 	tbl	v4.16b, {v7.16b}, v30.16b
345 | 	tbl	v5.16b, {v7.16b}, v29.16b
346 | 	cmtst	v4.16b, v4.16b, v28.16b
347 | 	cmtst	v5.16b, v5.16b, v28.16b
348 | 	sub	v12.16b, v12.16b, v4.16b
349 | 	sub	v14.16b, v14.16b, v5.16b
350 | 
351 | .Lrunt_accum:
352 | 	uxtl	v9.8h, v8.8b
353 | 	uxtl2	v8.8h, v8.16b
354 | 	uxtl	v11.8h, v10.8b
355 | 	uxtl2	v10.8h, v10.16b
356 | 	uxtl	v13.8h, v12.8b
357 | 	uxtl2	v12.8h, v12.16b
358 | 	uxtl	v15.8h, v14.8b
359 | 	uxtl2	v14.8h, v14.16b
360 | 	blr	x0
361 | 	ldr	x30, [sp], #16
362 | 	ret
363 | 
364 | 	.globl	count8neon
365 | 	.type	count8neon, %function
366 | 	// extern void count8neon(flags_type *flags, const uint8_t *data, uint32_t len);
367 | count8neon:
368 | 	sub	x3, sp, #4*16
369 | 	str	x30, [sp, #-5*16]!
370 | 	st1	{v8.1d-v11.1d}, [x3], #32
371 | 	st1	{v12.1d-v15.1d}, [x3], #32
372 | 	mov	w3, w2
373 | 	mov	x2, x0
374 | 	adr	x0, accum8
375 | 	bl	countneon
376 | 	ldr	x30, [sp], #16
377 | 	ld1	{v8.1d-v11.1d}, [sp], #32
378 | 	ld1	{v12.1d-v15.1d}, [sp], #32
379 | 	ret
380 | 
381 | 	.globl	count16neon
382 | 	.type	count16neon, %function
383 | 	// extern void count16neon(flags_type *flags, const uint16_t *data, uint32_t len);
384 | count16neon:
385 | 	sub	x3, sp, #4*16
386 | 	str	x30, [sp, #-5*16]!
387 | 	st1	{v8.1d-v11.1d}, [x3], #32
388 | 	st1	{v12.1d-v15.1d}, [x3], #32
389 | 	ubfiz	x3, x2, #1, #32
390 | 	mov	x2, x0
391 | 	adr	x0, accum16
392 | 	bl	countneon
393 | 	ldr	x30, [sp], #16
394 | 	ld1	{v8.1d-v11.1d}, [sp], #32
395 | 	ld1	{v12.1d-v15.1d}, [sp], #32
396 | 	ret
397 | 
398 | 	.globl	count32neon
399 | 	.type	count32neon, %function
400 | 	// extern void count32neon(flags_type *flags, const uint32_t *data, uint32_t len);
401 | count32neon:
402 | 	sub	x3, sp, #4*16
403 | 	str	x30, [sp, #-5*16]!
404 | 	st1	{v8.1d-v11.1d}, [x3], #32
405 | 	st1	{v12.1d-v15.1d}, [x3], #32
406 | 	ubfiz	x3, x2, #2, #32
407 | 	mov	x2, x0
408 | 	adr	x0, accum32
409 | 	bl	countneon
410 | 	ldr	x30, [sp], #16
411 | 	ld1	{v8.1d-v11.1d}, [sp], #32
412 | 	ld1	{v12.1d-v15.1d}, [sp], #32
413 | 	ret
414 | 
415 | 	.globl	count64neon
416 | 	.type	count64neon, %function
417 | 	// extern void count64neon(flags_type *flags, const uint64_t *data, uint32_t len);
418 | count64neon:
419 | 	sub	x3, sp, #4*16
420 | 	str	x30, [sp, #-5*16]!
421 | 	st1	{v8.1d-v11.1d}, [x3], #32
422 | 	st1	{v12.1d-v15.1d}, [x3], #32
423 | 	ubfiz	x3, x2, #3, #32
424 | 	mov	x2, x0
425 | 	adr	x0, accum64
426 | 	bl	countneon
427 | 	ldr	x30, [sp], #16
428 | 	ld1	{v8.1d-v11.1d}, [sp], #32
429 | 	ld1	{v12.1d-v15.1d}, [sp], #32
430 | 	ret
431 | 


--------------------------------------------------------------------------------
/benchmark/golike/benchmark.c:
--------------------------------------------------------------------------------
  1 | #define _GNU_SOURCE
  2 | #include <ctype.h>
  3 | #include <math.h>
  4 | #include <stdint.h>
  5 | #include <stdlib.h>
  6 | #include <stdio.h>
  7 | #include <string.h>
  8 | #include <time.h>
  9 | #include <unistd.h>
 10 | 
 11 | #include <linux/perf_event.h>
 12 | #include <linux/hw_breakpoint.h>
 13 | #include <sys/syscall.h>
 14 | 
 15 | #if FLAGSIZE == 64
 16 | typedef uint64_t flags_type;
 17 | #else
 18 | typedef uint32_t flags_type;
 19 | #endif
 20 | 
 21 | #ifndef TIME_GOAL
 22 | /* run each benchmark for at least this many seconds */
 23 | #define TIME_GOAL 2.0
 24 | #endif
 25 | 
 26 | #ifdef ALIGN
 27 | #define memory_allocate(size) aligned_alloc(64, ((size) + 63) & ~(size_t)63)
 28 | #else
 29 | #define memory_allocate(size) malloc(size)
 30 | #endif
 31 | 
 32 | #ifdef __x86_64__
 33 | #include "pospopcnt_avx512bw.h"
 34 | #endif
 35 | #include "pospopcnt.h"
 36 | 
 37 | typedef void pospopcnt_u16(const uint16_t *data, uint32_t len, flags_type *flags);
 38 | static pospopcnt_u16 pospopcnt_u16_dummy;
 39 | 
 40 | /* a "do nothing" implementation to measure the memory overhead */
 41 | static void pospopcnt_u16_dummy(const uint16_t *data, uint32_t len, flags_type *flags)
 42 | {
 43 | 	volatile uint16_t sink;
 44 | 	uint16_t sum = 0;
 45 | 
 46 | 	for (uint32_t i = 0; i < len; i++)
 47 | 		sum += data[i];
 48 | 
 49 | 	sink = sum;
 50 | 
 51 | 	for (uint32_t i = 0; i < 16; i++)
 52 | 		flags[i] += sum;
 53 | }
 54 | 
 55 | #ifdef __x86_64__
 56 | extern void count16avx512(flags_type *flags, const uint16_t *data, uint32_t len);
 57 | extern void count16avx2(flags_type *flags, const uint16_t *data, uint32_t len);
 58 | 
 59 | static void
 60 | pospopcnt_count16avx512(const uint16_t *data, uint32_t len, flags_type *flags)
 61 | {
 62 | 	count16avx512(flags, data, len);
 63 | }
 64 | 
 65 | static void
 66 | pospopcnt_count16avx2(const uint16_t *data, uint32_t len, flags_type *flags)
 67 | {
 68 | 	count16avx2(flags, data, len);
 69 | }
 70 | #endif
 71 | 
 72 | #ifdef __aarch64__
 73 | extern void count16neon(flags_type *flags, const uint16_t *data, uint32_t len);
 74 | static void
 75 | pospopcnt_count16neon(const uint16_t *data, uint32_t len, flags_type *flags)
 76 | {
 77 | 	count16neon(flags, data, len);
 78 | }
 79 | #endif
 80 | 
 81 | static const struct pospopcnt_u16_method {
 82 | 	const char *const name;
 83 | 	pospopcnt_u16 *const method;
 84 | } methods[] = {
 85 | 	{ "overhead", pospopcnt_u16_dummy },
 86 | 	{ "scalar", pospopcnt_u16_scalar },
 87 | #ifdef __x86_64__
 88 | 	{ "avx512bw_harvey_seal_1KB", pospopcnt_u16_avx512bw_harvey_seal_1KB },
 89 | 	{ "avx512bw_harvey_seal_512B", pospopcnt_u16_avx512bw_harvey_seal_512B },
 90 | 	{ "avx512bw_harvey_seal_256B", pospopcnt_u16_avx512bw_harvey_seal_256B },
 91 | 	{ "count16avx512", pospopcnt_count16avx512 },
 92 | 	{ "count16avx2", pospopcnt_count16avx2 },
 93 | #endif
 94 | #ifdef __aarch64__
 95 | 	{ "count16neon", pospopcnt_count16neon },
 96 | #endif
 97 | 	{ NULL, NULL },
 98 | };
 99 | 
100 | static int event_group = -1, num_counters = 0;
101 | enum { EVENT_COUNT = 2 };
102 | static struct event {
103 | 	uint32_t type;
104 | 	int fd;
105 | 	uint64_t conf;
106 | } events[EVENT_COUNT] = {
107 | 	{ PERF_TYPE_HARDWARE, -1, PERF_COUNT_HW_CPU_CYCLES },
108 | 	{ PERF_TYPE_HARDWARE, -1, PERF_COUNT_HW_INSTRUCTIONS },
109 | };
110 | 
111 | /* set up performance counters so performance measurements can be taken */
112 | static void init_counters(void)
113 | {
114 | 	int i;
115 | 	struct perf_event_attr attribs;
116 | 
117 | 	memset(&attribs, 0, sizeof attribs);
118 | 	attribs.exclude_kernel = 1;
119 | 	attribs.exclude_hv = 1;
120 | 	attribs.sample_period = 0;
121 | 	attribs.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
122 | 
123 | 	for (i = 0; i < EVENT_COUNT; i++) {
124 | 		attribs.type = events[i].type;
125 | 		attribs.config = events[i].conf;
126 | 		events[i].fd = syscall(SYS_perf_event_open, &attribs, 0, -1, event_group, 0);
127 | 		if (events[i].fd == -1) {
128 | 			perror("perf_event_open");
129 | 			continue;
130 | 		}
131 | 
132 | 		num_counters++;
133 | 
134 | 		if (event_group == -1)
135 | 			event_group = events[i].fd;
136 | 	}
137 | }
138 | 
139 | /* state of performance counters at one point in time */
140 | struct counters {
141 | 	struct timespec ts;
142 | 	uint64_t counters[2 * EVENT_COUNT + 1];
143 | };
144 | 
145 | typedef int testfunc(struct counters *c, void *payload, size_t n, size_t m);
146 | 
147 | /* initialise hardware perf counters */
148 | 
149 | /* record the current CPU time and perf counter values in c */
150 | static int
151 | reset_counters(struct counters *c)
152 | {
153 | 	ssize_t count;
154 | 	int res;
155 | 
156 | 	res = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c->ts);
157 | 	if (res == -1) {
158 | 		perror("clock_gettime");
159 | 		return (-1);
160 | 	}
161 | 
162 | 	if (event_group != -1) {
163 | 		count = read(event_group, c->counters, (2 * num_counters + 1) * sizeof c->counters[0]);
164 | 		if (count < 0) {
165 | 			perror("read(event_group, ...)");
166 | 			return (-1);
167 | 		}
168 | 	}
169 | 
170 | 	return (0);
171 | }
172 | 
173 | /* compute the difference between two timespec as a double */
174 | static double
175 | tsdiff(struct timespec start, struct timespec end)
176 | {
177 | 	time_t sec;
178 | 	long nsec;
179 | 
180 | 	sec = end.tv_sec - start.tv_sec;
181 | 	nsec = end.tv_nsec - start.tv_nsec;
182 | 	if (nsec < 0) {
183 | 		sec--;
184 | 		nsec += 1000000000L;
185 | 	}
186 | 
187 | 	return (sec + nsec * 1.0e-9);
188 | }
189 | 
190 | /* compute the difference between the two counter vectors */
191 | static void
192 | counterdiff(uint64_t out[EVENT_COUNT], uint64_t start[], uint64_t end[])
193 | {
194 | 	int i, j;
195 | 
196 | 	for (i = 0, j = 0; i < EVENT_COUNT; i++) {
197 | 		if (events[i].fd == -1)
198 | 			out[i] = 0;
199 | 		else {
200 | 			out[i] = end[2*j+1] - start[2*j+1];
201 | 			j++;
202 | 		}
203 | 	}
204 | }
205 | 
206 | /* print test results */
207 | /* https://golang.org/design/14313-benchmark-format */
208 | static void
209 | print_results(
210 |     const char *name,
211 |     struct counters *start, struct counters *end,
212 |     size_t n, size_t m) {
213 | 	uint64_t counts[EVENT_COUNT];
214 | 	double elapsed;
215 | 
216 | 	elapsed = tsdiff(start->ts, end->ts);
217 | 	counterdiff(counts, start->counters, end->counters);
218 | 
219 | 	if (name == NULL || name[0] == '\0')
220 | 		name = " ";
221 | 
222 | 	printf("Benchmark%c%s/%zuB\t%10zu\t"
223 | 	    "%.8g ns/op\t%.8g MB/s",
224 | 	    toupper(name[0]), name+1, n, m,
225 | 	    (elapsed * 1e9) / m, (1e-6 * n * m) / elapsed);
226 | 
227 | 	if (num_counters == EVENT_COUNT) {
228 | 		printf("\t%.8g cy/B\t%.8g ins/B\t%.8g ipc\n",
229 | 		    counts[0]/((double)n * m), counts[1]/((double)n * m),
230 | 		    (double)counts[1] / counts[0]);
231 | 	} else
232 | 		putchar('\n');
233 | }
234 | 
235 | /* run one test case for the specified n and print the result */
236 | static void
237 | run_test(const char *name, testfunc *test, void *payload, size_t n)
238 | {
239 | 	struct counters start, end;
240 | 	size_t m = 1;
241 | 	int first_run = 1;
242 | 
243 | 	/* repeatedly run benchmark and adjust m until result is meaningful */
244 | 	for (;; first_run = 0) {
245 | 		double elapsed;
246 | 		size_t newm;
247 | 		int res;
248 | 
249 | 		/* printf("RUN %s: n = %zu m = %zu\n", name, n, m); */
250 | 
251 | 		res = reset_counters(&start);
252 | 		if (res != 0) {
253 | 			printf("FAIL\t%s\n", name);
254 | 			return;
255 | 		}
256 | 
257 | 		res = test(&start, payload, n, m);
258 | 		if (res != 0) {
259 | 			printf("FAIL\t%s\n", name);
260 | 			return;
261 | 		}
262 | 
263 | 		res = reset_counters(&end);
264 | 		if (res != 0) {
265 | 			printf("FAIL\t%s\n", name);
266 | 			return;
267 | 		}
268 | 
269 | 		elapsed = tsdiff(start.ts, end.ts);
270 | //		printf("m = %zu, elapsed = %f s\n", m, elapsed);
271 | 		if (elapsed < TIME_GOAL) {
272 | 			if (elapsed < TIME_GOAL * 0.5)
273 | 				m *= 2;
274 | 			else {
275 | 				/* try to overshoot the TIME_GOAL slightly */
276 | 				newm = ceil(m * TIME_GOAL * 1.05 / elapsed);
277 | 				m = newm > m ? newm : m + 1;
278 | 			}
279 | 
280 | 			continue;
281 | 		}
282 | 
283 | 		/* make sure to perform at least one warm-up iteration */
284 | 		if (!first_run)
285 | 			break;
286 | 	}
287 | 
288 | 	print_results(name, &start, &end, n, m);
289 | }
290 | 
291 | /* test the positional population count function payload with an n byte array */
292 | static int
293 | test_pospop_u16(struct counters *c, void *payload, size_t n, size_t m)
294 | {
295 | 	pospopcnt_u16 *pospop = (pospopcnt_u16 *)payload;
296 | 	uint16_t *data;
297 | 	size_t len = n / sizeof *data, i;
298 | 	long int num;
299 | 	int res;
300 | 	flags_type flags[16], accum = 0;
301 | 	volatile flags_type sum;
302 | 
303 | 	memset(flags, 0, sizeof flags);
304 | 	data = memory_allocate(len * sizeof *data);
305 | 	if (data == NULL) {
306 | 		perror("memory_allocate");
307 | 		return (-1);
308 | 	}
309 | 
310 | 	srand48(42);
311 | 	for (i = 0; i < len; i += 2) {
312 | 		num = mrand48();
313 | 		data[i] = num & 0xffff;
314 | 		if (i + 1 < len) data[i+1] = num >> 16 & 0xffff; /* len may be odd; stay in bounds */
315 | 	}
316 | 
317 | 	/* skip initialisation step in benchmark measurements */
318 | 	res = reset_counters(c);
319 | 	if (res != 0)
320 | 		return (-1);
321 | 
322 | 	for (i = 0; i < m; i++)
323 | 		pospop(data, len, flags);
324 | 
325 | 	/* make sure the result is used */
326 | 	accum = 0;
327 | 	for (i = 0; i < 16; i++)
328 | 		accum += flags[i];
329 | 	sum = accum;
330 | 
331 | 	free(data);
332 | 
333 | 	return (0);
334 | }
335 | 
336 | extern int
337 | main(int argc, char *argv[])
338 | {
339 | 	size_t n;
340 | 	int i, j;
341 | 	char *fake_argv[] = { argv[0], "100000", NULL };
342 | 
343 | 	setlinebuf(stdout);
344 | 	init_counters();
345 | 
346 | 	if (argc < 2) {
347 | 		argv = fake_argv;
348 | 		argc = 2;
349 | 	}
350 | 
351 | 	for (j = 1; j < argc; j++) {
352 | 		n = atoll(argv[j]);
353 | 		for (i = 0; methods[i].method != NULL; i++) {
354 | 			run_test(methods[i].name, test_pospop_u16, methods[i].method, n);
355 | 		}
356 | 	}
357 | }
358 | 
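
The calibration loop in run_test grows the repetition count m until one run lasts at least TIME_GOAL. The rule it applies can be sketched in isolation as below. The helper name next_repetitions and its signature are hypothetical and not part of the repository; only the doubling and 5% overshoot rule come from the code above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper mirroring the scaling rule in run_test: returns the
// repetition count to try next, or m unchanged once the goal is reached.
static std::size_t next_repetitions(std::size_t m, double elapsed,
                                    double time_goal) {
  if (elapsed >= time_goal)
    return m; // goal reached, keep the current count
  if (elapsed < time_goal * 0.5)
    return m * 2; // far below the goal: double
  // Close to the goal: extrapolate linearly and overshoot by 5%.
  const std::size_t newm =
      static_cast<std::size_t>(std::ceil(m * time_goal * 1.05 / elapsed));
  return newm > m ? newm : m + 1; // always make progress
}
```

With a goal of one second, a run of 10 repetitions that took 0.1 s is retried with 20 repetitions, while one that took 0.8 s is retried with 14.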


--------------------------------------------------------------------------------
/benchmark/linux/aligned_alloc.h:
--------------------------------------------------------------------------------
 1 | // functions borrowed from
 2 | // https://github.com/lemire/simdjson/blob/master/include/simdjson/portability.h
 3 | #pragma once
 4 | 
 5 | #ifdef __cplusplus
 6 | #include <cstddef>
 7 | #include <cstdlib>
 8 | #else
 9 | #include <stddef.h>
10 | #include <stdlib.h>
11 | #endif
12 | 
13 | template <class T> static inline int get_alignment(T *data) {
14 |   uintptr_t addr = reinterpret_cast<uintptr_t>(data);
15 |   return addr & ~(addr - 1);
16 | }
17 | 
18 | // portable version of  posix_memalign
19 | static inline void *aligned_malloc(size_t alignment, size_t size) {
20 |   void *p;
21 | #ifdef _MSC_VER
22 |   p = _aligned_malloc(size, alignment);
23 | #elif defined(__MINGW32__) || defined(__MINGW64__)
24 |   p = __mingw_aligned_malloc(size, alignment);
25 | #else
26 |   // somehow, if this is used before including "x86intrin.h", it creates an
27 |   // implicit defined warning.
28 |   if (posix_memalign(&p, alignment, size) != 0)
29 |     return NULL;
30 | #endif
31 |   return p;
32 | }
33 | 
34 | static inline void aligned_free(void *memblock) {
35 |   if (memblock == NULL)
36 |     return;
37 | #ifdef _MSC_VER
38 |   _aligned_free(memblock);
39 | #elif defined(__MINGW32__) || defined(__MINGW64__)
40 |   __mingw_aligned_free(memblock);
41 | #else
42 |   free(memblock);
43 | #endif
44 | }
45 | 
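
get_alignment relies on the expression addr & ~(addr - 1), which isolates the lowest set bit of the address, that is, the largest power of two dividing it. A minimal sketch of that arithmetic on plain integers follows; the name lowest_set_bit is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that divides addr; yields 0 when addr is 0.
// Same expression as get_alignment, applied to an integer instead of
// a pointer.
static inline std::uintptr_t lowest_set_bit(std::uintptr_t addr) {
  return addr & ~(addr - 1);
}
```

Note that an address such as 128 yields 128 rather than 64, so an equality test against 64 rejects buffers that are aligned more strictly than requested.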


--------------------------------------------------------------------------------
/benchmark/linux/instrumented_benchmark.cpp:
--------------------------------------------------------------------------------
  1 | #include <cinttypes>
  2 | #if FLAGSIZE == 64
  3 | typedef uint64_t flags_type;
  4 | #else
  5 | typedef uint32_t flags_type;
  6 | #endif
  7 | 
  8 | #ifdef __linux__
  9 | #include <cassert>
 10 | #include <cstdio>
 11 | #include <cstdlib>
 12 | #include <cstring>
 13 | #include <unistd.h> // getopt
 14 | #include <iomanip>
 15 | #include <iostream>
 16 | #include <algorithm>
 17 | #include <chrono>
 18 | #include <libgen.h>
 19 | #include <random>
 20 | #include <string>
 21 | #include <vector>
 22 | 
 23 | #include "linux-perf-events.h"
 24 | #include "aligned_alloc.h"
 25 | #ifdef __x86_64__
 26 | #include "pospopcnt_avx512bw.h"
 27 | #endif
 28 | #include "pospopcnt.h"
 29 | #ifdef ALIGN
 30 | #include "memalloc.h"
 31 | #define memory_allocate(size) aligned_alloc(64, (size))
 32 | #else
 33 | #define memory_allocate(size) malloc(size)
 34 | #endif
 35 | 
 36 | // command line options
 37 | enum {
 38 |   OPT_VERBOSE    = 1 << 0,
 39 |   OPT_TEST       = 1 << 1,
 40 |   OPT_COMPENSATE = 1 << 2,
 41 |   OPT_TOUCH      = 1 << 3,
 42 |   OPT_FORCE      = 1 << 4,
 43 | };
 44 | 
 45 | // Function pointer definition.
 46 | typedef void (*pospopcnt_u16_method_type)(const uint16_t *data, uint32_t len,
 47 |                                           flags_type *flags);
 48 | 
 49 | #ifdef __x86_64__
 50 | extern "C" void count16avx512(flags_type flags[16], const uint16_t *buf, size_t len);
 51 | extern "C" void count16avx2(flags_type flags[16], const uint16_t *buf, size_t len);
 52 | 
 53 | static void pospopcnt_count16avx512(const uint16_t *data, uint32_t len, flags_type *flags)
 54 | {
 55 | 	count16avx512(flags, data, len);
 56 | }
 57 | 
 58 | static void pospopcnt_count16avx2(const uint16_t *data, uint32_t len, flags_type *flags)
 59 | {
 60 | 	count16avx2(flags, data, len);
 61 | }
 62 | #endif
 63 | 
 64 | #ifdef __aarch64__
 65 | extern "C" void count16neon(flags_type flags[16], const uint16_t *buf, size_t len);
 66 | 
 67 | static void pospopcnt_count16neon(const uint16_t *data, uint32_t len, flags_type *flags)
 68 | {
 69 | 	count16neon(flags, data, len);
 70 | }
 71 | #endif
 72 | 
 73 | // dummy for taking the overhead
 74 | static void pospopcnt_dummy(const uint16_t *data, uint32_t len, flags_type *flags)
 75 | {
 76 | 	(void)data;
 77 | 	(void)len;
 78 | 	(void)flags;
 79 | }
 80 | 
 81 | static const struct {
 82 |   pospopcnt_u16_method_type method;
 83 |   const char *name;
 84 | } methods[] = {
 85 |   { pospopcnt_u16_scalar, "pospopcnt_u16_scalar" },
 86 | #ifdef __x86_64__
 87 |   { pospopcnt_u16_avx512bw_harvey_seal_1KB, "pospopcnt_u16_avx512bw_harvey_seal_1KB" },
 88 |   { pospopcnt_u16_avx512bw_harvey_seal_512B, "pospopcnt_u16_avx512bw_harvey_seal_512B" },
 89 |   { pospopcnt_u16_avx512bw_harvey_seal_256B, "pospopcnt_u16_avx512bw_harvey_seal_256B" },
 90 |   { pospopcnt_count16avx512, "pospopcnt_count16avx512" },
 91 |   { pospopcnt_count16avx2, "pospopcnt_count16avx2" },
 92 | #endif
 93 | #ifdef __aarch64__
 94 |   { pospopcnt_count16neon, "pospopcnt_count16neon" },
 95 | #endif
 96 |   { NULL, NULL },
 97 | };
 98 | 
 99 | void print16(flags_type *flags) {
100 |   for (int k = 0; k < 16; k++)
101 |     printf(" %8ju ", (uintmax_t)flags[k]);
102 |   printf("\n");
103 | }
104 | 
105 | std::vector<unsigned long long>
106 | compute_mins(std::vector<std::vector<unsigned long long> > allresults) {
107 |   if (allresults.size() == 0)
108 |     return std::vector<unsigned long long>();
109 | 
110 |   std::vector<unsigned long long> answer = allresults[0];
111 | 
112 |   for (size_t k = 1; k < allresults.size(); k++) {
113 |     assert(allresults[k].size() == answer.size());
114 |     for (size_t z = 0; z < answer.size(); z++) {
115 |       if (allresults[k][z] < answer[z])
116 |         answer[z] = allresults[k][z];
117 |     }
118 |   }
119 |   return answer;
120 | }
121 | 
122 | std::vector<double>
123 | compute_averages(std::vector<std::vector<unsigned long long> > allresults) {
124 |   if (allresults.size() == 0)
125 |     return std::vector<double>();
126 | 
127 |   std::vector<double> answer(allresults[0].size());
128 | 
129 |   for (size_t k = 0; k < allresults.size(); k++) {
130 |     assert(allresults[k].size() == answer.size());
131 |     for (size_t z = 0; z < answer.size(); z++) {
132 |       answer[z] += allresults[k][z];
133 |     }
134 |   }
135 | 
136 |   for (size_t z = 0; z < answer.size(); z++) {
137 |     answer[z] /= allresults.size();
138 |   }
139 |   return answer;
140 | }
141 | 
142 | class BenchmarkState {
143 |   std::vector<int> evts = {
144 |     PERF_COUNT_HW_CPU_CYCLES,
145 |     PERF_COUNT_HW_INSTRUCTIONS,
146 |     PERF_COUNT_HW_BRANCH_MISSES,
147 |     PERF_COUNT_HW_CACHE_REFERENCES,
148 |     PERF_COUNT_HW_CACHE_MISSES,
149 |     PERF_COUNT_HW_REF_CPU_CYCLES
150 |   };
151 | 
152 |   BenchmarkState *overhead = nullptr;
153 |   LinuxEvents<PERF_TYPE_HARDWARE> unified;
154 |   std::vector<unsigned long long> results; // tmp buffer
155 |   std::chrono::time_point<std::chrono::steady_clock> start;
156 |   std::vector<std::vector<unsigned long long> > allresults;
157 |   std::vector<double> timings;
158 |   std::vector<double> freqs;
159 |   bool in_progress = false;
160 | 
161 | public:
162 |   BenchmarkState() : unified(evts) {
163 |     results.resize(evts.size());
164 |   }
165 | 
166 |   BenchmarkState(BenchmarkState *oh) : unified(evts) {
167 |     overhead = oh;
168 |     results.resize(evts.size());
169 |   }
170 | 
171 |   // begin measurement of a benchmark iteration
172 |   void begin() {
173 |     assert(!in_progress);
174 |     in_progress = true;
175 | 
176 |     start = std::chrono::steady_clock::now();
177 |     unified.start();
178 |   }
179 | 
180 |   // end measurement of a benchmark iteration
181 |   void end() {
182 |     assert(in_progress);
183 | 
184 |     unified.end(results);
185 |     auto end = std::chrono::steady_clock::now();
186 |     std::chrono::duration<double> secs = end - start;
187 |     double time_in_s = secs.count();
188 |     timings.push_back(time_in_s);
189 |     freqs.push_back(results[0]/(1e9*time_in_s));
190 |     allresults.push_back(results);
191 | 
192 |     in_progress = false;
193 |   }
194 | 
195 |   void printResults(bool verbose, uint32_t n, uint32_t m)
196 |   {
197 |     std::vector<unsigned long long> mins = compute_mins(allresults);
198 |     std::vector<double> avg = compute_averages(allresults);
199 |     double min_timing = *min_element(timings.begin(), timings.end());
200 |     double min_freq = *min_element(freqs.begin(), freqs.end());
201 |     double max_freq = *max_element(freqs.begin(), freqs.end());
202 | 
203 |     // compensate for measuring overhead by subtracting the overhead
204 |     if (overhead != nullptr) {
205 |       std::vector<unsigned long long> oh_mins = compute_mins(overhead->allresults);
206 |       std::vector<double> oh_avg = compute_averages(overhead->allresults);
207 |       min_timing -= *min_element(overhead->timings.begin(), overhead->timings.end());
208 | 
209 |       assert(mins.size() == oh_mins.size());
210 |       for (size_t i = 0; i < mins.size(); i++) {
211 |         mins[i] -= oh_mins[i];
212 |         avg[i] -= oh_avg[i];
213 |       }
214 |     }
215 | 
216 |     double speedinGBs = (double(m) * n * sizeof(uint16_t)) / (min_timing * 1e9);
217 | 
218 |     if (verbose) {
219 |       printf("instructions per cycle %4.2f, cycles per 16-bit word: %4.3f, "
220 |              "instructions per 16-bit word %4.3f \n",
221 |              double(mins[1]) / mins[0], double(mins[0]) / (double(m) * n),
222 |              double(mins[1]) / (double(n) * m));
223 |       // first we display mins
224 |       printf("min: %8llu cycles, %8llu instructions, \t%8llu branch mis., %8llu "
225 |              "cache ref., %8llu cache mis.\n",
226 |              mins[0], mins[1], mins[2], mins[3], mins[4]);
227 |       printf("avg: %8.1f cycles, %8.1f instructions, \t%8.1f branch mis., %8.1f "
228 |              "cache ref., %8.1f cache mis.\n",
229 |              avg[0], avg[1], avg[2], avg[3], avg[4]);
230 |       printf(" %4.3f GB/s \n", speedinGBs);
231 |       printf("estimated clock in range %4.3f GHz to %4.3f GHz\n", min_freq, max_freq);
232 |     } else {
233 |       printf("cycles per 16-bit word:  %4.3f; ref cycles per 16-bit word: %4.3f; speed in GB/s %4.3f \n",
234 |       double(mins[0]) / (double(n) * m), double(mins[5]) / (double(n) * m), speedinGBs);
235 |     }
236 |   }
237 | };
238 | 
239 | // initialise all subarrays of the vdata array
240 | template <class C>
241 | void init_vdata(C &vdata)
242 | {
243 |   std::mt19937 gen;
244 | 
245 |   for (size_t k = 0; k < vdata.size(); k++) {
246 |     for (size_t k2 = 0; k2 < vdata[k].size(); k2++) {
247 |       vdata[k][k2] = gen() & 0xffff; // initialise to random integer
248 |     }
249 |   }
250 | }
251 | 
252 | // Read all array entries and sum them up.  Discard the sum.
253 | // The purpose of this is to ensure that the array has been
254 | // recently accessed.
255 | template <class C>
256 | void touch(C &vdata)
257 | {
258 |   int sum = 0;
259 |   volatile int total;
260 | 
261 |   for (size_t k = 0; k < vdata.size(); k++) {
262 |     sum += vdata[k];
263 |   }
264 | 
265 |   total = sum;
266 | }
267 | 
268 | /**
269 |  * @brief Verify fn against the scalar reference once, then time it.
270 |  *
271 |  * @param n          Number of integers.
272 |  * @param m          Number of arrays.
273 |  * @param iterations Number of iterations.
274 |  * @param fn         Target function pointer.
275 |  * @param options    Command line options
276 |  * @return           Benchmark results.
277 |  */
278 | template <class C>
279 | BenchmarkState benchmarkMany(C & vdata, BenchmarkState *overhead, uint32_t n, uint32_t m,
280 |                    uint32_t iterations,
281 |                    pospopcnt_u16_method_type fn, int options) {
282 | #ifdef ALIGN
283 |   for (auto &x : vdata) {
284 |     assert(get_alignment(x.data()) % 64 == 0); // at least 64-byte aligned
285 |   }
286 | #endif
287 |   BenchmarkState bench(overhead);
288 | 
289 |   init_vdata(vdata);
290 | 
291 |   uint32_t test_iterations = 1; // we run one test iteration
292 |   for (uint32_t i = 0; i < test_iterations; i++) {
293 |     std::vector<std::vector<flags_type> > correctflags(m,
294 |                                                      std::vector<flags_type>(16));
295 |     for (size_t k = 0; k < m; k++) {
296 |       pospopcnt_u16_scalar(vdata[k].data(), vdata[k].size(),
297 |                            correctflags[k].data()); // this is our gold standard
298 |     }
299 |     std::vector<std::vector<flags_type> > flags(m, std::vector<flags_type>(16));
300 |     for (size_t k = 0; k < m; k++) {
301 |       fn(vdata[k].data(), vdata[k].size(), flags[k].data());
302 |     }
303 | 
304 |     uint64_t tot_obs = 0;
305 |     for (size_t km = 0; km < m; ++km)
306 |       for (size_t k = 0; k < 16; ++k)
307 |         tot_obs += flags[km][k];
308 |     if (tot_obs == 0 && options & OPT_TEST) { // when a method is not supported it returns all zero
309 |       printf("method not supported\n");
310 |     }
311 |     for (size_t km = 0; km < m; ++km) {
312 |       for (size_t k = 0; k < 16; k++) {
313 |         if (correctflags[km][k] != flags[km][k]) {
314 |           if (options & OPT_TEST) {
315 |             printf("bug:\n");
316 |             printf("expected : ");
317 |             print16(correctflags[km].data());
318 |             printf("got      : ");
319 |             print16(flags[km].data());
320 |           }
321 |         }
322 |       }
323 |     }
324 |   }
325 | 
326 |   for (uint32_t i = 0; i < iterations; i++) {
327 |     std::vector<std::vector<flags_type> > flags(m, std::vector<flags_type>(16));
328 |     bench.begin();
329 |     for (size_t k = 0; k < m; k++) {
330 |       if (options & OPT_TOUCH) {
331 |         touch(vdata[k]);
332 |       }
333 |       fn(vdata[k].data(), vdata[k].size(), flags[k].data());
334 |     }
335 |     bench.end();
336 |   }
337 | 
338 |   bench.printResults(options & OPT_VERBOSE, n, m);
339 | 
340 |   return bench;
341 | }
342 | 
343 | template <class C>
344 | void  benchmarkCopy(C & vdata, BenchmarkState *overhead, uint32_t n, uint32_t m,
345 |                     uint32_t iterations, int options) {
346 |   size_t maxsize = 0;
347 | #ifdef ALIGN
348 |   for (auto &x : vdata) {
349 |      if(maxsize < x.size()) maxsize = x.size();
350 |      assert(get_alignment(x.data()) % 64 == 0); // at least 64-byte aligned
351 |   }
352 | #endif
353 |   for (auto &x : vdata) {
354 |      if(maxsize < x.size()) maxsize = x.size();
355 |   }
356 | 
357 |   init_vdata(vdata);
358 | 
359 |   BenchmarkState bench(overhead);
360 |   std::vector<uint16_t> copybuf(maxsize);
361 | 
362 |   for (uint32_t i = 0; i < iterations; i++) {
363 |     std::vector<std::vector<flags_type> > flags(m, std::vector<flags_type>(16));
364 |     bench.begin();
365 |     for (size_t k = 0; k < m; k++) {
366 |       if (options & OPT_TOUCH) {
367 |         touch(vdata[k]);
368 |       }
369 |       ::memcpy(copybuf.data(), vdata[k].data(), vdata[k].size() * sizeof(uint16_t));
370 |     }
371 |     bench.end();
372 |   }
373 | 
374 |   bench.printResults(options & OPT_VERBOSE, n, m);
375 | }
376 | 
377 | static void print_usage(char *command) {
378 |   printf(" Try %s -n 100000 -i 15 -v\n", command);
379 |   printf("-c compensate overhead in measurements\n");
380 |   printf("-f force use of suboptimal benchmark parameters\n");
381 |   printf("-m number of arrays\n");
382 |   printf("-n number of 16-bit words per array\n");
383 |   printf("-i number of iterations\n");
384 |   printf("-t load arrays into cache before benchmarking\n");
385 |   printf("-v enable verbose (perf counter) output\n");
386 | }
387 | 
388 | int main(int argc, char **argv) {
389 |   size_t n = 10000000;
390 |   size_t m = 1;
391 |   size_t iterations = 0;
392 |   int options = OPT_TEST;
393 |   int c;
394 | 
395 |   while ((c = getopt(argc, argv, "cfi:hm:n:tv")) != -1) {
396 |     switch (c) {
397 |     case 'c':
398 |       options |= OPT_COMPENSATE;
399 |       break;
400 |     case 'f':
401 |       options |= OPT_FORCE;
402 |       break;
403 |     case 't':
404 |       options |= OPT_TOUCH;
405 |       break;
406 |     case 'n':
407 |       n = atoll(optarg);
408 |       break;
409 |     case 'm':
410 |       m = atoll(optarg);
411 |       break;
412 |     case 'v':
413 |       options |= OPT_VERBOSE;
414 |       break;
415 |     case 'h':
416 |       print_usage(argv[0]);
417 |       return EXIT_SUCCESS;
418 |     case 'i':
419 |       iterations = atoi(optarg);
420 |       break;
421 |     default:
422 |       print_usage(argv[0]);
423 |       return EXIT_FAILURE;
424 |     }
425 |   }
426 | 
427 |   if (n > UINT32_MAX) {
428 |     printf("setting n to %u \n", UINT32_MAX);
429 |     n = UINT32_MAX;
430 |   }
431 | 
432 |   if (iterations > UINT32_MAX) {
433 |     printf("setting iterations to %u \n", UINT32_MAX);
434 |     iterations = UINT32_MAX;
435 |   }
436 | 
437 |   if (iterations == 0) {
438 |     iterations = 100;
439 |   }
440 | 
441 |   size_t min_volume = 1000000;
442 |   if(~options & OPT_FORCE && n != 0 && m * n < min_volume) {
443 |     printf("The benchmark is designed to measure the time in units of m*n inputs.\n");
444 |     printf("But your choices make m*n too small, so increasing m.\n");
445 |     while(m * n < min_volume) {
446 |        m++;
447 |     }
448 |   }
449 | 
450 |   printf("n = %zu m = %zu \n", n, m);
451 |   printf("iterations = %zu \n", iterations);
452 |   if (n == 0) {
453 |     printf("n cannot be zero.\n");
454 |     return EXIT_FAILURE;
455 |   }
456 | 
457 |   size_t array_in_bytes = sizeof(uint16_t) * n * m;
458 |   if (array_in_bytes < 1024) {
459 |     printf("array size: %zu B\n", array_in_bytes);
460 |   } else if (array_in_bytes < 1024 * 1024) {
461 |     printf("array size: %.3f kB\n", array_in_bytes / 1024.);
462 |   } else {
463 |     printf("array size: %.3f MB\n", array_in_bytes / (1024 * 1024.));
464 |   }
465 | 
466 |   int maxtrial = 3;
467 | #ifdef ALIGN
468 |   std::vector<std::vector<uint16_t, AlignedSTLAllocator<uint16_t, 64> > > vdata(
469 |       m, std::vector<uint16_t, AlignedSTLAllocator<uint16_t, 64> >(n));
470 | #else
471 |   std::vector<std::vector<uint16_t> > vdata(m, std::vector<uint16_t>(n));
472 | #endif
473 | 
474 |   printf("%-40s\t", "overhead");
475 |   auto ohbench = benchmarkMany(vdata, nullptr, n, m, iterations, pospopcnt_dummy, options & ~OPT_TEST);
476 |   BenchmarkState *overhead = options & OPT_COMPENSATE ? &ohbench : nullptr;
477 | 
478 |   printf("%-40s\t", "memcpy");
479 |   benchmarkCopy(vdata, overhead, n, m, iterations, options);
480 |   printf("\n");
481 |    
482 |   for (int t = 0; t < maxtrial; t++) {
483 |     printf("\n== Trial %d out of %d \n", t + 1, maxtrial);
484 |     for (size_t k = 0; methods[k].name != NULL; k++) {
485 |       printf("\n");
486 |       printf("%-40s\t", methods[k].name);
487 |       fflush(NULL);
488 |       benchmarkMany(vdata, overhead, n, m, iterations, methods[k].method, options);
489 |       if (options & OPT_VERBOSE)
490 |         printf("\n");
491 |     }
492 |   }
493 |   if (~options & OPT_VERBOSE)
494 |     printf("Try -v to get more details.\n");
495 | 
496 |   return EXIT_SUCCESS;
497 | }
498 | #else //  __linux__
499 | 
500 | #include <stdio.h>
501 | #include <stdlib.h>
502 | 
503 | int main() {
504 |   printf("This is a linux-specific benchmark\n");
505 |   return EXIT_SUCCESS;
506 | }
507 | 
508 | #endif
509 | 
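
printResults reports throughput as the number of input bytes divided by the fastest observed time. A minimal sketch of that computation is shown below, assuming a hypothetical free function throughput_gbs that is not part of the repository. It converts to double before multiplying so that the product of n and m cannot wrap around in 32-bit arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: GB/s for m arrays of n 16-bit words processed in
// the given number of seconds (1 GB taken as 1e9 bytes, as in printResults).
static double throughput_gbs(std::uint32_t n, std::uint32_t m,
                             double seconds) {
  // Convert first, multiply second: n * m may exceed 2^32.
  const double bytes = static_cast<double>(n) * m * sizeof(std::uint16_t);
  return bytes / (seconds * 1e9);
}
```

For example, one array of 1000000 words processed in one millisecond corresponds to 2 GB/s.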


--------------------------------------------------------------------------------
/benchmark/linux/linux-perf-events.h:
--------------------------------------------------------------------------------
 1 | // https://github.com/WojciechMula/toys/blob/master/000helpers/linux-perf-events.h
 2 | #pragma once
 3 | #ifdef __linux__
 4 | 
 5 | #include <asm/unistd.h>       // for __NR_perf_event_open
 6 | #include <linux/perf_event.h> // for perf event constants
 7 | #include <sys/ioctl.h>        // for ioctl
 8 | #include <unistd.h>           // for syscall
 9 | #include <iostream>
10 | #include <cerrno>  // for errno
11 | #include <cstring> // for memset
12 | #include <stdexcept>
13 | 
14 | #include <vector>
15 | 
16 | template <int TYPE = PERF_TYPE_HARDWARE> class LinuxEvents {
17 |   int group;
18 |   bool working;
19 |   perf_event_attr attribs;
20 |   int num_events;
21 |   std::vector<uint64_t> temp_result_vec;
22 |   std::vector<int> fds;
23 | 
24 | public:
25 |   explicit LinuxEvents(std::vector<int> config_vec) : group(-1), working(true) {
26 |     memset(&attribs, 0, sizeof(attribs));
27 |     attribs.type = TYPE;
28 |     attribs.size = sizeof(attribs);
29 |     attribs.disabled = 1;
30 |     attribs.exclude_kernel = 1;
31 |     attribs.exclude_hv = 1;
32 | 
33 |     attribs.sample_period = 0;
34 |     attribs.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
35 |     const int pid = 0;  // the current process
36 |     const int cpu = -1; // any CPU
37 |     const unsigned long flags = 0;
38 | 
39 |     num_events = config_vec.size();
40 |     uint32_t i = 0;
41 |     for (auto config : config_vec) {
42 |       attribs.config = config;
43 |       int fd = syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags);
44 |       if (fd == -1) {
45 |         report_error("perf_event_open");
46 |       }
47 | 
48 |       fds.push_back(fd);
49 |       if (group == -1) {
50 |         group = fd;
51 |       }
52 |     }
53 | 
54 |     temp_result_vec.resize(num_events * 2 + 1);
55 |   }
56 | 
57 |   ~LinuxEvents() {
58 |     for (auto fd : fds) {
59 |       close(fd);
60 |     }
61 |   }
62 | 
63 |   inline void start() {
64 |     if (ioctl(group, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
65 |       report_error("ioctl(PERF_EVENT_IOC_RESET)");
66 |     }
67 | 
68 |     if (ioctl(group, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
69 |       report_error("ioctl(PERF_EVENT_IOC_ENABLE)");
70 |     }
71 |   }
72 | 
73 |   inline void end(std::vector<unsigned long long> &results) {
74 |     if (ioctl(group, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP) == -1) {
75 |       report_error("ioctl(PERF_EVENT_IOC_DISABLE)");
76 |     }
77 | 
78 |     if (read(group, &temp_result_vec[0], temp_result_vec.size() * 8) == -1) {
79 |       report_error("read");
80 |     }
81 |     // our actual results are in slots 1,3,5, ... of this structure
82 |     // we really should be checking our ids obtained earlier to be safe
83 |     for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) {
84 |       results[i / 2] = temp_result_vec[i];
85 |     }
86 |   }
87 | 
88 | private:
89 |   void report_error(const std::string &context) {
90 |     if (working)
91 |       std::cerr << (context + ": " + std::string(strerror(errno))) << std::endl;
92 |     working = false;
93 |   }
94 | };
95 | #endif
96 | 
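
LinuxEvents::end reads the whole counter group at once. With PERF_FORMAT_GROUP | PERF_FORMAT_ID the kernel returns a count nr followed by nr (value, id) pairs, which is why the values sit in slots 1, 3, 5 and so on. The sketch below decodes such a buffer without touching the perf interface; extract_group_values is a hypothetical name, and the buffer in the usage note is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode a buffer laid out as: nr, value0, id0, value1, id1, ...
// Returns only the values, stopping early if the buffer is truncated.
static std::vector<std::uint64_t>
extract_group_values(const std::vector<std::uint64_t> &buf) {
  std::vector<std::uint64_t> values;
  if (buf.empty())
    return values;
  const std::uint64_t nr = buf[0];
  // Pair k occupies slots 2k + 1 (value) and 2k + 2 (id).
  for (std::uint64_t k = 0; k < nr && 2 * k + 2 < buf.size(); k++)
    values.push_back(buf[2 * k + 1]);
  return values;
}
```

Given the buffer {3, 100, 11, 200, 12, 300, 13}, the function returns the three values 100, 200 and 300 and skips the ids 11, 12 and 13.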


--------------------------------------------------------------------------------
/benchmark/linux/memalloc.h:
--------------------------------------------------------------------------------
 1 | #pragma once
 2 | 
 3 | template <class T, size_t alignment> T *moveToBoundary(T *inbyte) {
 4 |   return reinterpret_cast<T *>(
 5 |       (reinterpret_cast<uintptr_t>(inbyte) + (alignment - 1)) &
 6 |       ~(alignment - 1));
 7 | }
 8 | 
 9 | // use this when calling STL object if you want
10 | // their memory to be aligned on cache lines
11 | template <class T, size_t alignment> class AlignedSTLAllocator {
12 | public:
13 |   // type definitions
14 |   typedef T value_type;
15 |   typedef T *pointer;
16 |   typedef const T *const_pointer;
17 |   typedef T &reference;
18 |   typedef const T &const_reference;
19 |   typedef std::size_t size_type;
20 |   typedef std::ptrdiff_t difference_type;
21 | 
22 |   // rebind allocator to type U
23 |   template <class U> struct rebind {
24 |     typedef AlignedSTLAllocator<U, alignment> other;
25 |   };
26 | 
27 |   pointer address(reference value) const { return &value; }
28 |   const_pointer address(const_reference value) const { return &value; }
29 | 
30 |   /* constructors and destructor
31 |    * - nothing to do because the allocator has no state
32 |    */
33 |   AlignedSTLAllocator() {}
34 |   AlignedSTLAllocator(const AlignedSTLAllocator &) {}
35 |   template <typename U>
36 |   AlignedSTLAllocator(const AlignedSTLAllocator<U, alignment> &) {}
37 |   ~AlignedSTLAllocator() {}
38 | 
39 |   // return maximum number of elements that can be allocated
40 |   size_type max_size() const throw() {
41 |     return (std::numeric_limits<std::size_t>::max)() / sizeof(T);
42 |   }
43 | 
44 |   /*
45 |    *  allocate but don't initialize num elements of type T
46 |    *
47 |    *  This implementation is potentially unsafe on some compilers.
48 |    */
49 |   pointer allocate(size_type num, const void * = 0) {
50 |     /**
51 |      * The nasty trick here is to store, just before the aligned block,
52 |      * the offset back to the pointer returned by operator new. The
53 |      * alternative is some data structure like a hash table, which might be slow.
54 |      */
55 |     size_t *buffer = reinterpret_cast<size_t *>(
56 |         ::operator new(sizeof(uintptr_t) + (num + alignment) * sizeof(T)));
57 |     size_t *answer = moveToBoundary<size_t, alignment>(buffer + 1);
58 |     *(answer - 1) = reinterpret_cast<uintptr_t>(answer) -
59 |                     reinterpret_cast<uintptr_t>(buffer);
60 |     return reinterpret_cast<pointer>(answer);
61 |   }
62 | 
63 |   void construct(pointer p, const T &value) {
64 |     // initialize memory with placement new
65 |     new (p) T(value);
66 |   }
67 | 
68 |   // destroy elements of initialized storage p
69 |   void destroy(pointer p) { p->~T(); }
70 | 
71 |   // deallocate storage p of deleted elements
72 |   void deallocate(pointer p, size_type /*num*/) {
73 |     const size_t *assize_t = reinterpret_cast<size_t *>(p);
74 |     const size_t offset = assize_t[-1];
75 |     ::operator delete(
76 |         reinterpret_cast<pointer>(reinterpret_cast<uintptr_t>(p) - offset));
77 |   }
78 | };
79 | 
80 | // for our purposes, we don't want to distinguish between allocators.
81 | template <class T1, size_t t, class T2>
82 | bool operator==(const AlignedSTLAllocator<T1, t> &, const T2 &) throw() {
83 |   return true;
84 | }
85 | 
86 | template <class T1, size_t t, class T2>
87 | bool operator!=(const AlignedSTLAllocator<T1, t> &, const T2 &) throw() {
88 |   return false;
89 | }
90 | // typical cache line
91 | typedef AlignedSTLAllocator<uint32_t, 64> cacheallocator;
92 | 
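
moveToBoundary rounds an address up to the next multiple of the alignment by adding alignment - 1 and masking off the low bits. The arithmetic can be sketched on plain integers as follows; round_up is a hypothetical name, and the sketch assumes the alignment is a power of two, as the mask only works in that case.

```cpp
#include <cassert>
#include <cstdint>

// Round addr up to the next multiple of alignment.
// Assumption: alignment is a power of two.
static inline std::uintptr_t round_up(std::uintptr_t addr,
                                      std::uintptr_t alignment) {
  return (addr + (alignment - 1)) & ~(alignment - 1);
}
```

An address that is already aligned is returned unchanged, so 64 stays 64 while 65 becomes 128.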


--------------------------------------------------------------------------------
/benchmark/linux/stream_benchmark.cpp:
--------------------------------------------------------------------------------
  1 | #ifdef __linux__
  2 | #include <cassert>
  3 | #include <cinttypes>
  4 | #include <cstdio>
  5 | #include <cstdlib>
  6 | #include <cstring>
  7 | #include <stdexcept> // std::runtime_error
  8 | #include <iomanip>
  9 | #include <iostream>
 10 | #include <algorithm>
 11 | #include <chrono>
 12 | #include <libgen.h>
 13 | #include <random>
 14 | #include <string>
 15 | #include <vector>
 16 | 
 17 | #include "linux-perf-events.h"
 18 | #include "aligned_alloc.h"
 19 | #include "pospopcnt_avx512bw.h"
 20 | #include "pospopcnt.h"
 21 | #ifdef ALIGN
 22 | #include "memalloc.h"
 23 | #define memory_allocate(size) aligned_alloc(64, (size))
 24 | #else
 25 | #define memory_allocate(size) malloc(size)
 26 | #endif
 27 | 
 28 | // Function pointer definition.
 29 | typedef void (*pospopcnt_u16_method_type)(const uint16_t *data, uint32_t len,
 30 |                                           uint32_t *flags);
 31 | #define PPOPCNT_NUMBER_METHODS 4
 32 | pospopcnt_u16_method_type pospopcnt_u16_methods[] = {
 33 |   pospopcnt_u16_scalar, pospopcnt_u16_avx512bw_harvey_seal_1KB, pospopcnt_u16_avx512bw_harvey_seal_512B, pospopcnt_u16_avx512bw_harvey_seal_256B
 34 | };
 35 | 
 36 | static const char *const pospopcnt_u16_method_names[] = {
 37 |   "pospopcnt_u16_scalar", "pospopcnt_u16_avx512bw_harvey_seal_1KB", "pospopcnt_u16_avx512bw_harvey_seal_512B", "pospopcnt_u16_avx512bw_harvey_seal_256B"
 38 | };
 39 | 
 40 | 
 41 | template <class C>
 42 | double benchmark(C & vdata, uint32_t n, pospopcnt_u16_method_type fn) {
 43 |   std::vector<double> timings;
 44 |   std::vector<uint32_t> correctflags(16);
 45 |   pospopcnt_u16_scalar(vdata.data(), n, correctflags.data()); // this is our gold standard
 46 |   for(size_t i = 0; i < 100; i++) {
 47 |     std::vector<uint32_t> flags(16);
 48 |     auto start = std::chrono::steady_clock::now();
 49 |     fn(vdata.data(), n, flags.data());
 50 |     auto end = std::chrono::steady_clock::now();
 51 |     if(correctflags != flags) { throw std::runtime_error("bug\n"); }
 52 |     std::chrono::duration<double> secs = end - start;
 53 |     double time_in_s = secs.count();
 54 |     timings.push_back(time_in_s);
 55 |   }
 56 |   double min_timing = *min_element(timings.begin(), timings.end());
 57 |   double speedinGBs = (n * sizeof(uint16_t)) / (min_timing * 1000000000.0);
 58 |   return speedinGBs;
 59 | }
 60 | 
 61 | 
 62 | int main(int argc, char **argv) {
 63 |   size_t max_val = 536870912;
 64 |   std::vector<uint16_t> vdata(max_val);
 65 |   std::random_device rd;
 66 |   std::mt19937 gen(rd());
 67 |   std::uniform_int_distribution<> dis(0, 0xFFFF);
 68 |   for (size_t k2 = 0; k2 < vdata.size(); k2++) {
 69 |     vdata[k2] = dis(gen); // random init.
 70 |   }
 71 |   printf("#");  
 72 |   for (size_t k = 0; k < PPOPCNT_NUMBER_METHODS; k++) {
 73 |       printf("\t");
 74 |       printf("%-40s\t", pospopcnt_u16_method_names[k]);
 75 |       fflush(NULL);
 76 |   }
 77 |   printf("\n");
 78 |    
 79 |   for (int n = 1024; n <= max_val; n*=2) {
 80 |     printf("%d ", n);
 81 |     for (size_t k = 0; k < PPOPCNT_NUMBER_METHODS; k++) {
 82 |       printf("\t");
 83 |       fflush(NULL);
 84 |       double speed = benchmark(vdata,n, pospopcnt_u16_methods[k]);
 85 |       printf("%f\t",speed);
 86 |     }
 87 |     printf("\n");
 88 |   }
 89 |   return EXIT_SUCCESS;
 90 | }
 91 | #else //  __linux__
 92 | 
 93 | #include <stdio.h>
 94 | #include <stdlib.h>
 95 | 
 96 | int main() {
 97 |   printf("This is a linux-specific benchmark\n");
 98 |   return EXIT_SUCCESS;
 99 | }
100 | 
101 | #endif
102 | 


--------------------------------------------------------------------------------
/goscript.sh:
--------------------------------------------------------------------------------
 1 | max=30
 2 | n=10
 3 | #output=golike
 4 | 
 5 | #for i in `seq 0 $max`
 6 | #do
 7 | #	for j in `seq $n`
 8 | #	do
 9 | #		./golike_benchmark $((2**i))
10 | #	done
11 | #done >${output}_ij.out
12 | 
13 | for j in `seq $n`
14 | do
15 | 	./golike_benchmark 1
16 | 	for i in `seq 1 $max`
17 | 	do
18 | 		./golike_benchmark $((2**i))
19 | 		./golike_benchmark $((2**(i-1)*3))
20 | 	done
21 | done
22 | 


--------------------------------------------------------------------------------
/include/pospopcnt.h:
--------------------------------------------------------------------------------
 1 | #ifndef POSPOPCNT_H
 2 | #define POSPOPCNT_H
 3 | 
 4 | #include <stdint.h>
 5 | 
 6 | #if defined(__GNUC__) && !defined(__clang__)
 7 | __attribute__((optimize("no-tree-vectorize")))
 8 | #endif
 9 | // Given a stream of len 16-bit words (in data), accumulates a
10 | // histogram of 16 counts into flags: flags[i] is increased by the
11 | // number of words that have bit i set (i = 0, 1, ..., 15).
12 | static void pospopcnt_u16_scalar(const uint16_t *data, uint32_t len,
13 |                                  flags_type *flags) {
14 | #if defined(__clang__)
15 | #pragma clang loop vectorize(disable)
16 | #endif
17 |   for (uint32_t i = 0; i < len; ++i) {
18 |       uint64_t w = data[i];
19 |       flags[0] += ((w >> 0) & 1);
20 |       flags[1] += ((w >> 1) & 1);
21 |       flags[2] += ((w >> 2) & 1);
22 |       flags[3] += ((w >> 3) & 1);
23 |       flags[4] += ((w >> 4) & 1);
24 |       flags[5] += ((w >> 5) & 1);
25 |       flags[6] += ((w >> 6) & 1);
26 |       flags[7] += ((w >> 7) & 1);
27 |       flags[8] += ((w >> 8) & 1);
28 |       flags[9] += ((w >> 9) & 1);
29 |       flags[10] += ((w >> 10) & 1);
30 |       flags[11] += ((w >> 11) & 1);
31 |       flags[12] += ((w >> 12) & 1);
32 |       flags[13] += ((w >> 13) & 1);
33 |       flags[14] += ((w >> 14) & 1);
34 |       flags[15] += ((w >> 15) & 1);
35 |   }
36 | }
37 | 
38 | #endif // POSPOPCNT_H
39 | 


--------------------------------------------------------------------------------
/include/pospopcnt_avx512bw.h:
--------------------------------------------------------------------------------
  1 | #ifndef POSPOPCNT_AVX512BW_H
  2 | #define POSPOPCNT_AVX512BW_H
  3 | 
  4 | #include <x86intrin.h>
  5 | #include <stdint.h>
  6 | 
  7 | #if defined(__AVX512BW__) && __AVX512BW__ == 1
  8 | 
  9 | // Carry-save adder used by the harvey_seal kernels: *h = majority(c, b, *l), *l = c ^ b ^ *l.
 10 | static inline void pospopcnt_csa_avx512(__m512i *__restrict__ h,
 11 |                                         __m512i *__restrict__ l, __m512i b,
 12 |                                         __m512i c) {
 13 |   *h = _mm512_ternarylogic_epi32(c, b, *l, 0xE8); // 11101000
 14 |   *l = _mm512_ternarylogic_epi32(c, b, *l, 0x96); // 10010110
 15 | }
 16 | 
 17 | // Given a stream of len 16-bit words (in array), accumulates a histogram
 18 | // of 16 counts into flags, where flags[j] is the number of words with
 19 | // bit j set (j = 0,1,...,15). flags is not zeroed; counts are added.
 20 | //
 21 | // Uses 1KB blocks (16 vectors of 64 bytes per inner-loop iteration).
 22 | static void pospopcnt_u16_avx512bw_harvey_seal_1KB(const uint16_t *array,
 23 |                                                uint32_t len, flags_type *flags) {
 24 |   for (uint32_t i = len - (len % (32 * 16)); i < len; ++i) {
 25 |     for (int j = 0; j < 16; ++j) {
 26 |       flags[j] += (((array[i]) >> j) & 1);
 27 |     }
 28 |   }
 29 | 
 30 |   const __m512i *data = (const __m512i *)array;
 31 |   __m512i v1 = _mm512_setzero_si512();
 32 |   __m512i v2 = _mm512_setzero_si512();
 33 |   __m512i v4 = _mm512_setzero_si512();
 34 |   __m512i v8 = _mm512_setzero_si512();
 35 |   __m512i v16 = _mm512_setzero_si512();
 36 |   __m512i twosA, twosB, foursA, foursB, eightsA, eightsB;
 37 |   __m512i one = _mm512_set1_epi16(1);
 38 |   __m512i counter[16];
 39 | 
 40 |   const size_t size = len / 32;
 41 |   const uint64_t limit = size - size % 16;
 42 | 
 43 |   uint16_t buffer[32];
 44 | 
 45 |   uint64_t i = 0;
 46 |   while (i < limit) {
 47 |     for (size_t i = 0; i < 16; ++i)
 48 |       counter[i] = _mm512_setzero_si512();
 49 | 
 50 |     size_t thislimit = limit;
 51 |     if (thislimit - i >= (1 << 16))
 52 |       thislimit = i + (1 << 16) - 1;
 53 | 
 54 |     for (/**/; i < thislimit; i += 16) {
 55 | #define U(pos)                                                                 \
 56 |   {                                                                            \
 57 |     counter[pos] = _mm512_add_epi16(                                           \
 58 |         counter[pos], _mm512_and_si512(v16, one));                             \
 59 |     v16 = _mm512_srli_epi16(v16, 1);                                           \
 60 |   }
 61 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0),
 62 |                            _mm512_loadu_si512(data + i + 1));
 63 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2),
 64 |                            _mm512_loadu_si512(data + i + 3));
 65 |       pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB);
 66 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 4),
 67 |                            _mm512_loadu_si512(data + i + 5));
 68 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 6),
 69 |                            _mm512_loadu_si512(data + i + 7));
 70 |       pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB);
 71 |       pospopcnt_csa_avx512(&eightsA, &v4, foursA, foursB);
 72 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 8),
 73 |                            _mm512_loadu_si512(data + i + 9));
 74 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 10),
 75 |                            _mm512_loadu_si512(data + i + 11));
 76 |       pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB);
 77 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 12),
 78 |                            _mm512_loadu_si512(data + i + 13));
 79 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 14),
 80 |                            _mm512_loadu_si512(data + i + 15));
 81 |       pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB);
 82 |       pospopcnt_csa_avx512(&eightsB, &v4, foursA, foursB);
 83 |       U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13)
 84 |       U(14) U(15) // Updates
 85 |       pospopcnt_csa_avx512(&v16, &v8, eightsA, eightsB);
 86 |     }
 87 |     // Update the counters after the last iteration.
 88 |     for (size_t i = 0; i < 16; ++i)
 89 |       U(i)
 90 | #undef U
 91 | 
 92 |     for (size_t i = 0; i < 16; ++i) {
 93 |       _mm512_storeu_si512((__m512i *)buffer, counter[i]);
 94 |       for (size_t z = 0; z < 32; z++) {
 95 |         flags[i] += 16 * (flags_type)buffer[z];
 96 |       }
 97 |     }
 98 |   }
 99 | 
100 |   _mm512_storeu_si512((__m512i *)buffer, v1);
101 |   for (size_t i = 0; i < 32; i++) {
102 |     for (int j = 0; j < 16; j++) {
103 |       flags[j] += 1 * ((buffer[i] >> j) & 1);
104 |     }
105 |   }
106 | 
107 |   _mm512_storeu_si512((__m512i *)buffer, v2);
108 |   for (size_t i = 0; i < 32; i++) {
109 |     for (int j = 0; j < 16; j++) {
110 |       flags[j] += 2 * ((buffer[i] >> j) & 1);
111 |     }
112 |   }
113 | 
114 |   _mm512_storeu_si512((__m512i *)buffer, v4);
115 |   for (size_t i = 0; i < 32; i++) {
116 |     for (int j = 0; j < 16; j++) {
117 |       flags[j] += 4 * ((buffer[i] >> j) & 1);
118 |     }
119 |   }
120 | 
121 |   _mm512_storeu_si512((__m512i *)buffer, v8);
122 |   for (size_t i = 0; i < 32; i++) {
123 |     for (int j = 0; j < 16; j++) {
124 |       flags[j] += 8 * ((buffer[i] >> j) & 1);
125 |     }
126 |   }
127 | }
128 | 
129 | // Given a stream of len 16-bit words (in array), accumulates a histogram
130 | // of 16 counts into flags, where flags[j] is the number of words with
131 | // bit j set (j = 0,1,...,15). flags is not zeroed; counts are added.
132 | //
133 | // Uses 512B blocks (8 vectors of 64 bytes per inner-loop iteration).
134 | static void pospopcnt_u16_avx512bw_harvey_seal_512B(const uint16_t *array,
135 |                                                     uint32_t len,
136 |                                                     flags_type *flags) {
137 |   for (uint32_t i = len - (len % (32 * 8)); i < len; ++i) {
138 |     for (int j = 0; j < 16; ++j) {
139 |       flags[j] += (((array[i]) >> j) & 1);
140 |     }
141 |   }
142 | 
143 |   const __m512i *data = (const __m512i *)array;
144 |   __m512i v1 = _mm512_setzero_si512();
145 |   __m512i v2 = _mm512_setzero_si512();
146 |   __m512i v4 = _mm512_setzero_si512();
147 |   __m512i v8 = _mm512_setzero_si512();
148 |   __m512i twosA, twosB, foursA, foursB;
149 |   __m512i one = _mm512_set1_epi16(1);
150 |   __m512i counter[16];
151 | 
152 |   const size_t size = len / 32;
153 |   const uint64_t limit = size - size % 8;
154 | 
155 |   uint16_t buffer[32];
156 | 
157 |   uint64_t i = 0;
158 |   while (i < limit) {
159 |     for (size_t i = 0; i < 16; ++i)
160 |       counter[i] = _mm512_setzero_si512();
161 | 
162 |     size_t thislimit = limit;
163 |     if (thislimit - i >= (1 << 16))
164 |       thislimit = i + (1 << 16) - 1;
165 | 
166 |     for (/**/; i < thislimit; i += 8) {
167 | #define U(pos)                                                                 \
168 |   {                                                                            \
169 |     counter[pos] = _mm512_add_epi16(                                           \
170 |         counter[pos], _mm512_and_si512(v8, one));                              \
171 |     v8 = _mm512_srli_epi16(v8, 1);                                             \
172 |   }
173 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0),
174 |                            _mm512_loadu_si512(data + i + 1));
175 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2),
176 |                            _mm512_loadu_si512(data + i + 3));
177 |       pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB);
178 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 4),
179 |                            _mm512_loadu_si512(data + i + 5));
180 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 6),
181 |                            _mm512_loadu_si512(data + i + 7));
182 |       pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB);
183 |       U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13)
184 |       U(14) U(15) // Updates
185 |       pospopcnt_csa_avx512(&v8, &v4, foursA, foursB);
186 |     }
187 |     // Update the counters after the last iteration.
188 |     for (size_t i = 0; i < 16; ++i)
189 |       U(i)
190 | #undef U
191 | 
192 |     for (size_t i = 0; i < 16; ++i) {
193 |       _mm512_storeu_si512((__m512i *)buffer, counter[i]);
194 |       for (size_t z = 0; z < 32; z++) {
195 |         flags[i] += 8 * (flags_type)buffer[z];
196 |       }
197 |     }
198 |   }
199 | 
200 |   _mm512_storeu_si512((__m512i *)buffer, v1);
201 |   for (size_t i = 0; i < 32; i++) {
202 |     for (int j = 0; j < 16; j++) {
203 |       flags[j] += 1 * ((buffer[i] >> j) & 1);
204 |     }
205 |   }
206 | 
207 |   _mm512_storeu_si512((__m512i *)buffer, v2);
208 |   for (size_t i = 0; i < 32; i++) {
209 |     for (int j = 0; j < 16; j++) {
210 |       flags[j] += 2 * ((buffer[i] >> j) & 1);
211 |     }
212 |   }
213 | 
214 |   _mm512_storeu_si512((__m512i *)buffer, v4);
215 |   for (size_t i = 0; i < 32; i++) {
216 |     for (int j = 0; j < 16; j++) {
217 |       flags[j] += 4 * ((buffer[i] >> j) & 1);
218 |     }
219 |   }
220 | 
221 |   _mm512_storeu_si512((__m512i *)buffer, v8);
222 |   for (size_t i = 0; i < 32; i++) {
223 |     for (int j = 0; j < 16; j++) {
224 |       flags[j] += 8 * ((buffer[i] >> j) & 1);
225 |     }
226 |   }
227 | }
228 | 
229 | 
230 | 
231 | // Given a stream of len 16-bit words (in array), accumulates a histogram
232 | // of 16 counts into flags, where flags[j] is the number of words with
233 | // bit j set (j = 0,1,...,15). flags is not zeroed; counts are added.
234 | //
235 | // Uses 256B blocks (4 vectors of 64 bytes per inner-loop iteration).
236 | static void pospopcnt_u16_avx512bw_harvey_seal_256B(const uint16_t *array,
237 |                                                     uint32_t len,
238 |                                                     flags_type *flags) {
239 |   for (uint32_t i = len - (len % (32 * 4)); i < len; ++i) {
240 |     for (int j = 0; j < 16; ++j) {
241 |       flags[j] += (((array[i]) >> j) & 1);
242 |     }
243 |   }
244 | 
245 |   const __m512i *data = (const __m512i *)array;
246 |   __m512i v1 = _mm512_setzero_si512();
247 |   __m512i v2 = _mm512_setzero_si512();
248 |   __m512i v4 = _mm512_setzero_si512();
249 |   __m512i twosA, twosB;
250 |   __m512i one = _mm512_set1_epi16(1);
251 |   __m512i counter[16];
252 | 
253 |   const size_t size = len / 32;
254 |   const uint64_t limit = size - size % 4;
255 | 
256 |   uint16_t buffer[32];
257 | 
258 |   uint64_t i = 0;
259 |   while (i < limit) {
260 |     for (size_t i = 0; i < 16; ++i)
261 |       counter[i] = _mm512_setzero_si512();
262 | 
263 |     size_t thislimit = limit;
264 |     if (thislimit - i >= (1 << 16))
265 |       thislimit = i + (1 << 16) - 1;
266 | 
267 |     for (/**/; i < thislimit; i += 4) {
268 | #define U(pos)                                                                 \
269 |   {                                                                            \
270 |     counter[pos] = _mm512_add_epi16(                                           \
271 |         counter[pos], _mm512_and_si512(v4, one));                              \
272 |     v4 = _mm512_srli_epi16(v4, 1);                                             \
273 |   }
274 |       pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0),
275 |                            _mm512_loadu_si512(data + i + 1));
276 |       pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2),
277 |                            _mm512_loadu_si512(data + i + 3));
278 |       U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13)
279 |       U(14) U(15) // Updates
280 |       pospopcnt_csa_avx512(&v4, &v2, twosA, twosB);
281 |     }
282 |     // Update the counters after the last iteration.
283 |     for (size_t i = 0; i < 16; ++i)
284 |       U(i)
285 | #undef U
286 | 
287 |     for (size_t i = 0; i < 16; ++i) {
288 |       _mm512_storeu_si512((__m512i *)buffer, counter[i]);
289 |       for (size_t z = 0; z < 32; z++) {
290 |         flags[i] += 4 * (flags_type)buffer[z];
291 |       }
292 |     }
293 |   }
294 | 
295 |   _mm512_storeu_si512((__m512i *)buffer, v1);
296 |   for (size_t i = 0; i < 32; i++) {
297 |     for (int j = 0; j < 16; j++) {
298 |       flags[j] += 1 * ((buffer[i] >> j) & 1);
299 |     }
300 |   }
301 | 
302 |   _mm512_storeu_si512((__m512i *)buffer, v2);
303 |   for (size_t i = 0; i < 32; i++) {
304 |     for (int j = 0; j < 16; j++) {
305 |       flags[j] += 2 * ((buffer[i] >> j) & 1);
306 |     }
307 |   }
308 | 
309 |   _mm512_storeu_si512((__m512i *)buffer, v4);
310 |   for (size_t i = 0; i < 32; i++) {
311 |     for (int j = 0; j < 16; j++) {
312 |       flags[j] += 4 * ((buffer[i] >> j) & 1);
313 |     }
314 |   }
315 | }
316 | 
317 | 
318 | 
319 | #endif // __AVX512BW__
320 | 
321 | #endif // POSPOPCNT_AVX512BW_H
322 | 


--------------------------------------------------------------------------------
/results.txt:
--------------------------------------------------------------------------------
  1 | $ ./script.sh 
  2 | n = 8388608 m = 1 
  3 | iterations = 100 
  4 | array size: 16.000 MB
  5 | nothing                                         instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000 
  6 | min:       65 cycles,       13 instructions,           1 branch mis.,        0 cache ref.,        0 cache mis.
  7 | avg:     75.6 cycles,     13.0 instructions,         1.0 branch mis.,      0.1 cache ref.,      0.1 cache mis.
  8 | 
  9 | == Trial 1 out of 3 
 10 | 
 11 | pospopcnt_u16_scalar                            alignments: 16 
 12 | instructions per cycle 3.75, cycles per 16-bit word:  17.321, instructions per 16-bit word 65.000 
 13 | min: 145300780 cycles, 545259670 instructions,         3 branch mis.,   342794 cache ref.,   227818 cache mis.
 14 | avg: 145437051.2 cycles, 545259670.8 instructions,           8.1 branch mis., 343725.7 cache ref., 240702.4 cache mis.
 15 |  0.367 GB/s 
 16 | estimated clock in range 3.114 GHz to 3.183 GHz
 17 | 
 18 | 
 19 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
 20 | instructions per cycle 0.44, cycles per 16-bit word:  0.510, instructions per 16-bit word 0.226 
 21 | min:  4275703 cycles,  1896353 instructions,          89 branch mis.,   456562 cache ref.,   210984 cache mis.
 22 | avg: 4435881.5 cycles, 1896353.4 instructions,     114.7 branch mis., 462022.3 cache ref., 224756.1 cache mis.
 23 |  12.242 GB/s 
 24 | estimated clock in range 2.995 GHz to 3.130 GHz
 25 | 
 26 | 
 27 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
 28 | instructions per cycle 0.59, cycles per 16-bit word:  0.553, instructions per 16-bit word 0.323 
 29 | min:  4636354 cycles,  2713472 instructions,          73 branch mis.,   405240 cache ref.,   212428 cache mis.
 30 | avg: 4740082.5 cycles, 2713472.5 instructions,      99.2 branch mis., 410034.9 cache ref., 221150.8 cache mis.
 31 |  11.334 GB/s 
 32 | estimated clock in range 2.960 GHz to 3.138 GHz
 33 | 
 34 | 
 35 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
 36 | instructions per cycle 0.80, cycles per 16-bit word:  0.650, instructions per 16-bit word 0.518 
 37 | min:  5455458 cycles,  4348093 instructions,          93 branch mis.,   369665 cache ref.,   214089 cache mis.
 38 | avg: 5685385.1 cycles, 4348093.8 instructions,     100.7 branch mis., 372731.5 cache ref., 227855.8 cache mis.
 39 |  9.658 GB/s 
 40 | estimated clock in range 3.016 GHz to 3.145 GHz
 41 | 
 42 | 
 43 | == Trial 2 out of 3 
 44 | 
 45 | pospopcnt_u16_scalar                            alignments: 16 
 46 | instructions per cycle 3.75, cycles per 16-bit word:  17.319, instructions per 16-bit word 65.000 
 47 | min: 145278186 cycles, 545259670 instructions,         3 branch mis.,   342538 cache ref.,   228564 cache mis.
 48 | avg: 145437966.3 cycles, 545259671.1 instructions,           9.4 branch mis., 343842.9 cache ref., 240921.6 cache mis.
 49 |  0.368 GB/s 
 50 | estimated clock in range 3.103 GHz to 3.186 GHz
 51 | 
 52 | 
 53 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
 54 | instructions per cycle 0.44, cycles per 16-bit word:  0.510, instructions per 16-bit word 0.226 
 55 | min:  4280028 cycles,  1896353 instructions,          93 branch mis.,   456611 cache ref.,   212101 cache mis.
 56 | avg: 4427292.3 cycles, 1896353.4 instructions,     118.8 branch mis., 461757.7 cache ref., 223975.2 cache mis.
 57 |  12.248 GB/s 
 58 | estimated clock in range 2.954 GHz to 3.125 GHz
 59 | 
 60 | 
 61 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
 62 | instructions per cycle 0.58, cycles per 16-bit word:  0.554, instructions per 16-bit word 0.323 
 63 | min:  4646521 cycles,  2713472 instructions,          69 branch mis.,   406160 cache ref.,   213902 cache mis.
 64 | avg: 4782781.6 cycles, 2713472.5 instructions,      99.2 branch mis., 412982.4 cache ref., 226466.0 cache mis.
 65 |  11.284 GB/s 
 66 | estimated clock in range 2.757 GHz to 3.134 GHz
 67 | 
 68 | 
 69 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
 70 | instructions per cycle 0.79, cycles per 16-bit word:  0.653, instructions per 16-bit word 0.518 
 71 | min:  5476340 cycles,  4348093 instructions,          89 branch mis.,   369385 cache ref.,   215783 cache mis.
 72 | avg: 5675215.1 cycles, 4348093.8 instructions,      99.7 branch mis., 372441.2 cache ref., 227343.4 cache mis.
 73 |  9.616 GB/s 
 74 | estimated clock in range 2.982 GHz to 3.141 GHz
 75 | 
 76 | 
 77 | == Trial 3 out of 3 
 78 | 
 79 | pospopcnt_u16_scalar                            alignments: 16 
 80 | instructions per cycle 3.75, cycles per 16-bit word:  17.323, instructions per 16-bit word 65.000 
 81 | min: 145315376 cycles, 545259670 instructions,         3 branch mis.,   342409 cache ref.,   225642 cache mis.
 82 | avg: 145436719.9 cycles, 545259670.8 instructions,           9.5 branch mis., 343947.5 cache ref., 238966.5 cache mis.
 83 |  0.367 GB/s 
 84 | estimated clock in range 3.106 GHz to 3.183 GHz
 85 | 
 86 | 
 87 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
 88 | instructions per cycle 0.44, cycles per 16-bit word:  0.508, instructions per 16-bit word 0.226 
 89 | min:  4264660 cycles,  1896353 instructions,          89 branch mis.,   456718 cache ref.,   211363 cache mis.
 90 | avg: 4409605.7 cycles, 1896353.5 instructions,     110.8 branch mis., 461534.5 cache ref., 222450.5 cache mis.
 91 |  12.274 GB/s 
 92 | estimated clock in range 2.983 GHz to 3.126 GHz
 93 | 
 94 | 
 95 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
 96 | instructions per cycle 0.58, cycles per 16-bit word:  0.555, instructions per 16-bit word 0.323 
 97 | min:  4654504 cycles,  2713472 instructions,          46 branch mis.,   405671 cache ref.,   213485 cache mis.
 98 | avg: 4764287.9 cycles, 2713472.5 instructions,      96.6 branch mis., 411845.7 cache ref., 224323.1 cache mis.
 99 |  11.267 GB/s 
100 | estimated clock in range 2.857 GHz to 3.134 GHz
101 | 
102 | 
103 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
104 | instructions per cycle 0.80, cycles per 16-bit word:  0.649, instructions per 16-bit word 0.518 
105 | min:  5441872 cycles,  4348093 instructions,          88 branch mis.,   369764 cache ref.,   217127 cache mis.
106 | avg: 5697221.4 cycles, 4348093.8 instructions,     100.9 branch mis., 372666.7 cache ref., 228013.5 cache mis.
107 |  9.679 GB/s 
108 | estimated clock in range 2.980 GHz to 3.139 GHz
109 | 
110 | n = 16777216 m = 1 
111 | iterations = 100 
112 | array size: 32.000 MB
113 | nothing                                         instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000 
114 | min:       64 cycles,       13 instructions,           1 branch mis.,        0 cache ref.,        0 cache mis.
115 | avg:     71.7 cycles,     13.0 instructions,         1.0 branch mis.,      0.3 cache ref.,      0.0 cache mis.
116 | 
117 | == Trial 1 out of 3 
118 | 
119 | pospopcnt_u16_scalar                            alignments: 16 
120 | instructions per cycle 3.75, cycles per 16-bit word:  17.327, instructions per 16-bit word 65.000 
121 | min: 290698808 cycles, 1090519236 instructions,                5 branch mis.,   686676 cache ref.,   500793 cache mis.
122 | avg: 290938868.1 cycles, 1090519236.9 instructions,         15.7 branch mis., 688354.2 cache ref., 510683.9 cache mis.
123 |  0.367 GB/s 
124 | estimated clock in range 3.129 GHz to 3.183 GHz
125 | 
126 | 
127 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
128 | instructions per cycle 0.50, cycles per 16-bit word:  0.451, instructions per 16-bit word 0.225 
129 | min:  7570267 cycles,  3778030 instructions,         172 branch mis.,   911913 cache ref.,   466824 cache mis.
130 | avg: 8082887.5 cycles, 3778030.8 instructions,     208.4 branch mis., 918840.2 cache ref., 478513.7 cache mis.
131 |  13.966 GB/s 
132 | estimated clock in range 2.991 GHz to 3.156 GHz
133 | 
134 | 
135 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
136 | instructions per cycle 0.67, cycles per 16-bit word:  0.480, instructions per 16-bit word 0.323 
137 | min:  8059538 cycles,  5412301 instructions,         117 branch mis.,   832885 cache ref.,   472215 cache mis.
138 | avg: 8571650.0 cycles, 5412301.8 instructions,     194.3 branch mis., 844191.3 cache ref., 483240.1 cache mis.
139 |  13.091 GB/s 
140 | estimated clock in range 3.005 GHz to 3.155 GHz
141 | 
142 | 
143 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
144 | instructions per cycle 0.92, cycles per 16-bit word:  0.565, instructions per 16-bit word 0.518 
145 | min:  9471909 cycles,  8685323 instructions,         179 branch mis.,   750308 cache ref.,   476978 cache mis.
146 | avg: 9987220.5 cycles, 8685323.0 instructions,     193.4 branch mis., 753474.4 cache ref., 491033.0 cache mis.
147 |  11.186 GB/s 
148 | estimated clock in range 2.987 GHz to 3.161 GHz
149 | 
150 | 
151 | == Trial 2 out of 3 
152 | 
153 | pospopcnt_u16_scalar                            alignments: 16 
154 | instructions per cycle 3.75, cycles per 16-bit word:  17.329, instructions per 16-bit word 65.000 
155 | min: 290733066 cycles, 1090519236 instructions,                4 branch mis.,   686103 cache ref.,   499278 cache mis.
156 | avg: 290946935.9 cycles, 1090519236.8 instructions,         13.0 branch mis., 687911.4 cache ref., 509925.4 cache mis.
157 |  0.368 GB/s 
158 | estimated clock in range 3.127 GHz to 3.185 GHz
159 | 
160 | 
161 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
162 | instructions per cycle 0.50, cycles per 16-bit word:  0.453, instructions per 16-bit word 0.225 
163 | min:  7600302 cycles,  3778030 instructions,         180 branch mis.,   912316 cache ref.,   465633 cache mis.
164 | avg: 8088476.5 cycles, 3778030.7 instructions,     202.9 branch mis., 918700.2 cache ref., 477476.4 cache mis.
165 |  13.892 GB/s 
166 | estimated clock in range 3.018 GHz to 3.155 GHz
167 | 
168 | 
169 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
170 | instructions per cycle 0.72, cycles per 16-bit word:  0.449, instructions per 16-bit word 0.323 
171 | min:  7531365 cycles,  5412301 instructions,         155 branch mis.,   827062 cache ref.,   467937 cache mis.
172 | avg: 8587747.1 cycles, 5412301.8 instructions,     194.7 branch mis., 842218.7 cache ref., 480786.8 cache mis.
173 |  13.457 GB/s 
174 | estimated clock in range 3.021 GHz to 3.157 GHz
175 | 
176 | 
177 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
178 | instructions per cycle 0.92, cycles per 16-bit word:  0.561, instructions per 16-bit word 0.518 
179 | min:  9403750 cycles,  8685323 instructions,         177 branch mis.,   749685 cache ref.,   476669 cache mis.
180 | avg: 10080203.3 cycles, 8685323.0 instructions,            190.7 branch mis., 752866.7 cache ref., 489090.3 cache mis.
181 |  11.265 GB/s 
182 | estimated clock in range 3.103 GHz to 3.160 GHz
183 | 
184 | 
185 | == Trial 3 out of 3 
186 | 
187 | pospopcnt_u16_scalar                            alignments: 16 
188 | instructions per cycle 3.75, cycles per 16-bit word:  17.329, instructions per 16-bit word 65.000 
189 | min: 290740327 cycles, 1090519236 instructions,                3 branch mis.,   684978 cache ref.,   496434 cache mis.
190 | avg: 290947846.2 cycles, 1090519236.7 instructions,         12.6 branch mis., 687683.9 cache ref., 509124.1 cache mis.
191 |  0.367 GB/s 
192 | estimated clock in range 3.142 GHz to 3.184 GHz
193 | 
194 | 
195 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
196 | instructions per cycle 0.50, cycles per 16-bit word:  0.452, instructions per 16-bit word 0.225 
197 | min:  7577793 cycles,  3778030 instructions,         175 branch mis.,   911261 cache ref.,   466200 cache mis.
198 | avg: 8120387.0 cycles, 3778030.7 instructions,     215.3 branch mis., 919429.1 cache ref., 480036.4 cache mis.
199 |  13.928 GB/s 
200 | estimated clock in range 3.003 GHz to 3.153 GHz
201 | 
202 | 
203 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
204 | instructions per cycle 0.67, cycles per 16-bit word:  0.481, instructions per 16-bit word 0.323 
205 | min:  8068715 cycles,  5412301 instructions,         160 branch mis.,   835367 cache ref.,   469650 cache mis.
206 | avg: 8566613.3 cycles, 5412301.9 instructions,     197.6 branch mis., 843327.4 cache ref., 482458.9 cache mis.
207 |  13.104 GB/s 
208 | estimated clock in range 3.062 GHz to 3.156 GHz
209 | 
210 | 
211 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
212 | instructions per cycle 0.92, cycles per 16-bit word:  0.565, instructions per 16-bit word 0.518 
213 | min:  9487169 cycles,  8685323 instructions,         175 branch mis.,   750225 cache ref.,   477681 cache mis.
214 | avg: 10032864.4 cycles, 8685323.1 instructions,            192.5 branch mis., 753271.8 cache ref., 490210.8 cache mis.
215 |  11.161 GB/s 
216 | estimated clock in range 2.993 GHz to 3.159 GHz
217 | 
218 | n = 33554432 m = 1 
219 | iterations = 100 
220 | array size: 64.000 MB
221 | nothing                                         instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000 
222 | min:       65 cycles,       13 instructions,           1 branch mis.,        0 cache ref.,        0 cache mis.
223 | avg:     70.7 cycles,     13.0 instructions,         1.0 branch mis.,      0.4 cache ref.,      0.0 cache mis.
224 | 
225 | == Trial 1 out of 3 
226 | 
227 | pospopcnt_u16_scalar                            alignments: 16 
228 | instructions per cycle 3.75, cycles per 16-bit word:  17.329, instructions per 16-bit word 65.000 
229 | min: 581477149 cycles, 2181038367 instructions,                3 branch mis.,  1371669 cache ref.,  1033291 cache mis.
230 | avg: 581933574.2 cycles, 2181038369.1 instructions,         19.5 branch mis., 1375421.4 cache ref., 1042079.4 cache mis.
231 |  0.367 GB/s 
232 | estimated clock in range 3.142 GHz to 3.181 GHz
233 | 
234 | 
235 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
236 | instructions per cycle 0.55, cycles per 16-bit word:  0.407, instructions per 16-bit word 0.225 
237 | min: 13653935 cycles,  7541384 instructions,         134 branch mis.,  1816669 cache ref.,   969367 cache mis.
238 | avg: 14379394.1 cycles, 7541384.5 instructions,            348.5 branch mis., 1825994.9 cache ref., 980377.7 cache mis.
239 |  15.407 GB/s 
240 | estimated clock in range 3.037 GHz to 3.167 GHz
241 | 
242 | 
243 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
244 | instructions per cycle 0.73, cycles per 16-bit word:  0.439, instructions per 16-bit word 0.322 
245 | min: 14742562 cycles, 10809959 instructions,         138 branch mis.,  1695219 cache ref.,   976523 cache mis.
246 | avg: 15366151.7 cycles, 10809960.0 instructions,           358.6 branch mis., 1702361.3 cache ref., 988884.0 cache mis.
247 |  14.399 GB/s 
248 | estimated clock in range 3.069 GHz to 3.167 GHz
249 | 
250 | 
251 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
252 | instructions per cycle 1.00, cycles per 16-bit word:  0.517, instructions per 16-bit word 0.517 
253 | min: 17349696 cycles, 17359781 instructions,         340 branch mis.,  1507483 cache ref.,   994974 cache mis.
254 | avg: 17986706.1 cycles, 17359781.6 instructions,           366.0 branch mis., 1512351.4 cache ref., 1007333.2 cache mis.
255 |  12.251 GB/s 
256 | estimated clock in range 3.099 GHz to 3.171 GHz
257 | 
258 | 
259 | == Trial 2 out of 3 
260 | 
261 | pospopcnt_u16_scalar                            alignments: 16 
262 | instructions per cycle 3.75, cycles per 16-bit word:  17.331, instructions per 16-bit word 65.000 
263 | min: 581521999 cycles, 2181038367 instructions,                3 branch mis.,  1372754 cache ref.,  1031789 cache mis.
264 | avg: 581909568.5 cycles, 2181038370.3 instructions,          5.0 branch mis., 1376427.2 cache ref., 1040231.2 cache mis.
265 |  0.367 GB/s 
266 | estimated clock in range 3.154 GHz to 3.182 GHz
267 | 
268 | 
269 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
270 | instructions per cycle 0.55, cycles per 16-bit word:  0.407, instructions per 16-bit word 0.225 
271 | min: 13656404 cycles,  7541384 instructions,         339 branch mis.,  1811137 cache ref.,   973011 cache mis.
272 | avg: 14385482.8 cycles, 7541384.5 instructions,            382.7 branch mis., 1825741.2 cache ref., 982132.8 cache mis.
273 |  15.450 GB/s 
274 | estimated clock in range 2.982 GHz to 3.167 GHz
275 | 
276 | 
277 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
278 | instructions per cycle 0.73, cycles per 16-bit word:  0.439, instructions per 16-bit word 0.322 
279 | min: 14740976 cycles, 10809959 instructions,         135 branch mis.,  1690517 cache ref.,   982609 cache mis.
280 | avg: 15344071.8 cycles, 10809959.9 instructions,           354.4 branch mis., 1703487.1 cache ref., 990562.1 cache mis.
281 |  14.362 GB/s 
282 | estimated clock in range 3.025 GHz to 3.167 GHz
283 | 
284 | 
285 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
286 | instructions per cycle 1.00, cycles per 16-bit word:  0.518, instructions per 16-bit word 0.517 
287 | min: 17396176 cycles, 17359781 instructions,         341 branch mis.,  1508967 cache ref.,   997769 cache mis.
288 | avg: 17973735.3 cycles, 17359781.4 instructions,           367.4 branch mis., 1513286.9 cache ref., 1008995.7 cache mis.
289 |  12.215 GB/s 
290 | estimated clock in range 3.034 GHz to 3.170 GHz
291 | 
292 | 
293 | == Trial 3 out of 3 
294 | 
295 | pospopcnt_u16_scalar                            alignments: 16 
296 | instructions per cycle 3.75, cycles per 16-bit word:  17.331, instructions per 16-bit word 65.000 
297 | min: 581529449 cycles, 2181038367 instructions,                3 branch mis.,  1374422 cache ref.,  1033591 cache mis.
298 | avg: 581922860.8 cycles, 2181038371.0 instructions,          4.9 branch mis., 1377283.2 cache ref., 1041910.3 cache mis.
299 |  0.367 GB/s 
300 | estimated clock in range 3.140 GHz to 3.181 GHz
301 | 
302 | 
303 | pospopcnt_u16_avx512bw_harvey_seal_1KB          alignments: 16 
304 | instructions per cycle 0.56, cycles per 16-bit word:  0.404, instructions per 16-bit word 0.225 
305 | min: 13543586 cycles,  7541384 instructions,         352 branch mis.,  1817110 cache ref.,   969972 cache mis.
306 | avg: 14392309.3 cycles, 7541384.4 instructions,            390.9 branch mis., 1825960.7 cache ref., 980936.2 cache mis.
307 |  15.453 GB/s 
308 | estimated clock in range 3.000 GHz to 3.165 GHz
309 | 
310 | 
311 | pospopcnt_u16_avx512bw_harvey_seal_512B         alignments: 16 
312 | instructions per cycle 0.73, cycles per 16-bit word:  0.441, instructions per 16-bit word 0.322 
313 | min: 14789211 cycles, 10809959 instructions,         195 branch mis.,  1695835 cache ref.,   977227 cache mis.
314 | avg: 15385457.1 cycles, 10809959.7 instructions,           352.7 branch mis., 1702342.3 cache ref., 988574.6 cache mis.
315 |  14.313 GB/s 
316 | estimated clock in range 3.023 GHz to 3.166 GHz
317 | 
318 | 
319 | pospopcnt_u16_avx512bw_harvey_seal_256B         alignments: 16 
320 | instructions per cycle 1.00, cycles per 16-bit word:  0.518, instructions per 16-bit word 0.517 
321 | min: 17366428 cycles, 17359781 instructions,         342 branch mis.,  1508192 cache ref.,   997052 cache mis.
322 | avg: 17946271.6 cycles, 17359781.7 instructions,           368.0 branch mis., 1512860.9 cache ref., 1007786.9 cache mis.
323 |  12.235 GB/s 
324 | estimated clock in range 3.060 GHz to 3.169 GHz
325 | 


--------------------------------------------------------------------------------
/script.sh:
--------------------------------------------------------------------------------
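1 | #!/usr/bin/env bash
2 | # Sweep input sizes: for each i in 1..26, run the benchmark with
3 | # n = 2^i and then n = 3 * 2^(i-1). Brace expansion and ** require bash.
4 | for i in {1..26}
5 | do
6 | 	./instrumented_benchmark -c -v -n $((2**i))
7 | 	./instrumented_benchmark -c -v -n $((2**(i-1)*3))
8 | done
9 | 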


--------------------------------------------------------------------------------
/smallresults.txt:
--------------------------------------------------------------------------------
  1 | n = 1048576 m = 1 
  2 | iterations = 100 
  3 | array size: 2.000 MB
  4 | nothing                                 	instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000 
  5 | min:       65 cycles,       13 instructions, 	       1 branch mis.,        0 cache ref.,        0 cache mis.
  6 | avg:     78.2 cycles,     13.0 instructions, 	     1.0 branch mis.,      0.1 cache ref.,      0.1 cache mis.
  7 | 
  8 | == Trial 1 out of 3 
  9 | 
 10 | pospopcnt_u16_scalar                    	alignments: 16 
 11 | instructions per cycle 3.76, cycles per 16-bit word:  17.282, instructions per 16-bit word 65.000 
 12 | min: 18121980 cycles, 68157535 instructions, 	       2 branch mis.,    42216 cache ref.,        0 cache mis.
 13 | avg: 18193433.7 cycles, 68157535.8 instructions, 	     3.2 branch mis.,  42874.5 cache ref.,   2131.4 cache mis.
 14 |  0.368 GB/s 
 15 | estimated clock in range 3.014 GHz to 3.179 GHz
 16 | 
 17 | 
 18 | pospopcnt_u16_avx512bw_harvey_seal_1KB  	alignments: 16 
 19 | instructions per cycle 1.89, cycles per 16-bit word:  0.127, instructions per 16-bit word 0.240 
 20 | min:   133206 cycles,   251608 instructions, 	      26 branch mis.,    41303 cache ref.,        0 cache mis.
 21 | avg: 136291.4 cycles, 251608.0 instructions, 	    53.7 branch mis.,  41590.3 cache ref.,     37.4 cache mis.
 22 |  35.961 GB/s 
 23 | estimated clock in range 1.248 GHz to 2.318 GHz
 24 | 
 25 | 
 26 | pospopcnt_u16_avx512bw_harvey_seal_512B 	alignments: 16 
 27 | instructions per cycle 2.49, cycles per 16-bit word:  0.135, instructions per 16-bit word 0.337 
 28 | min:   141993 cycles,   353463 instructions, 	      25 branch mis.,    42463 cache ref.,        0 cache mis.
 29 | avg: 149097.1 cycles, 353463.1 instructions, 	    29.5 branch mis.,  42750.8 cache ref.,     57.8 cache mis.
 30 |  34.328 GB/s 
 31 | estimated clock in range 1.748 GHz to 2.526 GHz
 32 | 
 33 | 
 34 | pospopcnt_u16_avx512bw_harvey_seal_256B 	alignments: 16 
 35 | instructions per cycle 2.41, cycles per 16-bit word:  0.220, instructions per 16-bit word 0.529 
 36 | min:   230490 cycles,   554484 instructions, 	      26 branch mis.,    42979 cache ref.,        0 cache mis.
 37 | avg: 232864.8 cycles, 554484.1 instructions, 	    32.9 branch mis.,  43112.4 cache ref.,     25.7 cache mis.
 38 |  23.554 GB/s 
 39 | estimated clock in range 1.833 GHz to 2.612 GHz
 40 | 
 41 | 
 42 | == Trial 2 out of 3 
 43 | 
 44 | pospopcnt_u16_scalar                    	alignments: 16 
 45 | instructions per cycle 3.76, cycles per 16-bit word:  17.284, instructions per 16-bit word 65.000 
 46 | min: 18123412 cycles, 68157535 instructions, 	       2 branch mis.,    42640 cache ref.,        0 cache mis.
 47 | avg: 18176547.9 cycles, 68157535.8 instructions, 	     3.0 branch mis.,  42931.6 cache ref.,   2825.7 cache mis.
 48 |  0.368 GB/s 
 49 | estimated clock in range 3.023 GHz to 3.179 GHz
 50 | 
 51 | 
 52 | pospopcnt_u16_avx512bw_harvey_seal_1KB  	alignments: 16 
 53 | instructions per cycle 1.89, cycles per 16-bit word:  0.127, instructions per 16-bit word 0.240 
 54 | min:   133219 cycles,   251608 instructions, 	      26 branch mis.,    41361 cache ref.,        0 cache mis.
 55 | avg: 136222.5 cycles, 251608.0 instructions, 	    47.7 branch mis.,  41622.8 cache ref.,     34.3 cache mis.
 56 |  35.817 GB/s 
 57 | estimated clock in range 1.532 GHz to 2.287 GHz
 58 | 
 59 | 
 60 | pospopcnt_u16_avx512bw_harvey_seal_512B 	alignments: 16 
 61 | instructions per cycle 2.49, cycles per 16-bit word:  0.135, instructions per 16-bit word 0.337 
 62 | min:   141952 cycles,   353463 instructions, 	      28 branch mis.,    42541 cache ref.,        0 cache mis.
 63 | avg: 145220.6 cycles, 353463.0 instructions, 	    32.3 branch mis.,  42705.6 cache ref.,     42.5 cache mis.
 64 |  34.098 GB/s 
 65 | estimated clock in range 1.477 GHz to 2.455 GHz
 66 | 
 67 | 
 68 | pospopcnt_u16_avx512bw_harvey_seal_256B 	alignments: 16 
 69 | instructions per cycle 2.40, cycles per 16-bit word:  0.220, instructions per 16-bit word 0.529 
 70 | min:   230683 cycles,   554484 instructions, 	      24 branch mis.,    42957 cache ref.,        0 cache mis.
 71 | avg: 232969.5 cycles, 554484.1 instructions, 	    29.4 branch mis.,  43117.6 cache ref.,     20.7 cache mis.
 72 |  23.439 GB/s 
 73 | estimated clock in range 1.700 GHz to 2.616 GHz
 74 | 
 75 | 
 76 | == Trial 3 out of 3 
 77 | 
 78 | pospopcnt_u16_scalar                    	alignments: 16 
 79 | instructions per cycle 3.76, cycles per 16-bit word:  17.300, instructions per 16-bit word 65.000 
 80 | min: 18140522 cycles, 68157535 instructions, 	       2 branch mis.,    42586 cache ref.,        0 cache mis.
 81 | avg: 18176543.7 cycles, 68157535.8 instructions, 	     3.0 branch mis.,  42930.7 cache ref.,   1025.8 cache mis.
 82 |  0.367 GB/s 
 83 | estimated clock in range 3.041 GHz to 3.179 GHz
 84 | 
 85 | 
 86 | pospopcnt_u16_avx512bw_harvey_seal_1KB  	alignments: 16 
 87 | instructions per cycle 1.89, cycles per 16-bit word:  0.127, instructions per 16-bit word 0.240 
 88 | min:   132920 cycles,   251608 instructions, 	      25 branch mis.,    41335 cache ref.,        0 cache mis.
 89 | avg: 136227.6 cycles, 251608.0 instructions, 	    48.2 branch mis.,  41611.7 cache ref.,     30.9 cache mis.
 90 |  35.520 GB/s 
 91 | estimated clock in range 1.256 GHz to 2.343 GHz
 92 | 
 93 | 
 94 | pospopcnt_u16_avx512bw_harvey_seal_512B 	alignments: 16 
 95 | instructions per cycle 2.49, cycles per 16-bit word:  0.136, instructions per 16-bit word 0.337 
 96 | min:   142113 cycles,   353463 instructions, 	      19 branch mis.,    42518 cache ref.,        0 cache mis.
 97 | avg: 144527.0 cycles, 353463.1 instructions, 	    30.8 branch mis.,  42726.6 cache ref.,     27.5 cache mis.
 98 |  33.835 GB/s 
 99 | estimated clock in range 1.605 GHz to 2.381 GHz
100 | 
101 | 
102 | pospopcnt_u16_avx512bw_harvey_seal_256B 	alignments: 16 
103 | instructions per cycle 2.40, cycles per 16-bit word:  0.220, instructions per 16-bit word 0.529 
104 | min:   230802 cycles,   554484 instructions, 	      23 branch mis.,    42936 cache ref.,        0 cache mis.
105 | avg: 236280.8 cycles, 554484.1 instructions, 	    30.5 branch mis.,  43137.8 cache ref.,     59.7 cache mis.
106 |  23.320 GB/s 
107 | estimated clock in range 1.746 GHz to 2.625 GHz
108 | 
109 | 


--------------------------------------------------------------------------------
/verylargeresults.txt:
--------------------------------------------------------------------------------
 1 | $ ./instrumented_benchmark -v -n $((2**29))
 2 | n = 536870912 m = 1
 3 | iterations = 100
 4 | array size: 1024.000 MB
 5 | nothing                                 	instructions per cycle 0.20, cycles per 16-bit word:  0.000, instructions per 16-bit word 0.000
 6 | min:       65 cycles,       13 instructions, 	       1 branch mis.,        0 cache ref.,        0 cache mis.
 7 | avg:     67.5 cycles,     13.0 instructions, 	     1.0 branch mis.,      0.1 cache ref.,      0.1 cache mis.
 8 | 
 9 | == Trial 1 out of 3
10 | 
11 | pospopcnt_u16_scalar                    	alignments: 16
12 | instructions per cycle 3.75, cycles per 16-bit word:  17.334, instructions per 16-bit word 65.000
13 | min: 9306366083 cycles, 34896612300 instructions, 	       6 branch mis., 21927294 cache ref., 16810210 cache mis.
14 | avg: 9310655628.7 cycles, 34896612310.5 instructions, 	    19.2 branch mis., 22002041.1 cache ref., 16810510.8 cache mis.
15 |  0.367 GB/s
16 | estimated clock in range 3.166 GHz to 3.179 GHz
17 | 
18 | 
19 | pospopcnt_u16_avx512bw_harvey_seal_1KB  	alignments: 16
20 | instructions per cycle 0.62, cycles per 16-bit word:  0.361, instructions per 16-bit word 0.224
21 | min: 193647030 cycles, 120441986 instructions, 	    2400 branch mis., 28808476 cache ref., 15793002 cache mis.
22 | avg: 194467110.1 cycles, 120441986.5 instructions, 	  4906.4 branch mis., 28951104.1 cache ref., 15828987.5 cache mis.
23 |  17.629 GB/s
24 | estimated clock in range 3.107 GHz to 3.182 GHz
25 | 
26 | 
27 | pospopcnt_u16_avx512bw_harvey_seal_512B 	alignments: 16
28 | instructions per cycle 0.82, cycles per 16-bit word:  0.391, instructions per 16-bit word 0.322
29 | min: 209898674 cycles, 172739686 instructions, 	    3081 branch mis., 27373286 cache ref., 15922807 cache mis.
30 | avg: 210382318.2 cycles, 172739686.4 instructions, 	  5058.5 branch mis., 27406810.1 cache ref., 15934282.6 cache mis.
31 |  16.270 GB/s
32 | estimated clock in range 3.127 GHz to 3.184 GHz
33 | 
34 | 
35 | pospopcnt_u16_avx512bw_harvey_seal_256B 	alignments: 16
36 | instructions per cycle 1.11, cycles per 16-bit word:  0.467, instructions per 16-bit word 0.517
37 | min: 250858524 cycles, 277593519 instructions, 	    4980 branch mis., 24257628 cache ref., 16300967 cache mis.
38 | avg: 251285505.1 cycles, 277593520.3 instructions, 	  5401.8 branch mis., 24285567.7 cache ref., 16315059.9 cache mis.
39 |  13.622 GB/s
40 | estimated clock in range 3.128 GHz to 3.184 GHz
41 | 
42 | 
43 | == Trial 2 out of 3
44 | 
45 | pospopcnt_u16_scalar                    	alignments: 16
46 | instructions per cycle 3.75, cycles per 16-bit word:  17.337, instructions per 16-bit word 65.000
47 | min: 9307710055 cycles, 34896612299 instructions, 	       7 branch mis., 21966489 cache ref., 16810382 cache mis.
48 | avg: 9311309436.1 cycles, 34896612309.0 instructions, 	    19.6 branch mis., 22003978.2 cache ref., 16810540.1 cache mis.
49 |  0.367 GB/s
50 | estimated clock in range 3.167 GHz to 3.179 GHz
51 | 
52 | 
53 | pospopcnt_u16_avx512bw_harvey_seal_1KB  	alignments: 16
54 | instructions per cycle 0.62, cycles per 16-bit word:  0.361, instructions per 16-bit word 0.224
55 | min: 193904414 cycles, 120441986 instructions, 	    3513 branch mis., 28916800 cache ref., 15819247 cache mis.
56 | avg: 194369533.6 cycles, 120441986.4 instructions, 	  5209.1 branch mis., 28952621.8 cache ref., 15829349.3 cache mis.
57 |  17.612 GB/s
58 | estimated clock in range 3.120 GHz to 3.185 GHz
59 | 
60 | 
61 | pospopcnt_u16_avx512bw_harvey_seal_512B 	alignments: 16
62 | instructions per cycle 0.82, cycles per 16-bit word:  0.391, instructions per 16-bit word 0.322
63 | min: 209768033 cycles, 172739686 instructions, 	    5176 branch mis., 27390175 cache ref., 15923841 cache mis.
64 | avg: 210364100.0 cycles, 172739686.4 instructions, 	  5619.1 branch mis., 27407305.2 cache ref., 15933567.4 cache mis.
65 |  16.280 GB/s
66 | estimated clock in range 3.124 GHz to 3.185 GHz
67 | 
68 | 
69 | pospopcnt_u16_avx512bw_harvey_seal_256B 	alignments: 16
70 | instructions per cycle 1.11, cycles per 16-bit word:  0.467, instructions per 16-bit word 0.517
71 | min: 250929542 cycles, 277593519 instructions, 	    4854 branch mis., 24167020 cache ref., 16304010 cache mis.
72 | avg: 251337691.8 cycles, 277593520.6 instructions, 	  5129.5 branch mis., 24283613.6 cache ref., 16315664.5 cache mis.
73 |  13.628 GB/s
74 | estimated clock in range 3.125 GHz to 3.185 GHz
75 | 


--------------------------------------------------------------------------------