├── AUTHORS
├── LICENSE
├── Makefile
├── README.md
├── asm
│   ├── countavx2_32.S
│   ├── countavx2_64.S
│   ├── countavx512_32.S
│   ├── countavx512_64.S
│   ├── countneon_32.S
│   ├── countneon_64.S
│   ├── kernelavx2.S
│   ├── kernelavx512.S
│   └── kernelneon.S
├── benchmark
│   ├── golike
│   │   └── benchmark.c
│   └── linux
│       ├── aligned_alloc.h
│       ├── instrumented_benchmark.cpp
│       ├── linux-perf-events.h
│       ├── memalloc.h
│       └── stream_benchmark.cpp
├── goscript.sh
├── include
│   ├── pospopcnt.h
│   └── pospopcnt_avx512bw.h
├── results.txt
├── script.sh
├── smallresults.txt
└── verylargeresults.txt

/AUTHORS:
--------------------------------------------------------------------------------
1 | Marcus D. R. Klarqvist
2 | Wojciech Muła
3 | Daniel Lemire
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. 
If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. 
You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. 
(Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright 2019, Marcus D. R. Klarqvist, Wojciech Muła and Daniel Lemire 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | OPTFLAGS := -O3 -march=native
2 | CFLAGS = -std=c99 -DFLAGSIZE=$(FLAGSIZE) $(OPTFLAGS) $(DEBUG_FLAGS)
3 | FLAGSIZE = 32
4 | #FLAGSIZE = 64
5 | CPPFLAGS = -std=c++0x -DFLAGSIZE=$(FLAGSIZE) $(OPTFLAGS) $(DEBUG_FLAGS)
6 | ARCH != uname -m
7 |
8 | # Default target
9 | all: instrumented_benchmark golike_benchmark
10 |
11 | # Generic rules
12 | itest: instrumented_benchmark
13 | 	$(CXX) --version
14 | 	./instrumented_benchmark
15 |
16 | DEPS=benchmark/linux/linux-perf-events.h \
17 | 	include/pospopcnt_avx512bw.h \
18 | 	include/pospopcnt.h \
19 |
20 | KERNELS= $(KERNELS_$(ARCH))
21 | KERNELS_amd64= $(KERNELS_x86_64)
22 | KERNELS_x86_64= asm/countavx512_$(FLAGSIZE).S \
23 | 	asm/countavx2_$(FLAGSIZE).S
24 | KERNELS_arm64= $(KERNELS_aarch64)
25 | KERNELS_aarch64= asm/countneon_$(FLAGSIZE).S
26 |
27 | stream_benchmark: $(DEPS) benchmark/linux/stream_benchmark.cpp
28 | 	$(CXX) $(CPPFLAGS) benchmark/linux/stream_benchmark.cpp -Iinclude -Ibenchmark/linux -o $@
29 |
30 | instrumented_benchmark: $(DEPS) benchmark/linux/instrumented_benchmark.cpp
31 | 	$(CXX) $(CPPFLAGS) benchmark/linux/instrumented_benchmark.cpp $(KERNELS) -Iinclude -Ibenchmark/linux -o $@
32 |
33 | instrumented_benchmark_align64: $(DEPS) benchmark/linux/instrumented_benchmark.cpp
34 | 	$(CXX) $(CPPFLAGS) benchmark/linux/instrumented_benchmark.cpp $(KERNELS) -DALIGN -Iinclude -Ibenchmark/linux -o $@
35 |
36 | golike_benchmark: $(DEPS) benchmark/golike/benchmark.c
37 | 	$(CC) $(CFLAGS) -Iinclude -o $@ benchmark/golike/benchmark.c $(KERNELS)
38 |
39 | golike_benchmark_align64: $(DEPS) benchmark/golike/benchmark.c
40 | 	$(CC) $(CFLAGS) -Iinclude -DALIGN -o $@ benchmark/golike/benchmark.c $(KERNELS)
41 |
42 | clean:
43 | 	rm -f bench example stream_benchmark instrumented_benchmark instrumented_benchmark_align64 golike_benchmark golike_benchmark_align64
44 |
45 | .PHONY: all clean itest
46 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | ## Position population-count benchmarks
3 |
4 | ### requirements
5 |
6 | - Linux, with bare-metal access (you may need root)
7 | - Make sure your processor is in performance mode (not powersaving)
8 | - x64 processor supporting the AVX512BW extension
9 |
10 | ### instructions
11 |
12 | ```
13 | make
14 | ./instrumented_benchmark -v
15 | ```
16 | ### Sample output
17 |
18 | On a Cannon Lake processor:
19 |
20 | ```
21 | $ ./instrumented_benchmark -v
22 | n = 10000000 m = 1
23 | iterations = 100
24 | array size: 19.073 MB
25 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000
26 | min: 65 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis.
27 | avg: 69.4 cycles, 13.0 instructions, 1.0 branch mis., 0.1 cache ref., 0.1 cache mis.
28 |
29 |
30 | pospopcnt_u16_scalar alignments: 16
31 | instructions per cycle 3.75, cycles per 16-bit word: 17.325, instructions per 16-bit word 65.000
32 | min: 173251245 cycles, 650000159 instructions, 3 branch mis., 407959 cache ref., 283659 cache mis.
33 | avg: 173473725.4 cycles, 650000160.2 instructions, 8.4 branch mis., 409996.0 cache ref., 295584.1 cache mis.
34 | 0.367 GB/s
35 | estimated clock in range 3.102 GHz to 3.183 GHz
36 |
37 |
38 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16
39 | instructions per cycle 0.46, cycles per 16-bit word: 0.497, instructions per 16-bit word 0.227
40 | min: 4966068 cycles, 2271648 instructions, 114 branch mis., 547590 cache ref., 262188 cache mis.
41 | avg: 5296987.3 cycles, 2271648.7 instructions, 137.9 branch mis., 552796.2 cache ref., 275536.8 cache mis.
42 | 12.407 GB/s
43 | estimated clock in range 2.916 GHz to 3.140 GHz
44 |
45 |
46 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16
47 | instructions per cycle 0.60, cycles per 16-bit word: 0.538, instructions per 16-bit word 0.325
48 | min: 5382323 cycles, 3245605 instructions, 82 branch mis., 487390 cache ref., 268386 cache mis.
49 | avg: 5697539.5 cycles, 3245605.8 instructions, 125.8 branch mis., 498338.1 cache ref., 279114.2 cache mis.
50 | 11.658 GB/s
51 | estimated clock in range 2.985 GHz to 3.142 GHz
52 |
53 |
54 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16
55 | instructions per cycle 0.84, cycles per 16-bit word: 0.618, instructions per 16-bit word 0.518
56 | min: 6184692 cycles, 5181931 instructions, 114 branch mis., 441161 cache ref., 267157 cache mis.
57 | avg: 6661233.2 cycles, 5181931.2 instructions, 124.0 branch mis., 446684.9 cache ref., 280685.8 cache mis.
58 | 10.162 GB/s
59 | estimated clock in range 2.956 GHz to 3.148 GHz
60 | ```
61 |
62 | ### Reference
63 |
64 | - Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire, [Efficient computation of positional population counts using SIMD instructions](https://arxiv.org/abs/1911.02696), Concurrency and Computation: Practice and Experience 33 (17), 2021.
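
For readers new to the problem: a positional population count tallies, for each of the 16 bit positions, how many input words have that bit set. The sketch below is a minimal scalar illustration in C and is not the repository's `pospopcnt_u16_scalar`. The function name `pospopcnt_u16_reference` and its signature are hypothetical, and the 32-bit counters assume the Makefile default of `FLAGSIZE = 32`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper, for illustration only; it is not taken
 * from include/pospopcnt.h. counts[b] is incremented once for every input
 * word whose bit b is set. The caller is expected to zero counts first. */
void pospopcnt_u16_reference(const uint16_t *data, size_t n, uint32_t counts[16])
{
    assert(data != NULL || n == 0);
    assert(counts != NULL);
    for (size_t i = 0; i < n; i++) {
        for (unsigned bit = 0; bit < 16; bit++) {
            /* Isolate one bit of the word and add it to that bit's counter. */
            counts[bit] += (data[i] >> bit) & 1u;
        }
    }
}
```

Called on the three words `0x0001`, `0x0003`, and `0x8001` with zeroed counters, this leaves `counts[0] == 3`, `counts[1] == 1`, `counts[15] == 1`, and every other counter at 0. A loop of this shape is mainly useful as a correctness baseline when comparing against the SIMD kernels.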
65 | 66 | -------------------------------------------------------------------------------- /asm/countavx2_32.S: -------------------------------------------------------------------------------- 1 | #include "kernelavx2.S" 2 | 3 | # not yet ported to 32 bit flag size 4 | .balign 16 5 | accum8: 6 | .type accum8, @function 7 | vpunpcklwd ymm12, ymm8, ymm7 8 | vpunpckhwd ymm8, ymm8, ymm7 9 | vpunpcklwd ymm14, ymm9, ymm7 10 | vpunpckhwd ymm9, ymm9, ymm7 11 | vpunpcklwd ymm4, ymm10, ymm7 12 | vpunpckhwd ymm10, ymm10, ymm7 13 | vpunpcklwd ymm5, ymm11, ymm7 14 | vpunpckhwd ymm11, ymm11, ymm7 15 | vpaddd ymm12, ymm4, ymm12 16 | vpaddd ymm8, ymm10, ymm8 17 | vpaddd ymm14, ymm5, ymm14 18 | vpaddd ymm9, ymm11, ymm9 19 | vpaddd ymm12, ymm12, ymm14 20 | vpaddd ymm8, ymm8, ymm9 21 | vperm2i128 ymm14, ymm12, ymm8, 0x20 22 | vperm2i128 ymm4, ymm12, ymm8, 0x31 23 | vpaddd ymm12, ymm14, ymm4 24 | vpermq ymm12, ymm12, 0xD8 25 | vpunpckhdq ymm14, ymm12, ymm7 26 | vpunpckldq ymm12, ymm12, ymm7 27 | vpaddq ymm12, ymm12, [rdi] 28 | vpaddq ymm14, ymm14, [rdi+0x20] 29 | vmovdqu [rdi], ymm12 30 | vmovdqu [rdi+0x20], ymm14 31 | ret 32 | 33 | .balign 16 34 | accum16: 35 | .type accum16, @function 36 | vpunpcklwd ymm12, ymm8, ymm7 37 | vpunpckhwd ymm8, ymm8, ymm7 38 | vpunpcklwd ymm14, ymm9, ymm7 39 | vpunpckhwd ymm9, ymm9, ymm7 40 | vpunpcklwd ymm4, ymm10, ymm7 41 | vpunpckhwd ymm10, ymm10, ymm7 42 | vpunpcklwd ymm5, ymm11, ymm7 43 | vpunpckhwd ymm11, ymm11, ymm7 44 | vpaddd ymm12, ymm4, ymm12 # 0- 3, 16-19 45 | vpaddd ymm8, ymm10, ymm8 # 4- 7, 20-23 46 | vpaddd ymm14, ymm5, ymm14 # 8-11, 24-27 47 | vpaddd ymm9, ymm11, ymm9 # 12-15, 28-31 48 | vperm2i128 ymm4, ymm12, ymm8, 0x20 49 | vperm2i128 ymm10, ymm12, ymm8, 0x31 50 | vpaddd ymm12, ymm10, ymm4 51 | vperm2i128 ymm5, ymm14, ymm9, 0x20 52 | vperm2i128 ymm11, ymm14, ymm9, 0x31 53 | vpaddd ymm4, ymm11, ymm5 54 | 55 | vpaddd ymm12, ymm12, [rdi+0x00] 56 | vpaddd ymm4, ymm4, [rdi+0x20] 57 | vmovdqu [rdi+0x00], ymm12 58 | vmovdqu [rdi+0x20], ymm4 
59 | ret 60 | 61 | # not yet ported to 32 bit flag size 62 | .balign 16 63 | accum32: 64 | .type accum32, @function 65 | vpunpcklwd ymm12, ymm8, ymm7 66 | vpunpckhwd ymm8, ymm8, ymm7 67 | vpunpcklwd ymm14, ymm9, ymm7 68 | vpunpckhwd ymm9, ymm9, ymm7 69 | vpunpcklwd ymm4, ymm10, ymm7 70 | vpunpckhwd ymm10, ymm10, ymm7 71 | vpunpcklwd ymm5, ymm11, ymm7 72 | vpunpckhwd ymm11, ymm11, ymm7 73 | vpaddd ymm12, ymm4, ymm12 74 | vpaddd ymm8, ymm10, ymm8 75 | vpaddd ymm14, ymm5, ymm14 76 | vpaddd ymm9, ymm11, ymm9 77 | vpermq ymm12, ymm12, 0xD8 78 | vpunpckhdq ymm4, ymm12, ymm7 79 | vpunpckldq ymm12, ymm12, ymm7 80 | vpaddq ymm12, ymm12, [rdi] 81 | vpaddq ymm4, ymm4, [rdi+0x80] 82 | vmovdqu [rdi], ymm12 83 | vmovdqu [rdi+0x80], ymm4 84 | vpermq ymm8, ymm8, 0xD8 85 | vpunpckhdq ymm4, ymm8, ymm7 86 | vpunpckldq ymm8, ymm8, ymm7 87 | vpaddq ymm8, ymm8, [rdi+0x20] 88 | vpaddq ymm4, ymm4, [rdi+0xA0] 89 | vmovdqu [rdi+0x20], ymm8 90 | vmovdqu [rdi+0xA0], ymm4 91 | vpermq ymm14, ymm14, 0xD8 92 | vpunpckhdq ymm4, ymm14, ymm7 93 | vpunpckldq ymm14, ymm14, ymm7 94 | vpaddq ymm14, ymm14, [rdi+0x40] 95 | vpaddq ymm4, ymm4, [rdi+0xC0] 96 | vmovdqu [rdi+0x40], ymm14 97 | vmovdqu [rdi+0xC0], ymm4 98 | vpermq ymm9, ymm9, 0xD8 99 | vpunpckhdq ymm4, ymm9, ymm7 100 | vpunpckldq ymm9, ymm9, ymm7 101 | vpaddq ymm9, ymm9, [rdi+0x60] 102 | vpaddq ymm4, ymm4, [rdi+0xE0] 103 | vmovdqu [rdi+0x60], ymm9 104 | vmovdqu [rdi+0xE0], ymm4 105 | ret 106 | 107 | # not yet ported to 32 bit flag size 108 | .balign 16 109 | accum64: 110 | .type accum64, @function 111 | vpunpcklwd ymm12, ymm8, ymm7 112 | vpunpckhwd ymm14, ymm8, ymm7 113 | vpermq ymm12, ymm12, 0xD8 114 | vpunpckhdq ymm4, ymm12, ymm7 115 | vpunpckldq ymm12, ymm12, ymm7 116 | vpaddq ymm12, ymm12, [rdi] 117 | vpaddq ymm4, ymm4, [rdi+0x80] 118 | vmovdqu [rdi], ymm12 119 | vmovdqu [rdi+0x80], ymm4 120 | vpermq ymm14, ymm14, 0xD8 121 | vpunpckhdq ymm4, ymm14, ymm7 122 | vpunpckldq ymm14, ymm14, ymm7 123 | vpaddq ymm14, ymm14, [rdi+0x20] 124 | vpaddq 
ymm4, ymm4, [rdi+0xA0] 125 | vmovdqu [rdi+0x20], ymm14 126 | vmovdqu [rdi+0xA0], ymm4 127 | vpunpcklwd ymm12, ymm9, ymm7 128 | vpunpckhwd ymm14, ymm9, ymm7 129 | vpermq ymm12, ymm12, 0xD8 130 | vpunpckhdq ymm4, ymm12, ymm7 131 | vpunpckldq ymm12, ymm12, ymm7 132 | vpaddq ymm12, ymm12, [rdi+0x40] 133 | vpaddq ymm4, ymm4, [rdi+0xC0] 134 | vmovdqu [rdi+0x40], ymm12 135 | vmovdqu [rdi+0xC0], ymm4 136 | vpermq ymm14, ymm14, 0xD8 137 | vpunpckhdq ymm4, ymm14, ymm7 138 | vpunpckldq ymm14, ymm14, ymm7 139 | vpaddq ymm14, ymm14, [rdi+0x60] 140 | vpaddq ymm4, ymm4, [rdi+0xE0] 141 | vmovdqu [rdi+0x60], ymm14 142 | vmovdqu [rdi+0xE0], ymm4 143 | vpunpcklwd ymm12, ymm10, ymm7 144 | vpunpckhwd ymm14, ymm10, ymm7 145 | vpermq ymm12, ymm12, 0xD8 146 | vpunpckhdq ymm4, ymm12, ymm7 147 | vpunpckldq ymm12, ymm12, ymm7 148 | vpaddq ymm12, ymm12, [rdi+0x100] 149 | vpaddq ymm4, ymm4, [rdi+0x180] 150 | vmovdqu [rdi+0x100], ymm12 151 | vmovdqu [rdi+0x180], ymm4 152 | vpermq ymm14, ymm14, 0xD8 153 | vpunpckhdq ymm4, ymm14, ymm7 154 | vpunpckldq ymm14, ymm14, ymm7 155 | vpaddq ymm14, ymm14, [rdi+0x120] 156 | vpaddq ymm4, ymm4, [rdi+0x1A0] 157 | vmovdqu [rdi+0x120], ymm14 158 | vmovdqu [rdi+0x1A0], ymm4 159 | vpunpcklwd ymm12, ymm11, ymm7 160 | vpunpckhwd ymm14, ymm11, ymm7 161 | vpermq ymm12, ymm12, 0xD8 162 | vpunpckhdq ymm4, ymm12, ymm7 163 | vpunpckldq ymm12, ymm12, ymm7 164 | vpaddq ymm12, ymm12, [rdi+0x140] 165 | vpaddq ymm4, ymm4, [rdi+0x1C0] 166 | vmovdqu [rdi+0x140], ymm12 167 | vmovdqu [rdi+0x1C0], ymm4 168 | vpermq ymm14, ymm14, 0xD8 169 | vpunpckhdq ymm4, ymm14, ymm7 170 | vpunpckldq ymm14, ymm14, ymm7 171 | vpaddq ymm14, ymm14, [rdi+0x160] 172 | vpaddq ymm4, ymm4, [rdi+0x1E0] 173 | vmovdqu [rdi+0x160], ymm14 174 | vmovdqu [rdi+0x1E0], ymm4 175 | ret 176 | -------------------------------------------------------------------------------- /asm/countavx2_64.S: -------------------------------------------------------------------------------- 1 | #include "kernelavx2.S" 2 | 3 | .balign 
16 4 | accum8: 5 | .type accum8, @function 6 | vpunpcklwd ymm12, ymm8, ymm7 7 | vpunpckhwd ymm8, ymm8, ymm7 8 | vpunpcklwd ymm14, ymm9, ymm7 9 | vpunpckhwd ymm9, ymm9, ymm7 10 | vpunpcklwd ymm4, ymm10, ymm7 11 | vpunpckhwd ymm10, ymm10, ymm7 12 | vpunpcklwd ymm5, ymm11, ymm7 13 | vpunpckhwd ymm11, ymm11, ymm7 14 | vpaddd ymm12, ymm4, ymm12 15 | vpaddd ymm8, ymm10, ymm8 16 | vpaddd ymm14, ymm5, ymm14 17 | vpaddd ymm9, ymm11, ymm9 18 | vpaddd ymm12, ymm12, ymm14 19 | vpaddd ymm8, ymm8, ymm9 20 | vperm2i128 ymm14, ymm12, ymm8, 0x20 21 | vperm2i128 ymm4, ymm12, ymm8, 0x31 22 | vpaddd ymm12, ymm14, ymm4 23 | vpermq ymm12, ymm12, 0xD8 24 | vpunpckhdq ymm14, ymm12, ymm7 25 | vpunpckldq ymm12, ymm12, ymm7 26 | vpaddq ymm12, ymm12, [rdi] 27 | vpaddq ymm14, ymm14, [rdi+0x20] 28 | vmovdqu [rdi], ymm12 29 | vmovdqu [rdi+0x20], ymm14 30 | ret 31 | 32 | .balign 16 33 | accum16: 34 | .type accum16, @function 35 | vpunpcklwd ymm12, ymm8, ymm7 36 | vpunpckhwd ymm8, ymm8, ymm7 37 | vpunpcklwd ymm14, ymm9, ymm7 38 | vpunpckhwd ymm9, ymm9, ymm7 39 | vpunpcklwd ymm4, ymm10, ymm7 40 | vpunpckhwd ymm10, ymm10, ymm7 41 | vpunpcklwd ymm5, ymm11, ymm7 42 | vpunpckhwd ymm11, ymm11, ymm7 43 | vpaddd ymm12, ymm4, ymm12 44 | vpaddd ymm8, ymm10, ymm8 45 | vpaddd ymm14, ymm5, ymm14 46 | vpaddd ymm9, ymm11, ymm9 47 | vperm2i128 ymm4, ymm12, ymm8, 0x20 48 | vperm2i128 ymm10, ymm12, ymm8, 0x31 49 | vpaddd ymm12, ymm10, ymm4 50 | vperm2i128 ymm5, ymm14, ymm9, 0x20 51 | vperm2i128 ymm11, ymm14, ymm9, 0x31 52 | vpaddd ymm4, ymm11, ymm5 53 | vpermq ymm12, ymm12, 0xD8 54 | vpunpckhdq ymm14, ymm12, ymm7 55 | vpunpckldq ymm12, ymm12, ymm7 56 | vpaddq ymm12, ymm12, [rdi] 57 | vpaddq ymm14, ymm14, [rdi+0x20] 58 | vmovdqu [rdi], ymm12 59 | vmovdqu [rdi+0x20], ymm14 60 | vpermq ymm4, ymm4, 0xD8 61 | vpunpckhdq ymm5, ymm4, ymm7 62 | vpunpckldq ymm4, ymm4, ymm7 63 | vpaddq ymm4, ymm4, [rdi+0x40] 64 | vpaddq ymm5, ymm5, [rdi+0x60] 65 | vmovdqu [rdi+0x40], ymm4 66 | vmovdqu [rdi+0x60], ymm5 67 | ret 68 | 69 | 
.balign 16 70 | accum32: 71 | .type accum32, @function 72 | vpunpcklwd ymm12, ymm8, ymm7 73 | vpunpckhwd ymm8, ymm8, ymm7 74 | vpunpcklwd ymm14, ymm9, ymm7 75 | vpunpckhwd ymm9, ymm9, ymm7 76 | vpunpcklwd ymm4, ymm10, ymm7 77 | vpunpckhwd ymm10, ymm10, ymm7 78 | vpunpcklwd ymm5, ymm11, ymm7 79 | vpunpckhwd ymm11, ymm11, ymm7 80 | vpaddd ymm12, ymm4, ymm12 81 | vpaddd ymm8, ymm10, ymm8 82 | vpaddd ymm14, ymm5, ymm14 83 | vpaddd ymm9, ymm11, ymm9 84 | vpermq ymm12, ymm12, 0xD8 85 | vpunpckhdq ymm4, ymm12, ymm7 86 | vpunpckldq ymm12, ymm12, ymm7 87 | vpaddq ymm12, ymm12, [rdi] 88 | vpaddq ymm4, ymm4, [rdi+0x80] 89 | vmovdqu [rdi], ymm12 90 | vmovdqu [rdi+0x80], ymm4 91 | vpermq ymm8, ymm8, 0xD8 92 | vpunpckhdq ymm4, ymm8, ymm7 93 | vpunpckldq ymm8, ymm8, ymm7 94 | vpaddq ymm8, ymm8, [rdi+0x20] 95 | vpaddq ymm4, ymm4, [rdi+0xA0] 96 | vmovdqu [rdi+0x20], ymm8 97 | vmovdqu [rdi+0xA0], ymm4 98 | vpermq ymm14, ymm14, 0xD8 99 | vpunpckhdq ymm4, ymm14, ymm7 100 | vpunpckldq ymm14, ymm14, ymm7 101 | vpaddq ymm14, ymm14, [rdi+0x40] 102 | vpaddq ymm4, ymm4, [rdi+0xC0] 103 | vmovdqu [rdi+0x40], ymm14 104 | vmovdqu [rdi+0xC0], ymm4 105 | vpermq ymm9, ymm9, 0xD8 106 | vpunpckhdq ymm4, ymm9, ymm7 107 | vpunpckldq ymm9, ymm9, ymm7 108 | vpaddq ymm9, ymm9, [rdi+0x60] 109 | vpaddq ymm4, ymm4, [rdi+0xE0] 110 | vmovdqu [rdi+0x60], ymm9 111 | vmovdqu [rdi+0xE0], ymm4 112 | ret 113 | 114 | .balign 16 115 | accum64: 116 | .type accum64, @function 117 | vpunpcklwd ymm12, ymm8, ymm7 118 | vpunpckhwd ymm14, ymm8, ymm7 119 | vpermq ymm12, ymm12, 0xD8 120 | vpunpckhdq ymm4, ymm12, ymm7 121 | vpunpckldq ymm12, ymm12, ymm7 122 | vpaddq ymm12, ymm12, [rdi] 123 | vpaddq ymm4, ymm4, [rdi+0x80] 124 | vmovdqu [rdi], ymm12 125 | vmovdqu [rdi+0x80], ymm4 126 | vpermq ymm14, ymm14, 0xD8 127 | vpunpckhdq ymm4, ymm14, ymm7 128 | vpunpckldq ymm14, ymm14, ymm7 129 | vpaddq ymm14, ymm14, [rdi+0x20] 130 | vpaddq ymm4, ymm4, [rdi+0xA0] 131 | vmovdqu [rdi+0x20], ymm14 132 | vmovdqu [rdi+0xA0], ymm4 133 | 
vpunpcklwd ymm12, ymm9, ymm7 134 | vpunpckhwd ymm14, ymm9, ymm7 135 | vpermq ymm12, ymm12, 0xD8 136 | vpunpckhdq ymm4, ymm12, ymm7 137 | vpunpckldq ymm12, ymm12, ymm7 138 | vpaddq ymm12, ymm12, [rdi+0x40] 139 | vpaddq ymm4, ymm4, [rdi+0xC0] 140 | vmovdqu [rdi+0x40], ymm12 141 | vmovdqu [rdi+0xC0], ymm4 142 | vpermq ymm14, ymm14, 0xD8 143 | vpunpckhdq ymm4, ymm14, ymm7 144 | vpunpckldq ymm14, ymm14, ymm7 145 | vpaddq ymm14, ymm14, [rdi+0x60] 146 | vpaddq ymm4, ymm4, [rdi+0xE0] 147 | vmovdqu [rdi+0x60], ymm14 148 | vmovdqu [rdi+0xE0], ymm4 149 | vpunpcklwd ymm12, ymm10, ymm7 150 | vpunpckhwd ymm14, ymm10, ymm7 151 | vpermq ymm12, ymm12, 0xD8 152 | vpunpckhdq ymm4, ymm12, ymm7 153 | vpunpckldq ymm12, ymm12, ymm7 154 | vpaddq ymm12, ymm12, [rdi+0x100] 155 | vpaddq ymm4, ymm4, [rdi+0x180] 156 | vmovdqu [rdi+0x100], ymm12 157 | vmovdqu [rdi+0x180], ymm4 158 | vpermq ymm14, ymm14, 0xD8 159 | vpunpckhdq ymm4, ymm14, ymm7 160 | vpunpckldq ymm14, ymm14, ymm7 161 | vpaddq ymm14, ymm14, [rdi+0x120] 162 | vpaddq ymm4, ymm4, [rdi+0x1A0] 163 | vmovdqu [rdi+0x120], ymm14 164 | vmovdqu [rdi+0x1A0], ymm4 165 | vpunpcklwd ymm12, ymm11, ymm7 166 | vpunpckhwd ymm14, ymm11, ymm7 167 | vpermq ymm12, ymm12, 0xD8 168 | vpunpckhdq ymm4, ymm12, ymm7 169 | vpunpckldq ymm12, ymm12, ymm7 170 | vpaddq ymm12, ymm12, [rdi+0x140] 171 | vpaddq ymm4, ymm4, [rdi+0x1C0] 172 | vmovdqu [rdi+0x140], ymm12 173 | vmovdqu [rdi+0x1C0], ymm4 174 | vpermq ymm14, ymm14, 0xD8 175 | vpunpckhdq ymm4, ymm14, ymm7 176 | vpunpckldq ymm14, ymm14, ymm7 177 | vpaddq ymm14, ymm14, [rdi+0x160] 178 | vpaddq ymm4, ymm4, [rdi+0x1E0] 179 | vmovdqu [rdi+0x160], ymm14 180 | vmovdqu [rdi+0x1E0], ymm4 181 | ret 182 | -------------------------------------------------------------------------------- /asm/countavx512_32.S: -------------------------------------------------------------------------------- 1 | #include "kernelavx512.S" 2 | 3 | # not ported to uint32_t flags yet 4 | .balign 16 5 | accum8: 6 | .type accum8, @function 7 | 
vpmovzxwq zmm0, xmm8 8 | vextracti64x2 xmm1, zmm8, 0x01 9 | vpmovzxwq zmm1, xmm1 10 | vextracti64x2 xmm2, zmm8, 0x02 11 | vpmovzxwq zmm2, xmm2 12 | vextracti64x2 xmm3, zmm8, 0x03 13 | vpmovzxwq zmm3, xmm3 14 | vpmovzxwq zmm4, xmm9 15 | vextracti64x2 xmm5, zmm9, 0x01 16 | vpmovzxwq zmm5, xmm5 17 | vextracti64x2 xmm6, zmm9, 0x02 18 | vpmovzxwq zmm6, xmm6 19 | vextracti64x2 xmm7, zmm9, 0x03 20 | vpmovzxwq zmm7, xmm7 21 | vpaddq zmm0, zmm0, zmm2 22 | vpaddq zmm1, zmm1, zmm3 23 | vpaddq zmm4, zmm4, zmm6 24 | vpaddq zmm5, zmm5, zmm7 25 | vpaddq zmm0, zmm0, zmm1 26 | vpaddq zmm4, zmm4, zmm5 27 | vpaddq zmm0, zmm0, zmm4 28 | vpaddq zmm0, zmm0, [rdi] 29 | vmovdqu64 [rdi], zmm0 30 | ret 31 | 32 | .balign 16 33 | accum16: 34 | .type accum16, @function 35 | vpmovzxwd zmm10, ymm8 36 | vextracti64x4 ymm12, zmm8, 0x01 37 | vpmovzxwd zmm12, ymm12 38 | vpmovzxwd zmm14, ymm9 39 | vextracti64x4 ymm16, zmm9, 0x01 40 | vpmovzxwd zmm16, ymm16 41 | vpaddd zmm10, zmm10, zmm12 42 | vpaddd zmm14, zmm14, zmm16 43 | vshufi64x2 zmm11, zmm10, zmm14, 0x44 44 | vshufi64x2 zmm15, zmm10, zmm14, 0xEE 45 | vpaddd zmm11, zmm11, zmm15 46 | vpaddd zmm11, zmm11, [rdi] 47 | vmovdqu64 [rdi], zmm11 48 | ret 49 | 50 | # not ported to uint32_t flags yet 51 | .balign 16 52 | accum32: 53 | .type accum32, @function 54 | vextracti64x2 xmm2, zmm8, 0x02 55 | vextracti64x2 xmm3, zmm9, 0x02 56 | vpmovzxwq zmm0, xmm8 57 | vpmovzxwq zmm1, xmm9 58 | vpmovzxwq zmm2, xmm2 59 | vpmovzxwq zmm3, xmm3 60 | vpaddq zmm0, zmm0, zmm2 61 | vpaddq zmm1, zmm1, zmm3 62 | vpaddq zmm0, zmm0, [rdi] 63 | vpaddq zmm1, zmm1, [rdi+0x1*0x40] 64 | vmovdqu64 [rdi], zmm0 65 | vmovdqu64 [rdi+0x1*0x40], zmm1 66 | vextracti64x2 xmm0, zmm8, 0x01 67 | vextracti64x2 xmm1, zmm9, 0x01 68 | vextracti64x2 xmm2, zmm8, 0x03 69 | vextracti64x2 xmm3, zmm9, 0x03 70 | vpmovzxwq zmm0, xmm0 71 | vpmovzxwq zmm1, xmm1 72 | vpmovzxwq zmm2, xmm2 73 | vpmovzxwq zmm3, xmm3 74 | vpaddq zmm0, zmm0, zmm2 75 | vpaddq zmm1, zmm1, zmm3 76 | vpaddq zmm0, zmm0, [rdi+0x2*0x40] 
77 | vpaddq zmm1, zmm1, [rdi+0x3*0x40] 78 | vmovdqu64 [rdi+0x2*0x40], zmm0 79 | vmovdqu64 [rdi+0x3*0x40], zmm1 80 | ret 81 | 82 | # not ported to uint32_t flags yet 83 | .balign 16 84 | accum64: 85 | .type accum64, @function 86 | vpmovzxwq zmm3, xmm8 87 | vpmovzxwq zmm4, xmm9 88 | vpaddq zmm3, zmm3, [rdi] 89 | vpaddq zmm4, zmm4, [rdi+0x1*0x40] 90 | vmovdqu64 [rdi], zmm3 91 | vmovdqu64 [rdi+0x1*0x40], zmm4 92 | vextracti64x2 xmm3, zmm8, 0x01 93 | vextracti64x2 xmm4, zmm9, 0x01 94 | vpmovzxwq zmm3, xmm3 95 | vpmovzxwq zmm4, xmm4 96 | vpaddq zmm3, zmm3, [rdi+0x2*0x40] 97 | vpaddq zmm4, zmm4, [rdi+0x3*0x40] 98 | vmovdqu64 [rdi+0x2*0x40], zmm3 99 | vmovdqu64 [rdi+0x3*0x40], zmm4 100 | vextracti64x2 xmm3, zmm8, 0x02 101 | vextracti64x2 xmm4, zmm9, 0x02 102 | vpmovzxwq zmm3, xmm3 103 | vpmovzxwq zmm4, xmm4 104 | vpaddq zmm3, zmm3, [rdi+0x4*0x40] 105 | vpaddq zmm4, zmm4, [rdi+0x5*0x40] 106 | vmovdqu64 [rdi+0x4*0x40], zmm3 107 | vmovdqu64 [rdi+0x5*0x40], zmm4 108 | vextracti64x2 xmm3, zmm8, 0x03 109 | vextracti64x2 xmm4, zmm9, 0x03 110 | vpmovzxwq zmm3, xmm3 111 | vpmovzxwq zmm4, xmm4 112 | vpaddq zmm3, zmm3, [rdi+0x6*0x40] 113 | vpaddq zmm4, zmm4, [rdi+0x7*0x40] 114 | vmovdqu64 [rdi+0x6*0x40], zmm3 115 | vmovdqu64 [rdi+0x7*0x40], zmm4 116 | ret 117 | -------------------------------------------------------------------------------- /asm/countavx512_64.S: -------------------------------------------------------------------------------- 1 | #include "kernelavx512.S" 2 | 3 | .balign 16 4 | accum8: 5 | .type accum8, @function 6 | vpmovzxwq zmm0, xmm8 7 | vextracti64x2 xmm1, zmm8, 0x01 8 | vpmovzxwq zmm1, xmm1 9 | vextracti64x2 xmm2, zmm8, 0x02 10 | vpmovzxwq zmm2, xmm2 11 | vextracti64x2 xmm3, zmm8, 0x03 12 | vpmovzxwq zmm3, xmm3 13 | vpmovzxwq zmm4, xmm9 14 | vextracti64x2 xmm5, zmm9, 0x01 15 | vpmovzxwq zmm5, xmm5 16 | vextracti64x2 xmm6, zmm9, 0x02 17 | vpmovzxwq zmm6, xmm6 18 | vextracti64x2 xmm7, zmm9, 0x03 19 | vpmovzxwq zmm7, xmm7 20 | vpaddq zmm0, zmm0, zmm2 21 | vpaddq 
zmm1, zmm1, zmm3 22 | vpaddq zmm4, zmm4, zmm6 23 | vpaddq zmm5, zmm5, zmm7 24 | vpaddq zmm0, zmm0, zmm1 25 | vpaddq zmm4, zmm4, zmm5 26 | vpaddq zmm0, zmm0, zmm4 27 | vpaddq zmm0, zmm0, [rdi] 28 | vmovdqu64 [rdi], zmm0 29 | ret 30 | 31 | .balign 16 32 | accum16: 33 | .type accum16, @function 34 | vpmovzxwq zmm10, xmm8 35 | vextracti64x2 xmm11, zmm8, 0x01 36 | vpmovzxwq zmm11, xmm11 37 | vextracti64x2 xmm12, zmm8, 0x02 38 | vpmovzxwq zmm12, xmm12 39 | vextracti64x2 xmm13, zmm8, 0x03 40 | vpmovzxwq zmm13, xmm13 41 | vpmovzxwq zmm14, xmm9 42 | vextracti64x2 xmm15, zmm9, 0x01 43 | vpmovzxwq zmm15, xmm15 44 | vextracti64x2 xmm16, zmm9, 0x02 45 | vpmovzxwq zmm16, xmm16 46 | vextracti64x2 xmm17, zmm9, 0x03 47 | vpmovzxwq zmm17, xmm17 48 | vpaddq zmm10, zmm10, zmm12 49 | vpaddq zmm11, zmm11, zmm13 50 | vpaddq zmm14, zmm14, zmm16 51 | vpaddq zmm15, zmm15, zmm17 52 | vpaddq zmm10, zmm10, zmm11 53 | vpaddq zmm14, zmm14, zmm15 54 | vpaddq zmm10, zmm10, [rdi] 55 | vpaddq zmm14, zmm14, [rdi+0x1*0x40] 56 | vmovdqu64 [rdi], zmm10 57 | vmovdqu64 [rdi+0x1*0x40], zmm14 58 | ret 59 | 60 | .balign 16 61 | accum32: 62 | .type accum32, @function 63 | vextracti64x2 xmm2, zmm8, 0x02 64 | vextracti64x2 xmm3, zmm9, 0x02 65 | vpmovzxwq zmm0, xmm8 66 | vpmovzxwq zmm1, xmm9 67 | vpmovzxwq zmm2, xmm2 68 | vpmovzxwq zmm3, xmm3 69 | vpaddq zmm0, zmm0, zmm2 70 | vpaddq zmm1, zmm1, zmm3 71 | vpaddq zmm0, zmm0, [rdi] 72 | vpaddq zmm1, zmm1, [rdi+0x1*0x40] 73 | vmovdqu64 [rdi], zmm0 74 | vmovdqu64 [rdi+0x1*0x40], zmm1 75 | vextracti64x2 xmm0, zmm8, 0x01 76 | vextracti64x2 xmm1, zmm9, 0x01 77 | vextracti64x2 xmm2, zmm8, 0x03 78 | vextracti64x2 xmm3, zmm9, 0x03 79 | vpmovzxwq zmm0, xmm0 80 | vpmovzxwq zmm1, xmm1 81 | vpmovzxwq zmm2, xmm2 82 | vpmovzxwq zmm3, xmm3 83 | vpaddq zmm0, zmm0, zmm2 84 | vpaddq zmm1, zmm1, zmm3 85 | vpaddq zmm0, zmm0, [rdi+0x2*0x40] 86 | vpaddq zmm1, zmm1, [rdi+0x3*0x40] 87 | vmovdqu64 [rdi+0x2*0x40], zmm0 88 | vmovdqu64 [rdi+0x3*0x40], zmm1 89 | ret 90 | 91 | .balign 16 92 | 
accum64: 93 | .type accum64, @function 94 | vpmovzxwq zmm3, xmm8 95 | vpmovzxwq zmm4, xmm9 96 | vpaddq zmm3, zmm3, [rdi] 97 | vpaddq zmm4, zmm4, [rdi+0x1*0x40] 98 | vmovdqu64 [rdi], zmm3 99 | vmovdqu64 [rdi+0x1*0x40], zmm4 100 | vextracti64x2 xmm3, zmm8, 0x01 101 | vextracti64x2 xmm4, zmm9, 0x01 102 | vpmovzxwq zmm3, xmm3 103 | vpmovzxwq zmm4, xmm4 104 | vpaddq zmm3, zmm3, [rdi+0x2*0x40] 105 | vpaddq zmm4, zmm4, [rdi+0x3*0x40] 106 | vmovdqu64 [rdi+0x2*0x40], zmm3 107 | vmovdqu64 [rdi+0x3*0x40], zmm4 108 | vextracti64x2 xmm3, zmm8, 0x02 109 | vextracti64x2 xmm4, zmm9, 0x02 110 | vpmovzxwq zmm3, xmm3 111 | vpmovzxwq zmm4, xmm4 112 | vpaddq zmm3, zmm3, [rdi+0x4*0x40] 113 | vpaddq zmm4, zmm4, [rdi+0x5*0x40] 114 | vmovdqu64 [rdi+0x4*0x40], zmm3 115 | vmovdqu64 [rdi+0x5*0x40], zmm4 116 | vextracti64x2 xmm3, zmm8, 0x03 117 | vextracti64x2 xmm4, zmm9, 0x03 118 | vpmovzxwq zmm3, xmm3 119 | vpmovzxwq zmm4, xmm4 120 | vpaddq zmm3, zmm3, [rdi+0x6*0x40] 121 | vpaddq zmm4, zmm4, [rdi+0x7*0x40] 122 | vmovdqu64 [rdi+0x6*0x40], zmm3 123 | vmovdqu64 [rdi+0x7*0x40], zmm4 124 | ret 125 | -------------------------------------------------------------------------------- /asm/countneon_32.S: -------------------------------------------------------------------------------- 1 | #include "kernelneon.S" 2 | 3 | // not ported to 32 bit flags yet 4 | accum8: 5 | ld1 {v4.2d-v7.2d}, [x2] 6 | uaddl v16.4s, v10.4h, v8.4h 7 | uaddl2 v17.4s, v10.8h, v8.8h 8 | uaddl v18.4s, v11.4h, v9.4h 9 | uaddl2 v19.4s, v11.8h, v9.8h 10 | uaddl v20.4s, v14.4h, v12.4h 11 | uaddl2 v21.4s, v14.8h, v12.8h 12 | uaddl v22.4s, v15.4h, v13.4h 13 | uaddl2 v23.4s, v15.8h, v13.8h 14 | add v16.4s, v16.4s, v18.4s 15 | add v17.4s, v17.4s, v19.4s 16 | add v20.4s, v20.4s, v22.4s 17 | add v21.4s, v21.4s, v23.4s 18 | add v16.4s, v16.4s, v20.4s 19 | add v17.4s, v17.4s, v21.4s 20 | uaddw v4.2d, v4.2d, v16.2s 21 | uaddw2 v5.2d, v5.2d, v16.4s 22 | uaddw v6.2d, v6.2d, v17.2s 23 | uaddw2 v7.2d, v7.2d, v17.4s 24 | st1 {v4.2d-v7.2d}, [x2] 25 
| ret 26 | 27 | accum16: 28 | ld1 {v4.4s-v7.4s}, [x2] 29 | uaddl v16.4s, v10.4h, v8.4h 30 | uaddl2 v17.4s, v10.8h, v8.8h 31 | uaddl v18.4s, v11.4h, v9.4h 32 | uaddl2 v19.4s, v11.8h, v9.8h 33 | uaddl v20.4s, v14.4h, v12.4h 34 | uaddl2 v21.4s, v14.8h, v12.8h 35 | uaddl v22.4s, v15.4h, v13.4h 36 | uaddl2 v23.4s, v15.8h, v13.8h 37 | add v16.4s, v16.4s, v20.4s 38 | add v17.4s, v17.4s, v21.4s 39 | add v18.4s, v18.4s, v22.4s 40 | add v19.4s, v19.4s, v23.4s 41 | add v4.4s, v4.4s, v16.4s 42 | add v5.4s, v5.4s, v17.4s 43 | add v6.4s, v6.4s, v18.4s 44 | add v7.4s, v7.4s, v19.4s 45 | st1 {v4.4s-v7.4s}, [x2] 46 | ret 47 | 48 | // not ported to 32 bit flags yet 49 | accum32: 50 | mov x7, x2 51 | mov x8, x2 52 | mov x9, #2 53 | 54 | 0: ld1 {v20.2d-v23.2d}, [x7], #64 55 | ld1 {v4.2d-v7.2d}, [x7], #64 56 | sub x9, x9, #0x1 57 | uaddl v16.4s, v12.4h, v8.4h 58 | uaddl2 v17.4s, v12.8h, v8.8h 59 | uaddl v18.4s, v13.4h, v9.4h 60 | uaddl2 v19.4s, v13.8h, v9.8h 61 | mov v8.16b, v10.16b 62 | mov v9.16b, v11.16b 63 | mov v12.16b, v14.16b 64 | mov v13.16b, v15.16b 65 | uaddw v20.2d, v20.2d, v16.2s 66 | uaddw2 v21.2d, v21.2d, v16.4s 67 | uaddw v22.2d, v22.2d, v17.2s 68 | uaddw2 v23.2d, v23.2d, v17.4s 69 | uaddw v4.2d, v4.2d, v18.2s 70 | uaddw2 v5.2d, v5.2d, v18.4s 71 | uaddw v6.2d, v6.2d, v19.2s 72 | uaddw2 v7.2d, v7.2d, v19.4s 73 | st1 {v20.2d-v23.2d}, [x8], #64 74 | st1 {v4.2d-v7.2d}, [x8], #64 75 | cbnz x9, 0b 76 | ret 77 | 78 | // not ported to 32 bit flags yet 79 | accum64: 80 | mov x7, x2 81 | mov x8, x2 82 | mov x9, #4 83 | 84 | 0: ld1 {v20.2d-v23.2d}, [x7], #64 85 | ld1 {v4.2d-v7.2d}, [x7], #64 86 | sub x9, x9, #0x1 87 | uxtl v16.4s, v8.4h 88 | uxtl2 v17.4s, v8.8h 89 | uxtl v18.4s, v9.4h 90 | uxtl2 v19.4s, v9.8h 91 | mov v8.16b, v10.16b 92 | mov v9.16b, v11.16b 93 | mov v10.16b, v12.16b 94 | mov v11.16b, v13.16b 95 | mov v12.16b, v14.16b 96 | mov v13.16b, v15.16b 97 | uaddw v20.2d, v20.2d, v16.2s 98 | uaddw2 v21.2d, v21.2d, v16.4s 99 | uaddw v22.2d, v22.2d, v17.2s 100 | uaddw2 v23.2d, 
v23.2d, v17.4s 101 | uaddw v4.2d, v4.2d, v18.2s 102 | uaddw2 v5.2d, v5.2d, v18.4s 103 | uaddw v6.2d, v6.2d, v19.2s 104 | uaddw2 v7.2d, v7.2d, v19.4s 105 | st1 {v20.2d-v23.2d}, [x8], #64 106 | st1 {v4.2d-v7.2d}, [x8], #64 107 | cbnz x9, 0b 108 | ret 109 | -------------------------------------------------------------------------------- /asm/countneon_64.S: -------------------------------------------------------------------------------- 1 | #include "kernelneon.S" 2 | 3 | accum8: 4 | ld1 {v4.2d-v7.2d}, [x2] 5 | uaddl v16.4s, v10.4h, v8.4h 6 | uaddl2 v17.4s, v10.8h, v8.8h 7 | uaddl v18.4s, v11.4h, v9.4h 8 | uaddl2 v19.4s, v11.8h, v9.8h 9 | uaddl v20.4s, v14.4h, v12.4h 10 | uaddl2 v21.4s, v14.8h, v12.8h 11 | uaddl v22.4s, v15.4h, v13.4h 12 | uaddl2 v23.4s, v15.8h, v13.8h 13 | add v16.4s, v16.4s, v18.4s 14 | add v17.4s, v17.4s, v19.4s 15 | add v20.4s, v20.4s, v22.4s 16 | add v21.4s, v21.4s, v23.4s 17 | add v16.4s, v16.4s, v20.4s 18 | add v17.4s, v17.4s, v21.4s 19 | uaddw v4.2d, v4.2d, v16.2s 20 | uaddw2 v5.2d, v5.2d, v16.4s 21 | uaddw v6.2d, v6.2d, v17.2s 22 | uaddw2 v7.2d, v7.2d, v17.4s 23 | st1 {v4.2d-v7.2d}, [x2] 24 | ret 25 | 26 | accum16: 27 | ld1 {v4.2d-v7.2d}, [x2], #64 28 | uaddl v16.4s, v10.4h, v8.4h 29 | uaddl2 v17.4s, v10.8h, v8.8h 30 | uaddl v18.4s, v11.4h, v9.4h 31 | uaddl2 v19.4s, v11.8h, v9.8h 32 | uaddl v20.4s, v14.4h, v12.4h 33 | uaddl2 v21.4s, v14.8h, v12.8h 34 | uaddl v22.4s, v15.4h, v13.4h 35 | uaddl2 v23.4s, v15.8h, v13.8h 36 | add v16.4s, v16.4s, v20.4s 37 | add v17.4s, v17.4s, v21.4s 38 | add v18.4s, v18.4s, v22.4s 39 | add v19.4s, v19.4s, v23.4s 40 | ld1 {v20.2d-v23.2d}, [x2] 41 | sub x2, x2, #0x40 42 | uaddw v4.2d, v4.2d, v16.2s 43 | uaddw2 v5.2d, v5.2d, v16.4s 44 | uaddw v6.2d, v6.2d, v17.2s 45 | uaddw2 v7.2d, v7.2d, v17.4s 46 | uaddw v20.2d, v20.2d, v18.2s 47 | uaddw2 v21.2d, v21.2d, v18.4s 48 | uaddw v22.2d, v22.2d, v19.2s 49 | uaddw2 v23.2d, v23.2d, v19.4s 50 | st1 {v4.2d-v7.2d}, [x2], #64 51 | st1 {v20.2d-v23.2d}, [x2] 52 | sub x2, x2, #0x40 
53 | ret 54 | 55 | accum32: 56 | mov x7, x2 57 | mov x8, x2 58 | mov x9, #2 59 | 60 | 0: ld1 {v20.2d-v23.2d}, [x7], #64 61 | ld1 {v4.2d-v7.2d}, [x7], #64 62 | sub x9, x9, #0x1 63 | uaddl v16.4s, v12.4h, v8.4h 64 | uaddl2 v17.4s, v12.8h, v8.8h 65 | uaddl v18.4s, v13.4h, v9.4h 66 | uaddl2 v19.4s, v13.8h, v9.8h 67 | mov v8.16b, v10.16b 68 | mov v9.16b, v11.16b 69 | mov v12.16b, v14.16b 70 | mov v13.16b, v15.16b 71 | uaddw v20.2d, v20.2d, v16.2s 72 | uaddw2 v21.2d, v21.2d, v16.4s 73 | uaddw v22.2d, v22.2d, v17.2s 74 | uaddw2 v23.2d, v23.2d, v17.4s 75 | uaddw v4.2d, v4.2d, v18.2s 76 | uaddw2 v5.2d, v5.2d, v18.4s 77 | uaddw v6.2d, v6.2d, v19.2s 78 | uaddw2 v7.2d, v7.2d, v19.4s 79 | st1 {v20.2d-v23.2d}, [x8], #64 80 | st1 {v4.2d-v7.2d}, [x8], #64 81 | cbnz x9, 0b 82 | ret 83 | 84 | accum64: 85 | mov x7, x2 86 | mov x8, x2 87 | mov x9, #4 88 | 89 | 0: ld1 {v20.2d-v23.2d}, [x7], #64 90 | ld1 {v4.2d-v7.2d}, [x7], #64 91 | sub x9, x9, #0x1 92 | uxtl v16.4s, v8.4h 93 | uxtl2 v17.4s, v8.8h 94 | uxtl v18.4s, v9.4h 95 | uxtl2 v19.4s, v9.8h 96 | mov v8.16b, v10.16b 97 | mov v9.16b, v11.16b 98 | mov v10.16b, v12.16b 99 | mov v11.16b, v13.16b 100 | mov v12.16b, v14.16b 101 | mov v13.16b, v15.16b 102 | uaddw v20.2d, v20.2d, v16.2s 103 | uaddw2 v21.2d, v21.2d, v16.4s 104 | uaddw v22.2d, v22.2d, v17.2s 105 | uaddw2 v23.2d, v23.2d, v17.4s 106 | uaddw v4.2d, v4.2d, v18.2s 107 | uaddw2 v5.2d, v5.2d, v18.4s 108 | uaddw v6.2d, v6.2d, v19.2s 109 | uaddw2 v7.2d, v7.2d, v19.4s 110 | st1 {v20.2d-v23.2d}, [x8], #64 111 | st1 {v4.2d-v7.2d}, [x8], #64 112 | cbnz x9, 0b 113 | ret 114 | -------------------------------------------------------------------------------- /asm/kernelavx2.S: -------------------------------------------------------------------------------- 1 | # AVX2 positional popcount with 15-fold CSA reduction 2 | # by Robert Clausecker 3 | # from github.com/clausecker/pospop@v1.3.5 4 | # with slight copy-editing 5 | 6 | .intel_syntax noprefix 7 | .section .rodata 8 | .balign 32 9 | 
magic: .quad 0x0000000000000000 10 | .quad 0x0101010101010101 11 | .quad 0x0202020202020202 12 | .quad 0x0303030303030303 13 | .quad 0x0404040404040404 14 | .quad 0x0505050505050505 15 | .quad 0x0606060606060606 16 | .quad 0x0707070707070707 17 | .quad 0x8040201008040201 18 | .long 0x55555555 19 | .long 0x33333333 20 | .long 0x0f0f0f0f 21 | .long 0x00ff00ff 22 | .size magic, .-magic 23 | 24 | window: .quad 0, 0, 0, 0, -1, -1, -1, -1 25 | .size window, .-window 26 | 27 | .section .text 28 | .balign 16 29 | countavx2: 30 | .type countavx2, @function 31 | push rbp 32 | mov rbp, rsp 33 | cmp rcx, 480 34 | jl .L22056 35 | mov edx, esi 36 | and edx, 0x1F 37 | mov eax, 32 38 | sub eax, edx 39 | sub rsi, rdx 40 | vmovdqa ymm0, [rsi] 41 | add rcx, rdx 42 | lea rdx, [window+rip] 43 | vpand ymm0, ymm0, [rdx+rax] 44 | vmovdqa ymm1, [rsi+0x20] 45 | vmovdqa ymm4, [rsi+0x40] 46 | vmovdqa ymm2, [rsi+0x60] 47 | vmovdqa ymm3, [rsi+0x80] 48 | vmovdqa ymm5, [rsi+0xA0] 49 | vmovdqa ymm6, [rsi+0xC0] 50 | vpand ymm7, ymm1, ymm0 51 | vpxor ymm0, ymm1, ymm0 52 | vpand ymm1, ymm4, ymm0 53 | vpxor ymm0, ymm4, ymm0 54 | vpor ymm1, ymm7, ymm1 55 | vmovdqa ymm4, [rsi+0xE0] 56 | vpand ymm7, ymm2, ymm3 57 | vpxor ymm3, ymm2, ymm3 58 | vpand ymm2, ymm5, ymm3 59 | vpxor ymm3, ymm5, ymm3 60 | vpor ymm2, ymm7, ymm2 61 | vmovdqa ymm5, [rsi+0x100] 62 | vpand ymm7, ymm3, ymm0 63 | vpxor ymm0, ymm3, ymm0 64 | vpand ymm3, ymm6, ymm0 65 | vpxor ymm0, ymm6, ymm0 66 | vpor ymm3, ymm7, ymm3 67 | vmovdqa ymm6, [rsi+0x120] 68 | vpand ymm7, ymm2, ymm1 69 | vpxor ymm1, ymm2, ymm1 70 | vpand ymm2, ymm3, ymm1 71 | vpxor ymm1, ymm3, ymm1 72 | vpor ymm2, ymm7, ymm2 73 | vmovdqa ymm3, [rsi+0x140] 74 | vpand ymm7, ymm4, ymm0 75 | vpxor ymm0, ymm4, ymm0 76 | vpand ymm4, ymm5, ymm0 77 | vpxor ymm0, ymm5, ymm0 78 | vpor ymm4, ymm7, ymm4 79 | vmovdqa ymm5, [rsi+0x160] 80 | vpand ymm7, ymm3, ymm0 81 | vpxor ymm0, ymm3, ymm0 82 | vpand ymm3, ymm6, ymm0 83 | vpxor ymm0, ymm6, ymm0 84 | vpor ymm3, ymm7, ymm3 85 | vmovdqa ymm6, 
[rsi+0x180] 86 | vpand ymm7, ymm3, ymm1 87 | vpxor ymm1, ymm3, ymm1 88 | vpand ymm3, ymm4, ymm1 89 | vpxor ymm1, ymm4, ymm1 90 | vpor ymm3, ymm7, ymm3 91 | vmovdqa ymm4, [rsi+0x1A0] 92 | vpand ymm7, ymm5, ymm0 93 | vpxor ymm0, ymm5, ymm0 94 | vpand ymm5, ymm6, ymm0 95 | vpxor ymm0, ymm6, ymm0 96 | vpor ymm5, ymm7, ymm5 97 | vmovdqa ymm6, [rsi+0x1C0] 98 | vpbroadcastd ymm15, [magic+72+rip] 99 | vpbroadcastd ymm13, [magic+76+rip] 100 | vpand ymm7, ymm4, ymm0 101 | vpxor ymm0, ymm4, ymm0 102 | vpand ymm4, ymm6, ymm0 103 | vpxor ymm0, ymm6, ymm0 104 | vpor ymm4, ymm7, ymm4 105 | vpxor ymm8, ymm8, ymm8 106 | vpxor ymm9, ymm9, ymm9 107 | vpand ymm7, ymm4, ymm1 108 | vpxor ymm1, ymm4, ymm1 109 | vpand ymm4, ymm5, ymm1 110 | vpxor ymm1, ymm5, ymm1 111 | vpor ymm4, ymm7, ymm4 112 | vpxor ymm10, ymm10, ymm10 113 | vpxor ymm11, ymm11, ymm11 114 | vpand ymm7, ymm3, ymm2 115 | vpxor ymm2, ymm3, ymm2 116 | vpand ymm3, ymm4, ymm2 117 | vpxor ymm2, ymm4, ymm2 118 | vpor ymm3, ymm7, ymm3 119 | add rsi, 480 120 | sub rcx, 992 121 | jl .L22052 122 | mov eax, 65535 123 | .L22050:vmovdqa ymm4, [rsi] 124 | vmovdqa ymm5, [rsi+0x20] 125 | vmovdqa ymm6, [rsi+0x40] 126 | vmovdqa ymm12, [rsi+0x60] 127 | vmovdqa ymm14, [rsi+0x80] 128 | vpand ymm7, ymm4, ymm0 129 | vpxor ymm0, ymm4, ymm0 130 | vpand ymm4, ymm5, ymm0 131 | vpxor ymm0, ymm5, ymm0 132 | vpor ymm4, ymm7, ymm4 133 | vmovdqa ymm5, [rsi+0xA0] 134 | vpand ymm7, ymm12, ymm6 135 | vpxor ymm6, ymm12, ymm6 136 | vpand ymm12, ymm14, ymm6 137 | vpxor ymm6, ymm14, ymm6 138 | vpor ymm12, ymm7, ymm12 139 | vmovdqa ymm14, [rsi+0xC0] 140 | vpand ymm7, ymm4, ymm1 141 | vpxor ymm1, ymm4, ymm1 142 | vpand ymm4, ymm12, ymm1 143 | vpxor ymm1, ymm12, ymm1 144 | vpor ymm4, ymm7, ymm4 145 | vmovdqa ymm12, [rsi+0xE0] 146 | vpand ymm7, ymm5, ymm0 147 | vpxor ymm0, ymm5, ymm0 148 | vpand ymm5, ymm6, ymm0 149 | vpxor ymm0, ymm6, ymm0 150 | vpor ymm5, ymm7, ymm5 151 | vmovdqa ymm6, [rsi+0x100] 152 | vpand ymm7, ymm12, ymm6 153 | vpxor ymm6, ymm12, ymm6 154 | 
vpand ymm12, ymm14, ymm6 155 | vpxor ymm6, ymm14, ymm6 156 | vpor ymm12, ymm7, ymm12 157 | vmovdqa ymm14, [rsi+0x120] 158 | vpand ymm7, ymm5, ymm1 159 | vpxor ymm1, ymm5, ymm1 160 | vpand ymm5, ymm12, ymm1 161 | vpxor ymm1, ymm12, ymm1 162 | vpor ymm5, ymm7, ymm5 163 | vmovdqa ymm12, [rsi+0x140] 164 | vpand ymm7, ymm12, ymm0 165 | vpxor ymm0, ymm12, ymm0 166 | vpand ymm12, ymm14, ymm0 167 | vpxor ymm0, ymm14, ymm0 168 | vpor ymm12, ymm7, ymm12 169 | vmovdqa ymm14, [rsi+0x160] 170 | vpand ymm7, ymm4, ymm2 171 | vpxor ymm2, ymm4, ymm2 172 | vpand ymm4, ymm5, ymm2 173 | vpxor ymm2, ymm5, ymm2 174 | vpor ymm4, ymm7, ymm4 175 | vmovdqa ymm5, [rsi+0x180] 176 | vpand ymm7, ymm6, ymm0 177 | vpxor ymm0, ymm6, ymm0 178 | vpand ymm6, ymm14, ymm0 179 | vpxor ymm0, ymm14, ymm0 180 | vpor ymm6, ymm7, ymm6 181 | vmovdqa ymm14, [rsi+0x1A0] 182 | vpand ymm7, ymm6, ymm1 183 | vpxor ymm1, ymm6, ymm1 184 | vpand ymm6, ymm12, ymm1 185 | vpxor ymm1, ymm12, ymm1 186 | vpor ymm6, ymm7, ymm6 187 | vmovdqa ymm12, [rsi+0x1C0] 188 | vpand ymm7, ymm12, ymm5 189 | vpxor ymm5, ymm12, ymm5 190 | vpand ymm12, ymm14, ymm5 191 | vpxor ymm5, ymm14, ymm5 192 | vpor ymm12, ymm7, ymm12 193 | vmovdqa ymm14, [rsi+0x1E0] 194 | vpand ymm7, ymm5, ymm0 195 | vpxor ymm0, ymm5, ymm0 196 | vpand ymm5, ymm14, ymm0 197 | vpxor ymm0, ymm14, ymm0 198 | vpor ymm5, ymm7, ymm5 199 | add rsi, 512 200 | prefetcht0 [rsi] 201 | prefetcht0 [rsi+0x20] 202 | vpand ymm7, ymm5, ymm1 203 | vpxor ymm1, ymm5, ymm1 204 | vpand ymm5, ymm12, ymm1 205 | vpxor ymm1, ymm12, ymm1 206 | vpor ymm5, ymm7, ymm5 207 | vpand ymm7, ymm5, ymm2 208 | vpxor ymm2, ymm5, ymm2 209 | vpand ymm5, ymm6, ymm2 210 | vpxor ymm2, ymm6, ymm2 211 | vpor ymm5, ymm7, ymm5 212 | vpand ymm7, ymm4, ymm3 213 | vpxor ymm3, ymm4, ymm3 214 | vpand ymm4, ymm5, ymm3 215 | vpxor ymm3, ymm5, ymm3 216 | vpor ymm4, ymm7, ymm4 217 | vpbroadcastd ymm12, [magic+84+rip] 218 | vpbroadcastd ymm14, [magic+80+rip] 219 | vpand ymm5, ymm15, ymm4 220 | vpandn ymm6, ymm15, ymm4 221 | 
vpsrld ymm6, ymm6, 1 222 | vperm2i128 ymm4, ymm5, ymm6, 0x20 223 | vperm2i128 ymm5, ymm5, ymm6, 0x31 224 | vpaddd ymm4, ymm4, ymm5 225 | vpand ymm5, ymm13, ymm4 226 | vpandn ymm6, ymm13, ymm4 227 | vpsrld ymm6, ymm6, 2 228 | vpunpcklqdq ymm4, ymm5, ymm6 229 | vpunpckhqdq ymm5, ymm5, ymm6 230 | vpaddd ymm4, ymm4, ymm5 231 | vpand ymm5, ymm14, ymm4 232 | vpandn ymm6, ymm14, ymm4 233 | vpslld ymm5, ymm5, 4 234 | vperm2i128 ymm4, ymm5, ymm6, 0x20 235 | vperm2i128 ymm5, ymm5, ymm6, 0x31 236 | vpunpcklwd ymm6, ymm4, ymm5 237 | vpunpckhwd ymm7, ymm4, ymm5 238 | vpunpckldq ymm4, ymm6, ymm7 239 | vpunpckhdq ymm5, ymm6, ymm7 240 | vpermq ymm4, ymm4, 0xD8 241 | vpermq ymm5, ymm5, 0xD8 242 | vpand ymm6, ymm12, ymm4 243 | vpand ymm7, ymm12, ymm5 244 | vpaddw ymm8, ymm8, ymm6 245 | vpaddw ymm10, ymm10, ymm7 246 | vpsrlw ymm4, ymm4, 8 247 | vpsrlw ymm5, ymm5, 8 248 | vpaddw ymm9, ymm9, ymm4 249 | vpaddw ymm11, ymm11, ymm5 250 | sub eax, 64 251 | cmp eax, 184 252 | jge .L22051 253 | vpxor ymm7, ymm7, ymm7 254 | call rbx 255 | vpxor ymm8, ymm8, ymm8 256 | vpxor ymm9, ymm9, ymm9 257 | vpxor ymm10, ymm10, ymm10 258 | vpxor ymm11, ymm11, ymm11 259 | mov eax, 65535 260 | .L22051:sub rcx, 512 261 | jge .L22050 262 | .L22052:vpbroadcastd ymm14, [magic+80+rip] 263 | vpand ymm5, ymm15, ymm1 264 | vpaddd ymm5, ymm5, ymm5 265 | vpand ymm7, ymm15, ymm3 266 | vpaddd ymm7, ymm7, ymm7 267 | vpand ymm4, ymm15, ymm0 268 | vpand ymm6, ymm15, ymm2 269 | vpor ymm4, ymm5, ymm4 270 | vpor ymm5, ymm7, ymm6 271 | vpandn ymm0, ymm15, ymm0 272 | vpsrld ymm0, ymm0, 1 273 | vpandn ymm2, ymm15, ymm2 274 | vpsrld ymm2, ymm2, 1 275 | vpandn ymm1, ymm15, ymm1 276 | vpandn ymm3, ymm15, ymm3 277 | vpor ymm6, ymm1, ymm0 278 | vpor ymm7, ymm3, ymm2 279 | vpand ymm1, ymm13, ymm5 280 | vpslld ymm1, ymm1, 2 281 | vpand ymm3, ymm13, ymm7 282 | vpslld ymm3, ymm3, 2 283 | vpand ymm0, ymm13, ymm4 284 | vpand ymm2, ymm13, ymm6 285 | vpor ymm0, ymm1, ymm0 286 | vpor ymm1, ymm3, ymm2 287 | vpandn ymm4, ymm13, ymm4 288 | 
vpsrld ymm4, ymm4, 2 289 | vpandn ymm6, ymm13, ymm6 290 | vpsrld ymm6, ymm6, 2 291 | vpandn ymm5, ymm13, ymm5 292 | vpandn ymm7, ymm13, ymm7 293 | vpor ymm2, ymm5, ymm4 294 | vpor ymm3, ymm7, ymm6 295 | vpunpcklbw ymm5, ymm0, ymm1 296 | vpunpckhbw ymm0, ymm0, ymm1 297 | vpunpcklbw ymm6, ymm2, ymm3 298 | vpunpckhbw ymm1, ymm2, ymm3 299 | vpunpcklwd ymm4, ymm5, ymm6 300 | vpunpckhwd ymm5, ymm5, ymm6 301 | vpunpcklwd ymm6, ymm0, ymm1 302 | vpunpckhwd ymm7, ymm0, ymm1 303 | vpand ymm0, ymm14, ymm4 304 | vpsrld ymm4, ymm4, 4 305 | vpand ymm4, ymm14, ymm4 306 | vpand ymm1, ymm14, ymm5 307 | vpsrld ymm5, ymm5, 4 308 | vpand ymm5, ymm14, ymm5 309 | vpand ymm2, ymm14, ymm6 310 | vpsrld ymm6, ymm6, 4 311 | vpand ymm6, ymm14, ymm6 312 | vpand ymm3, ymm14, ymm7 313 | vpsrld ymm7, ymm7, 4 314 | vpand ymm7, ymm14, ymm7 315 | vpaddb ymm0, ymm0, ymm2 316 | vpaddb ymm1, ymm1, ymm3 317 | vpaddb ymm2, ymm4, ymm6 318 | vpaddb ymm3, ymm5, ymm7 319 | vpunpckldq ymm4, ymm0, ymm2 320 | vpunpckhdq ymm5, ymm0, ymm2 321 | vpunpckldq ymm6, ymm1, ymm3 322 | vpunpckhdq ymm7, ymm1, ymm3 323 | vperm2i128 ymm0, ymm4, ymm5, 0x20 324 | vperm2i128 ymm2, ymm4, ymm5, 0x31 325 | vperm2i128 ymm1, ymm6, ymm7, 0x20 326 | vperm2i128 ymm3, ymm6, ymm7, 0x31 327 | vpaddb ymm0, ymm0, ymm2 328 | vpaddb ymm1, ymm1, ymm3 329 | vpxor ymm7, ymm7, ymm7 330 | vpunpcklbw ymm4, ymm0, ymm7 331 | vpunpckhbw ymm5, ymm0, ymm7 332 | vpunpcklbw ymm6, ymm1, ymm7 333 | vpunpckhbw ymm1, ymm1, ymm7 334 | vpaddw ymm8, ymm8, ymm4 335 | vpaddw ymm9, ymm9, ymm5 336 | vpaddw ymm10, ymm10, ymm6 337 | vpaddw ymm11, ymm11, ymm1 338 | cmp ecx, -512 339 | je .L22055 340 | vpbroadcastq ymm2, [magic+64+rip] 341 | vmovdqu ymm3, [magic+rip] 342 | vmovdqu ymm7, [magic+32+rip] 343 | vpxor ymm0, ymm0, ymm0 344 | vpxor ymm1, ymm1, ymm1 345 | sub ecx, -504 346 | jle .L22054 347 | .L22053:vpbroadcastq ymm4, [rsi] 348 | vpshufb ymm5, ymm4, ymm7 349 | vpshufb ymm4, ymm4, ymm3 350 | vpand ymm5, ymm5, ymm2 351 | vpand ymm4, ymm4, ymm2 352 | vpcmpeqb 
ymm5, ymm5, ymm2 353 | vpcmpeqb ymm4, ymm4, ymm2 354 | vpsubb ymm1, ymm1, ymm5 355 | vpsubb ymm0, ymm0, ymm4 356 | add rsi, 8 357 | sub ecx, 8 358 | jg .L22053 359 | .L22054:lea ecx, [rcx*8+0x40] 360 | bzhi rax, [rsi], rcx 361 | vmovq xmm6, rax 362 | vpbroadcastq ymm4, xmm6 363 | vpshufb ymm5, ymm4, ymm7 364 | vpshufb ymm4, ymm4, ymm3 365 | vpand ymm5, ymm5, ymm2 366 | vpand ymm4, ymm4, ymm2 367 | vpcmpeqb ymm5, ymm5, ymm2 368 | vpcmpeqb ymm4, ymm4, ymm2 369 | vpsubb ymm1, ymm1, ymm5 370 | vpsubb ymm0, ymm0, ymm4 371 | vpxor ymm7, ymm7, ymm7 372 | vpunpcklbw ymm4, ymm0, ymm7 373 | vpunpckhbw ymm5, ymm0, ymm7 374 | vpunpcklbw ymm6, ymm1, ymm7 375 | vpunpckhbw ymm7, ymm1, ymm7 376 | vpaddw ymm8, ymm8, ymm4 377 | vpaddw ymm9, ymm9, ymm5 378 | vpaddw ymm10, ymm10, ymm6 379 | vpaddw ymm11, ymm11, ymm7 380 | .L22055:vpxor ymm7, ymm7, ymm7 381 | call rbx 382 | vzeroupper 383 | pop rbp 384 | ret 385 | 386 | .L22056: 387 | vpbroadcastq ymm2, [magic+64+rip] 388 | vmovdqu ymm3, [magic+rip] 389 | vmovdqu ymm7, [magic+32+rip] 390 | vpxor ymm0, ymm0, ymm0 391 | vpxor ymm1, ymm1, ymm1 392 | sub ecx, 8 393 | jl .L22058 394 | .L22057:vpbroadcastq ymm4, [rsi] 395 | vpshufb ymm5, ymm4, ymm7 396 | vpshufb ymm4, ymm4, ymm3 397 | vpand ymm5, ymm5, ymm2 398 | vpand ymm4, ymm4, ymm2 399 | vpcmpeqb ymm5, ymm5, ymm2 400 | vpcmpeqb ymm4, ymm4, ymm2 401 | vpsubb ymm1, ymm1, ymm5 402 | vpsubb ymm0, ymm0, ymm4 403 | add rsi, 8 404 | sub ecx, 8 405 | jge .L22057 406 | .L22058:cmp ecx, -8 407 | jle .L22061 408 | lea edx, [rsi+rcx+0x7] 409 | xor edx, esi 410 | lea ecx, [rcx*8+0x40] 411 | test edx, 0x8 412 | jnz .L22059 413 | lea eax, [rsi*8] 414 | and rsi, 0xFFFFFFFFFFFFFFF8 415 | mov r8, [rsi] 416 | shrx r8, r8, rax 417 | bzhi r8, r8, rcx 418 | jmp .L22060 419 | 420 | .L22059:bzhi r8, [rsi], rcx 421 | .L22060:vmovq xmm6, r8 422 | vpbroadcastq ymm4, xmm6 423 | vpshufb ymm5, ymm4, ymm7 424 | vpshufb ymm4, ymm4, ymm3 425 | vpand ymm5, ymm5, ymm2 426 | vpand ymm4, ymm4, ymm2 427 | vpcmpeqb ymm5, 
ymm5, ymm2 428 | vpcmpeqb ymm4, ymm4, ymm2 429 | vpsubb ymm1, ymm1, ymm5 430 | vpsubb ymm0, ymm0, ymm4 431 | .L22061:vpxor ymm7, ymm7, ymm7 432 | vpunpcklbw ymm8, ymm0, ymm7 433 | vpunpckhbw ymm9, ymm0, ymm7 434 | vpunpcklbw ymm10, ymm1, ymm7 435 | vpunpckhbw ymm11, ymm1, ymm7 436 | call rbx 437 | vzeroupper 438 | pop rbp 439 | ret 440 | .size countavx2, .-countavx2 441 | 442 | .balign 16 443 | .globl count8avx2 444 | .type count8avx2, @function 445 | count8avx2: 446 | push rbp 447 | mov rbp, rsp 448 | push rbx 449 | mov rcx, rdx 450 | lea rbx, [accum8+rip] 451 | call countavx2 452 | pop rbx 453 | pop rbp 454 | ret 455 | .size count8avx2, .-count8avx2 456 | 457 | .balign 16 458 | .globl count16avx2 459 | .type count16avx2, @function 460 | count16avx2: 461 | push rbp 462 | mov rbp, rsp 463 | push rbx 464 | mov rcx, rdx 465 | lea rbx, [accum16+rip] 466 | shl rcx, 1 467 | call countavx2 468 | pop rbx 469 | pop rbp 470 | ret 471 | .size count16avx2, .-count16avx2 472 | 473 | .balign 16 474 | .globl count32avx2 475 | .type count32avx2, @function 476 | count32avx2: 477 | push rbp 478 | mov rbp, rsp 479 | push rbx 480 | mov rcx, rdx 481 | lea rbx, [accum32+rip] 482 | shl rcx, 2 483 | call countavx2 484 | pop rbx 485 | pop rbp 486 | ret 487 | .size count32avx2, .-count32avx2 488 | 489 | .balign 16 490 | .globl count64avx2 491 | .type count64avx2, @function 492 | count64avx2: 493 | push rbp 494 | mov rbp, rsp 495 | push rbx 496 | mov rcx, rdx 497 | lea rbx, [accum64+rip] 498 | shl rcx, 3 499 | call countavx2 500 | pop rbx 501 | pop rbp 502 | ret 503 | .size count64avx2, .-count64avx2 504 | -------------------------------------------------------------------------------- /asm/kernelavx512.S: -------------------------------------------------------------------------------- 1 | # AVX-512 positional popcount with 15-fold CSA reduction 2 | # by Robert Clausecker 3 | # from github.com/clausecker/pospop@1.3.0 4 | # with slight copy-editing 5 | 6 | .intel_syntax noprefix 7 | .section 
.rodata 8 | .balign 32 9 | magic: .long 0x55555555 10 | .long 0x33333333 11 | .long 0x0f0f0f0f 12 | .long 0x00ff00ff 13 | .quad 0x1c1814100c080400 14 | .quad 0x1d1915110d090501 15 | .quad 0x1e1a16120e0a0602 16 | .quad 0x1f1b17130f0b0703 17 | .size magic, .-magic 18 | 19 | .section .text 20 | .type countavx512, @function 21 | .balign 16 22 | countavx512: 23 | push rbp 24 | mov rbp, rsp 25 | vpternlogd zmm30, zmm30, zmm30, 0xFF 26 | vpxord ymm25, ymm25, ymm25 27 | cmp rcx, 960 28 | jl .L22072 29 | mov rax, -1 30 | shlx rax, rax, rsi 31 | kmovq k1, rax 32 | add rcx, rsi 33 | and rsi, 0xFFFFFFFFFFFFFFC0 34 | sub rcx, rsi 35 | vmovdqu8 zmm0 {k1}{z}, [rsi] 36 | vmovdqa64 zmm1, [rsi+0x1*0x40] 37 | vmovdqa64 zmm4, [rsi+0x2*0x40] 38 | vpxor ymm8, ymm8, ymm8 39 | vpxor ymm9, ymm9, ymm9 40 | vmovdqa64 zmm2, [rsi+0x3*0x40] 41 | vmovdqa64 zmm3, [rsi+0x4*0x40] 42 | vmovdqa64 zmm5, [rsi+0x5*0x40] 43 | vmovdqa64 zmm22, zmm0 44 | vpternlogd zmm0, zmm1, zmm4, 0x96 45 | vpternlogd zmm1, zmm22, zmm4, 0xE8 46 | vmovdqa64 zmm6, [rsi+0x6*0x40] 47 | vmovdqa64 zmm7, [rsi+0x7*0x40] 48 | vmovdqa64 zmm10, [rsi+0x8*0x40] 49 | vmovdqa64 zmm22, zmm2 50 | vpternlogd zmm2, zmm3, zmm5, 0x96 51 | vpternlogd zmm3, zmm22, zmm5, 0xE8 52 | vmovdqa64 zmm11, [rsi+0x9*0x40] 53 | vmovdqa64 zmm12, [rsi+0xA*0x40] 54 | vmovdqa64 zmm13, [rsi+0xB*0x40] 55 | vmovdqa64 zmm22, zmm6 56 | vpternlogd zmm6, zmm7, zmm10, 0x96 57 | vpternlogd zmm7, zmm22, zmm10, 0xE8 58 | vmovdqa64 zmm4, [rsi+0xC*0x40] 59 | vmovdqa64 zmm5, [rsi+0xD*0x40] 60 | vmovdqa64 zmm10, [rsi+0xE*0x40] 61 | vmovdqa64 zmm22, zmm11 62 | vpternlogd zmm11, zmm12, zmm13, 0x96 63 | vpternlogd zmm12, zmm22, zmm13, 0xE8 64 | vpbroadcastd zmm28, [magic+rip] 65 | vpbroadcastd zmm27, [magic+4+rip] 66 | vpbroadcastd zmm26, [magic+8+rip] 67 | vmovdqa64 zmm22, zmm4 68 | vpternlogd zmm4, zmm5, zmm10, 0x96 69 | vpternlogd zmm5, zmm22, zmm10, 0xE8 70 | vmovdqa64 zmm22, zmm0 71 | vpternlogd zmm0, zmm2, zmm6, 0x96 72 | vpternlogd zmm2, zmm22, zmm6, 0xE8 73 | vmovdqa64 
zmm22, zmm1 74 | vpternlogd zmm1, zmm3, zmm7, 0x96 75 | vpternlogd zmm3, zmm22, zmm7, 0xE8 76 | vmovdqa64 zmm22, zmm0 77 | vpternlogd zmm0, zmm11, zmm4, 0x96 78 | vpternlogd zmm11, zmm22, zmm4, 0xE8 79 | vmovdqa64 zmm22, zmm2 80 | vpternlogd zmm2, zmm12, zmm5, 0x96 81 | vpternlogd zmm12, zmm22, zmm5, 0xE8 82 | vmovdqa64 zmm22, zmm1 83 | vpternlogd zmm1, zmm2, zmm11, 0x96 84 | vpternlogd zmm2, zmm22, zmm11, 0xE8 85 | vmovdqa64 zmm22, zmm2 86 | vpternlogd zmm2, zmm3, zmm12, 0x96 87 | vpternlogd zmm3, zmm22, zmm12, 0xE8 88 | add rsi, 960 89 | sub rcx, 1984 90 | jl .L22068 91 | vpbroadcastd zmm24, [magic+12+rip] 92 | 93 | vpmovzxbw zmm23, [magic+16+rip] 94 | mov eax, 65535 95 | .L22066:vmovdqa64 zmm4, [rsi] 96 | vmovdqa64 zmm5, [rsi+0x1*0x40] 97 | vmovdqa64 zmm6, [rsi+0x2*0x40] 98 | vmovdqa64 zmm7, [rsi+0x3*0x40] 99 | vmovdqa64 zmm10, [rsi+0x4*0x40] 100 | vmovdqa64 zmm22, zmm0 101 | vpternlogd zmm0, zmm4, zmm5, 0x96 102 | vpternlogd zmm4, zmm22, zmm5, 0xE8 103 | vmovdqa64 zmm5, [rsi+0x5*0x40] 104 | vmovdqa64 zmm11, [rsi+0x6*0x40] 105 | vmovdqa64 zmm12, [rsi+0x7*0x40] 106 | vmovdqa64 zmm22, zmm6 107 | vpternlogd zmm6, zmm7, zmm10, 0x96 108 | vpternlogd zmm7, zmm22, zmm10, 0xE8 109 | vmovdqa64 zmm10, [rsi+0x8*0x40] 110 | vmovdqa64 zmm13, [rsi+0x9*0x40] 111 | vmovdqa64 zmm14, [rsi+0xA*0x40] 112 | vmovdqa64 zmm22, zmm5 113 | vpternlogd zmm5, zmm11, zmm12, 0x96 114 | vpternlogd zmm11, zmm22, zmm12, 0xE8 115 | vmovdqa64 zmm12, [rsi+0xB*0x40] 116 | vmovdqa64 zmm15, [rsi+0xC*0x40] 117 | vmovdqa64 zmm16, [rsi+0xD*0x40] 118 | vmovdqa64 zmm22, zmm10 119 | vpternlogd zmm10, zmm13, zmm14, 0x96 120 | vpternlogd zmm13, zmm22, zmm14, 0xE8 121 | vmovdqa64 zmm14, [rsi+0xE*0x40] 122 | vmovdqa64 zmm17, [rsi+0xF*0x40] 123 | vmovdqa64 zmm22, zmm12 124 | vpternlogd zmm12, zmm15, zmm16, 0x96 125 | vpternlogd zmm15, zmm22, zmm16, 0xE8 126 | add rsi, 1024 127 | prefetcht0 [rsi] 128 | vmovdqa64 zmm22, zmm0 129 | vpternlogd zmm0, zmm5, zmm6, 0x96 130 | vpternlogd zmm5, zmm22, zmm6, 0xE8 131 | 
prefetcht0 [rsi+0x40] 132 | vmovdqa64 zmm22, zmm1 133 | vpternlogd zmm1, zmm4, zmm7, 0x96 134 | vpternlogd zmm4, zmm22, zmm7, 0xE8 135 | vmovdqa64 zmm22, zmm10 136 | vpternlogd zmm10, zmm12, zmm14, 0x96 137 | vpternlogd zmm12, zmm22, zmm14, 0xE8 138 | vmovdqa64 zmm22, zmm11 139 | vpternlogd zmm11, zmm13, zmm15, 0x96 140 | vpternlogd zmm13, zmm22, zmm15, 0xE8 141 | vmovdqa64 zmm22, zmm0 142 | vpternlogd zmm0, zmm10, zmm17, 0x96 143 | vpternlogd zmm10, zmm22, zmm17, 0xE8 144 | vmovdqa64 zmm22, zmm1 145 | vpternlogd zmm1, zmm5, zmm11, 0x96 146 | vpternlogd zmm5, zmm22, zmm11, 0xE8 147 | vmovdqa64 zmm22, zmm2 148 | vpternlogd zmm2, zmm4, zmm13, 0x96 149 | vpternlogd zmm4, zmm22, zmm13, 0xE8 150 | vmovdqa64 zmm22, zmm1 151 | vpternlogd zmm1, zmm10, zmm12, 0x96 152 | vpternlogd zmm10, zmm22, zmm12, 0xE8 153 | vmovdqa64 zmm22, zmm2 154 | vpternlogd zmm2, zmm5, zmm10, 0x96 155 | vpternlogd zmm5, zmm22, zmm10, 0xE8 156 | vmovdqa64 zmm22, zmm3 157 | vpternlogd zmm3, zmm4, zmm5, 0x96 158 | vpternlogd zmm4, zmm22, zmm5, 0xE8 159 | vpandd zmm5, zmm28, zmm4 160 | vpandnd zmm6, zmm28, zmm4 161 | vpsrld zmm6, zmm6, 1 162 | vshufi64x2 zmm10, zmm5, zmm6, 0x44 163 | vshufi64x2 zmm11, zmm5, zmm6, 0xEE 164 | vpaddd zmm4, zmm11, zmm10 165 | vpandd zmm5, zmm27, zmm4 166 | vpandnd zmm6, zmm27, zmm4 167 | vpsrld zmm6, zmm6, 2 168 | vshufi64x2 zmm10, zmm5, zmm6, 0x88 169 | vshufi64x2 zmm11, zmm5, zmm6, 0xDD 170 | vpaddd zmm4, zmm11, zmm10 171 | vpandd zmm5, zmm26, zmm4 172 | vpandnd zmm6, zmm26, zmm4 173 | vpslld zmm5, zmm5, 4 174 | vpermq zmm5, zmm5, 0xD8 175 | vpermq zmm6, zmm6, 0xD8 176 | vshufi64x2 zmm10, zmm5, zmm6, 0x88 177 | vshufi64x2 zmm11, zmm5, zmm6, 0xDD 178 | vpaddd zmm4, zmm11, zmm10 179 | vpsrlw zmm6, zmm4, 8 180 | vpandd zmm5, zmm24, zmm4 181 | vpaddw zmm8, zmm8, zmm5 182 | vpaddw zmm9, zmm9, zmm6 183 | sub eax, 128 184 | cmp eax, 368 185 | jge .L22067 186 | vpermw zmm8, zmm23, zmm8 187 | vpermw zmm9, zmm23, zmm9 188 | call rbx 189 | vpxor ymm8, ymm8, ymm8 190 | vpxor ymm9, 
ymm9, ymm9 191 | mov eax, 65535 192 | .L22067:sub rcx, 1024 193 | jge .L22066 194 | vpermw zmm8, zmm23, zmm8 195 | vpermw zmm9, zmm23, zmm9 196 | .L22068:vpsrld zmm4, zmm0, 1 197 | vpaddd zmm5, zmm1, zmm1 198 | vpsrld zmm6, zmm2, 1 199 | vpaddd zmm7, zmm3, zmm3 200 | vpternlogd zmm0, zmm5, zmm28, 0xE4 201 | vpternlogd zmm1, zmm4, zmm28, 0xD8 202 | vpternlogd zmm2, zmm7, zmm28, 0xE4 203 | vpternlogd zmm3, zmm6, zmm28, 0xD8 204 | vpsrld zmm4, zmm0, 2 205 | vpsrld zmm6, zmm1, 2 206 | vpslld zmm5, zmm2, 2 207 | vpslld zmm7, zmm3, 2 208 | vpternlogd zmm2, zmm4, zmm27, 0xD8 209 | vpternlogd zmm3, zmm6, zmm27, 0xD8 210 | vpternlogd zmm0, zmm5, zmm27, 0xE4 211 | vpternlogd zmm1, zmm7, zmm27, 0xE4 212 | vpunpcklbw zmm6, zmm2, zmm3 213 | vpunpckhbw zmm3, zmm2, zmm3 214 | vpunpcklbw zmm5, zmm0, zmm1 215 | vpunpckhbw zmm2, zmm0, zmm1 216 | vpunpcklwd zmm4, zmm5, zmm6 217 | vpunpckhwd zmm5, zmm5, zmm6 218 | vpunpcklwd zmm6, zmm2, zmm3 219 | vpunpckhwd zmm7, zmm2, zmm3 220 | vpandd zmm0, zmm4, zmm26 221 | vpsrld zmm4, zmm4, 4 222 | vpandd zmm4, zmm4, zmm26 223 | vpandd zmm1, zmm5, zmm26 224 | vpsrld zmm5, zmm5, 4 225 | vpandd zmm5, zmm5, zmm26 226 | vpandd zmm2, zmm6, zmm26 227 | vpsrld zmm6, zmm6, 4 228 | vpandd zmm6, zmm6, zmm26 229 | vpandd zmm3, zmm7, zmm26 230 | vpsrld zmm7, zmm7, 4 231 | vpandd zmm7, zmm7, zmm26 232 | vpaddb zmm0, zmm0, zmm2 233 | vpaddb zmm1, zmm1, zmm3 234 | vpaddb zmm2, zmm4, zmm6 235 | vpaddb zmm3, zmm5, zmm7 236 | vpunpckldq zmm4, zmm0, zmm2 237 | vpunpckhdq zmm5, zmm0, zmm2 238 | vpunpckldq zmm6, zmm1, zmm3 239 | vpunpckhdq zmm7, zmm1, zmm3 240 | vshufi64x2 zmm0, zmm4, zmm5, 0x44 241 | vshufi64x2 zmm1, zmm4, zmm5, 0xEE 242 | vshufi64x2 zmm2, zmm6, zmm7, 0x44 243 | vshufi64x2 zmm3, zmm6, zmm7, 0xEE 244 | vpaddb zmm0, zmm0, zmm1 245 | vpaddb zmm2, zmm2, zmm3 246 | vshufi64x2 zmm1, zmm0, zmm2, 0x88 247 | vshufi64x2 zmm0, zmm0, zmm2, 0xDD 248 | vpaddb zmm0, zmm0, zmm1 249 | vpunpcklbw zmm1, zmm0, zmm25 250 | vpunpckhbw zmm2, zmm0, zmm25 251 | vpaddw 
zmm8, zmm8, zmm1 252 | vpaddw zmm9, zmm9, zmm2 253 | vpxor ymm0, ymm0, ymm0 254 | cmp ecx, -1024 255 | jz .L22071 256 | sub ecx, -1016 257 | jle .L22070 258 | .L22069:kmovq k1, [rsi] 259 | add rsi, 8 260 | vpsubb zmm0 {k1}, zmm0, zmm30 261 | sub ecx, 8 262 | jg .L22069 263 | .L22070:lea ecx, [rcx*8+0x40] 264 | bzhi rax, [rsi], rcx 265 | kmovq k1, rax 266 | vpsubb zmm0 {k1}, zmm0, zmm30 267 | vpunpcklbw zmm1, zmm0, zmm25 268 | vpunpckhbw zmm2, zmm0, zmm25 269 | vpaddw zmm8, zmm8, zmm1 270 | vpaddw zmm9, zmm9, zmm2 271 | .L22071:call rbx 272 | vzeroupper 273 | pop rbp 274 | ret 275 | 276 | .L22072:vpxor ymm0, ymm0, ymm0 277 | sub ecx, 8 278 | jle .L22074 279 | .L22073:kmovq k1, [rsi] 280 | add rsi, 8 281 | vpsubb zmm0 {k1}, zmm0, zmm30 282 | sub rcx, 8 283 | jg .L22073 284 | lea edx, [rcx*8] 285 | neg edx 286 | shrx rax, [rsi+rcx], rdx 287 | kmovq k1, rax 288 | vpsubb zmm0 {k1}, zmm0, zmm30 289 | vpunpcklbw zmm8, zmm0, zmm25 290 | vpunpckhbw zmm9, zmm0, zmm25 291 | call rbx 292 | vzeroupper 293 | pop rbp 294 | ret 295 | 296 | .L22074:.type .L22074, @function 297 | add ecx, 8 298 | xor eax, eax 299 | bts eax, ecx 300 | dec eax 301 | kmovd k1, eax 302 | vmovdqu8 xmm4 {k1}{z}, [rsi] 303 | vmovq rax, xmm4 304 | kmovq k1, rax 305 | vpsubb zmm0 {k1}, zmm0, zmm30 306 | vpunpcklbw zmm8, zmm0, zmm25 307 | vpunpckhbw zmm9, zmm0, zmm25 308 | call rbx 309 | vzeroupper 310 | pop rbp 311 | ret 312 | 313 | .balign 16 314 | .globl count8avx512 315 | .type count8avx512, @function 316 | count8avx512: 317 | push rbp 318 | mov rbp, rsp 319 | push rbx 320 | mov rcx, rdx 321 | lea rbx, [accum8+rip] 322 | call countavx512 323 | pop rbx 324 | pop rbp 325 | ret 326 | .size count8avx512, .-count8avx512 327 | 328 | .balign 16 329 | .globl count16avx512 330 | .type count16avx512, @function 331 | count16avx512: 332 | push rbp 333 | mov rbp, rsp 334 | push rbx 335 | mov rcx, rdx 336 | lea rbx, [accum16+rip] 337 | shl rcx, 1 338 | call countavx512 339 | pop rbx 340 | pop rbp 341 | ret 342 | .size 
count16avx512, .-count16avx512 343 | 344 | .balign 16 345 | .globl count32avx512 346 | .type count32avx512, @function 347 | count32avx512: 348 | push rbp 349 | mov rbp, rsp 350 | push rbx 351 | mov rcx, rdx 352 | lea rbx, [accum32+rip] 353 | shl rcx, 2 354 | call countavx512 355 | pop rbx 356 | pop rbp 357 | ret 358 | .size count32avx512, .-count32avx512 359 | 360 | .balign 16 361 | .globl count64avx512 362 | .type count64avx512, @function 363 | count64avx512: 364 | push rbp 365 | mov rbp, rsp 366 | push rbx 367 | mov rcx, rdx 368 | lea rbx, [accum64+rip] 369 | shl rcx, 3 370 | call countavx512 371 | pop rbx 372 | pop rbp 373 | ret 374 | .size count64avx512, .-count64avx512 375 | -------------------------------------------------------------------------------- /asm/kernelneon.S: -------------------------------------------------------------------------------- 1 | .section .rodata 2 | .balign 16 3 | magic: .quad 0x8040201008040201 4 | .quad 0, 0, -1, -1 5 | 6 | .section .text 7 | // X0: accumulation function 8 | // X1: input buffer 9 | // X2: counters 10 | // X3: remaining length 11 | countneon: 12 | str x30, [sp, #-16]! 
13 | adrp x4, magic 14 | add x4, x4, #:lo12:magic 15 | ld1r {v28.2d}, [x4], #8 16 | movi v30.8b, #0x1 17 | movi v29.16b, #0x2 18 | add v29.16b, v29.16b, v30.16b 19 | movi v8.16b, #0x0 20 | movi v10.16b, #0x0 21 | movi v12.16b, #0x0 22 | movi v14.16b, #0x0 23 | cmp x3, #0xf0 24 | blt .Lrunt 25 | and x6, x1, #0xf 26 | and x1, x1, #0xfffffffffffffff0 27 | sub x5, x6, #0x10 28 | add x3, x3, x6 29 | neg x5, x5 30 | ld1 {v3.16b}, [x1], #16 31 | ldr q5, [x4, x5] 32 | and v0.16b, v3.16b, v5.16b 33 | ld1 {v1.16b, v2.16b}, [x1], #32 34 | ld1 {v3.16b-v6.16b}, [x1], #64 35 | ld1 {v16.16b-v19.16b}, [x1], #64 36 | eor v31.16b, v0.16b, v1.16b 37 | eor v0.16b, v31.16b, v2.16b 38 | bit v1.16b, v2.16b, v31.16b 39 | movi v27.16b, #0x55 40 | eor v2.16b, v3.16b, v0.16b 41 | eor v0.16b, v4.16b, v2.16b 42 | bsl v2.16b, v4.16b, v3.16b 43 | movi v26.16b, #0x33 44 | eor v3.16b, v5.16b, v0.16b 45 | eor v0.16b, v6.16b, v3.16b 46 | bsl v3.16b, v6.16b, v5.16b 47 | ld1 {v4.16b-v7.16b}, [x1], #64 48 | eor v31.16b, v1.16b, v2.16b 49 | eor v1.16b, v31.16b, v3.16b 50 | bit v2.16b, v3.16b, v31.16b 51 | eor v31.16b, v0.16b, v16.16b 52 | eor v0.16b, v31.16b, v17.16b 53 | bit v16.16b, v17.16b, v31.16b 54 | movi v25.16b, #0xf 55 | eor v31.16b, v0.16b, v18.16b 56 | eor v0.16b, v31.16b, v19.16b 57 | bit v18.16b, v19.16b, v31.16b 58 | mov x6, #65535 59 | eor v3.16b, v16.16b, v1.16b 60 | eor v1.16b, v18.16b, v3.16b 61 | bsl v3.16b, v18.16b, v16.16b 62 | movi v9.16b, #0x0 63 | eor v31.16b, v0.16b, v4.16b 64 | eor v0.16b, v31.16b, v5.16b 65 | bit v4.16b, v5.16b, v31.16b 66 | movi v11.16b, #0x0 67 | eor v31.16b, v0.16b, v6.16b 68 | eor v0.16b, v31.16b, v7.16b 69 | bit v6.16b, v7.16b, v31.16b 70 | movi v13.16b, #0x0 71 | eor v31.16b, v1.16b, v4.16b 72 | eor v1.16b, v31.16b, v6.16b 73 | bit v4.16b, v6.16b, v31.16b 74 | movi v15.16b, #0x0 75 | eor v31.16b, v2.16b, v3.16b 76 | eor v2.16b, v31.16b, v4.16b 77 | bit v3.16b, v4.16b, v31.16b 78 | subs x3, x3, #0x1f0 79 | blt .Lpost 80 | 81 | .Lvec: ld1 {v4.16b-v7.16b}, 
[x1], #64 82 | ld1 {v16.16b-v19.16b}, [x1], #64 83 | ld1 {v20.16b-v23.16b}, [x1], #64 84 | eor v31.16b, v4.16b, v5.16b 85 | eor v4.16b, v31.16b, v6.16b 86 | bit v5.16b, v6.16b, v31.16b 87 | eor v31.16b, v0.16b, v17.16b 88 | eor v0.16b, v31.16b, v19.16b 89 | bit v17.16b, v19.16b, v31.16b 90 | eor v31.16b, v7.16b, v16.16b 91 | eor v7.16b, v31.16b, v18.16b 92 | bit v16.16b, v18.16b, v31.16b 93 | eor v31.16b, v21.16b, v22.16b 94 | eor v21.16b, v31.16b, v20.16b 95 | bit v22.16b, v20.16b, v31.16b 96 | eor v31.16b, v1.16b, v5.16b 97 | eor v1.16b, v31.16b, v17.16b 98 | bit v5.16b, v17.16b, v31.16b 99 | ld1 {v17.16b-v20.16b}, [x1], #64 100 | eor v31.16b, v0.16b, v4.16b 101 | eor v0.16b, v31.16b, v7.16b 102 | bit v4.16b, v7.16b, v31.16b 103 | eor v31.16b, v17.16b, v18.16b 104 | eor v17.16b, v31.16b, v23.16b 105 | bit v18.16b, v23.16b, v31.16b 106 | eor v31.16b, v19.16b, v20.16b 107 | eor v19.16b, v31.16b, v21.16b 108 | bit v20.16b, v21.16b, v31.16b 109 | eor v31.16b, v16.16b, v18.16b 110 | eor v16.16b, v31.16b, v22.16b 111 | bit v18.16b, v22.16b, v31.16b 112 | eor v31.16b, v1.16b, v4.16b 113 | eor v1.16b, v31.16b, v20.16b 114 | bit v4.16b, v20.16b, v31.16b 115 | eor v31.16b, v0.16b, v17.16b 116 | eor v0.16b, v31.16b, v19.16b 117 | bit v17.16b, v19.16b, v31.16b 118 | eor v31.16b, v2.16b, v5.16b 119 | eor v2.16b, v31.16b, v18.16b 120 | bit v5.16b, v18.16b, v31.16b 121 | eor v31.16b, v1.16b, v16.16b 122 | eor v1.16b, v31.16b, v17.16b 123 | bit v16.16b, v17.16b, v31.16b 124 | eor v31.16b, v2.16b, v4.16b 125 | eor v2.16b, v31.16b, v16.16b 126 | bit v4.16b, v16.16b, v31.16b 127 | eor v31.16b, v3.16b, v4.16b 128 | eor v3.16b, v31.16b, v5.16b 129 | bit v4.16b, v5.16b, v31.16b 130 | and v5.16b, v4.16b, v27.16b 131 | bic v6.16b, v4.16b, v27.16b 132 | ushr v6.16b, v6.16b, #1 133 | zip1 v4.2d, v5.2d, v6.2d 134 | zip2 v5.2d, v5.2d, v6.2d 135 | add v4.16b, v4.16b, v5.16b 136 | and v5.16b, v4.16b, v26.16b 137 | bic v6.16b, v4.16b, v26.16b 138 | ushr v6.16b, v6.16b, #2 139 | and v4.16b, 
v5.16b, v25.16b 140 | bic v5.16b, v5.16b, v25.16b 141 | bic v7.16b, v6.16b, v25.16b 142 | and v6.16b, v6.16b, v25.16b 143 | shl v4.16b, v4.16b, #4 144 | shl v6.16b, v6.16b, #4 145 | zip1 v16.16b, v4.16b, v6.16b 146 | zip2 v17.16b, v4.16b, v6.16b 147 | zip1 v18.16b, v5.16b, v7.16b 148 | zip2 v19.16b, v5.16b, v7.16b 149 | zip1 v4.16b, v16.16b, v17.16b 150 | zip2 v5.16b, v16.16b, v17.16b 151 | zip1 v6.16b, v18.16b, v19.16b 152 | zip2 v7.16b, v18.16b, v19.16b 153 | zip1 v16.4s, v4.4s, v6.4s 154 | zip2 v17.4s, v4.4s, v6.4s 155 | zip1 v18.4s, v5.4s, v7.4s 156 | zip2 v19.4s, v5.4s, v7.4s 157 | uaddw v8.8h, v8.8h, v16.8b 158 | uaddw2 v9.8h, v9.8h, v16.16b 159 | uaddw v10.8h, v10.8h, v17.8b 160 | uaddw2 v11.8h, v11.8h, v17.16b 161 | uaddw v12.8h, v12.8h, v18.8b 162 | uaddw2 v13.8h, v13.8h, v18.16b 163 | uaddw v14.8h, v14.8h, v19.8b 164 | uaddw2 v15.8h, v15.8h, v19.16b 165 | sub x6, x6, #0x1e 166 | cmp x6, #0x5c 167 | bge .Lhave_space 168 | 169 | blr x0 170 | movi v8.16b, #0x0 171 | movi v9.16b, #0x0 172 | movi v10.16b, #0x0 173 | movi v11.16b, #0x0 174 | movi v12.16b, #0x0 175 | movi v13.16b, #0x0 176 | movi v14.16b, #0x0 177 | movi v15.16b, #0x0 178 | mov x6, #65535 179 | 180 | .Lhave_space: 181 | subs x3, x3, #0x100 182 | bge .Lvec 183 | 184 | .Lpost: ushr v4.16b, v0.16b, #1 185 | add v5.16b, v1.16b, v1.16b 186 | ushr v6.16b, v2.16b, #1 187 | add v7.16b, v3.16b, v3.16b 188 | bif v0.16b, v5.16b, v27.16b 189 | bit v1.16b, v4.16b, v27.16b 190 | bif v2.16b, v7.16b, v27.16b 191 | bit v3.16b, v6.16b, v27.16b 192 | ushr v4.16b, v0.16b, #2 193 | ushr v6.16b, v1.16b, #2 194 | shl v5.16b, v2.16b, #2 195 | shl v7.16b, v3.16b, #2 196 | bit v2.16b, v4.16b, v26.16b 197 | bit v3.16b, v6.16b, v26.16b 198 | bif v0.16b, v5.16b, v26.16b 199 | bif v1.16b, v7.16b, v26.16b 200 | zip1 v6.16b, v2.16b, v3.16b 201 | zip2 v3.16b, v2.16b, v3.16b 202 | zip1 v5.16b, v0.16b, v1.16b 203 | zip2 v2.16b, v0.16b, v1.16b 204 | zip1 v4.8h, v5.8h, v6.8h 205 | zip2 v5.8h, v5.8h, v6.8h 206 | zip1 v6.8h, v2.8h, 
v3.8h 207 | zip2 v7.8h, v2.8h, v3.8h 208 | and v0.16b, v25.16b, v4.16b 209 | ushr v4.16b, v4.16b, #4 210 | and v1.16b, v25.16b, v5.16b 211 | ushr v5.16b, v5.16b, #4 212 | and v2.16b, v25.16b, v6.16b 213 | add v0.16b, v2.16b, v0.16b 214 | usra v4.16b, v6.16b, #4 215 | and v3.16b, v25.16b, v7.16b 216 | add v1.16b, v3.16b, v1.16b 217 | usra v5.16b, v7.16b, #4 218 | zip1 v2.4s, v0.4s, v4.4s 219 | zip2 v3.4s, v0.4s, v4.4s 220 | zip1 v6.4s, v1.4s, v5.4s 221 | zip2 v7.4s, v1.4s, v5.4s 222 | uaddw v8.8h, v8.8h, v2.8b 223 | uaddw2 v9.8h, v9.8h, v2.16b 224 | uaddw v10.8h, v10.8h, v3.8b 225 | uaddw2 v11.8h, v11.8h, v3.16b 226 | uaddw v12.8h, v12.8h, v6.8b 227 | uaddw2 v13.8h, v13.8h, v6.16b 228 | uaddw v14.8h, v14.8h, v7.8b 229 | uaddw2 v15.8h, v15.8h, v7.16b 230 | 231 | .Lendvec: 232 | movi v0.16b, #0x0 233 | movi v1.16b, #0x0 234 | movi v2.16b, #0x0 235 | movi v3.16b, #0x0 236 | adds x3, x3, #0xf8 237 | blt .Ltail1 238 | 239 | .Ltail8: 240 | subs x3, x3, #0x8 241 | ldr s6, [x1], #4 242 | ldr s7, [x1], #4 243 | tbl v4.16b, {v6.16b}, v30.16b 244 | tbl v5.16b, {v6.16b}, v29.16b 245 | cmtst v4.16b, v4.16b, v28.16b 246 | cmtst v5.16b, v5.16b, v28.16b 247 | sub v0.16b, v0.16b, v4.16b 248 | sub v1.16b, v1.16b, v5.16b 249 | tbl v4.16b, {v7.16b}, v30.16b 250 | tbl v5.16b, {v7.16b}, v29.16b 251 | cmtst v4.16b, v4.16b, v28.16b 252 | cmtst v5.16b, v5.16b, v28.16b 253 | sub v2.16b, v2.16b, v4.16b 254 | sub v3.16b, v3.16b, v5.16b 255 | bge .Ltail8 256 | 257 | .Ltail1: 258 | adds x3, x3, #0x8 259 | ble .Lend 260 | 261 | ldr d6, [x1] 262 | sub x6, x4, x3 263 | ldr q5, [x6, #16] 264 | bic v6.16b, v6.16b, v5.16b 265 | ext v7.16b, v6.16b, v6.16b, #4 266 | tbl v4.16b, {v6.16b}, v30.16b 267 | tbl v5.16b, {v6.16b}, v29.16b 268 | cmtst v4.16b, v4.16b, v28.16b 269 | cmtst v5.16b, v5.16b, v28.16b 270 | sub v0.16b, v0.16b, v4.16b 271 | sub v1.16b, v1.16b, v5.16b 272 | tbl v4.16b, {v7.16b}, v30.16b 273 | tbl v5.16b, {v7.16b}, v29.16b 274 | cmtst v4.16b, v4.16b, v28.16b 275 | cmtst v5.16b, v5.16b, 
v28.16b 276 | sub v2.16b, v2.16b, v4.16b 277 | sub v3.16b, v3.16b, v5.16b 278 | 279 | .Lend: uaddw v9.8h, v9.8h, v0.8b 280 | uaddw2 v8.8h, v8.8h, v0.16b 281 | uaddw v11.8h, v11.8h, v1.8b 282 | uaddw2 v10.8h, v10.8h, v1.16b 283 | uaddw v13.8h, v13.8h, v2.8b 284 | uaddw2 v12.8h, v12.8h, v2.16b 285 | uaddw v15.8h, v15.8h, v3.8b 286 | uaddw2 v14.8h, v14.8h, v3.16b 287 | blr x0 288 | ldr x30, [sp], #16 289 | ret 290 | 291 | .Lrunt: 292 | subs x3, x3, #0x8 293 | blt .Lrunt1 294 | 295 | .Lrunt8: 296 | subs x3, x3, #0x8 297 | ldr s6, [x1], #4 298 | ldr s7, [x1], #4 299 | tbl v4.16b, {v6.16b}, v30.16b 300 | tbl v5.16b, {v6.16b}, v29.16b 301 | cmtst v4.16b, v4.16b, v28.16b 302 | cmtst v5.16b, v5.16b, v28.16b 303 | sub v8.16b, v8.16b, v4.16b 304 | sub v10.16b, v10.16b, v5.16b 305 | tbl v4.16b, {v7.16b}, v30.16b 306 | tbl v5.16b, {v7.16b}, v29.16b 307 | cmtst v4.16b, v4.16b, v28.16b 308 | cmtst v5.16b, v5.16b, v28.16b 309 | sub v12.16b, v12.16b, v4.16b 310 | sub v14.16b, v14.16b, v5.16b 311 | bge .Lrunt8 312 | 313 | .Lrunt1: 314 | adds x3, x3, #0x8 315 | ble .Lrunt_accum 316 | 317 | and x5, x1, #0x7 318 | add x8, x3, x5 319 | lsl x3, x3, #3 320 | mov x7, #0xffffffffffffffff // #-1 321 | lsl x7, x7, x3 322 | cmp x8, #0x8 323 | bgt .Lcrossrunt1 324 | 325 | and x1, x1, #0xfffffffffffffff8 326 | ldr x6, [x1] 327 | lsl x5, x5, #3 328 | lsr x6, x6, x5 329 | b .Ldorunt1 330 | 331 | .Lcrossrunt1: 332 | ldr x6, [x1] 333 | 334 | .Ldorunt1: 335 | bic x6, x6, x7 336 | fmov d6, x6 337 | ext v7.16b, v6.16b, v6.16b, #4 338 | tbl v4.16b, {v6.16b}, v30.16b 339 | tbl v5.16b, {v6.16b}, v29.16b 340 | cmtst v4.16b, v4.16b, v28.16b 341 | cmtst v5.16b, v5.16b, v28.16b 342 | sub v8.16b, v8.16b, v4.16b 343 | sub v10.16b, v10.16b, v5.16b 344 | tbl v4.16b, {v7.16b}, v30.16b 345 | tbl v5.16b, {v7.16b}, v29.16b 346 | cmtst v4.16b, v4.16b, v28.16b 347 | cmtst v5.16b, v5.16b, v28.16b 348 | sub v12.16b, v12.16b, v4.16b 349 | sub v14.16b, v14.16b, v5.16b 350 | 351 | .Lrunt_accum: 352 | uxtl v9.8h, v8.8b 353 | 
uxtl2 v8.8h, v8.16b 354 | uxtl v11.8h, v10.8b 355 | uxtl2 v10.8h, v10.16b 356 | uxtl v13.8h, v12.8b 357 | uxtl2 v12.8h, v12.16b 358 | uxtl v15.8h, v14.8b 359 | uxtl2 v14.8h, v14.16b 360 | blr x0 361 | ldr x30, [sp], #16 362 | ret 363 | 364 | .globl count8neon 365 | .type count8neon, %function 366 | // extern void count8neon(flags_type *flags, const uint8_t *data, uint32_t len); 367 | count8neon: 368 | sub x3, sp, #4*16 369 | str x30, [sp, #-5*16]! 370 | st1 {v8.1d-v11.1d}, [x3], #32 371 | st1 {v12.1d-v15.1d}, [x3], #32 372 | mov w3, w2 373 | mov x2, x0 374 | adr x0, accum8 375 | bl countneon 376 | ldr x30, [sp], #16 377 | ld1 {v8.1d-v11.1d}, [sp], #32 378 | ld1 {v12.1d-v15.1d}, [sp], #32 379 | ret 380 | 381 | .globl count16neon 382 | .type count16neon, %function 383 | // extern void count16neon(flags_type *flags, const uint16_t *data, uint32_t len); 384 | count16neon: 385 | sub x3, sp, #4*16 386 | str x30, [sp, #-5*16]! 387 | st1 {v8.1d-v11.1d}, [x3], #32 388 | st1 {v12.1d-v15.1d}, [x3], #32 389 | ubfiz x3, x2, #1, #32 390 | mov x2, x0 391 | adr x0, accum16 392 | bl countneon 393 | ldr x30, [sp], #16 394 | ld1 {v8.1d-v11.1d}, [sp], #32 395 | ld1 {v12.1d-v15.1d}, [sp], #32 396 | ret 397 | 398 | .globl count32neon 399 | .type count32neon, %function 400 | // extern void count32neon(flags_type *flags, const uint32_t *data, uint32_t len); 401 | count32neon: 402 | sub x3, sp, #4*16 403 | str x30, [sp, #-5*16]! 404 | st1 {v8.1d-v11.1d}, [x3], #32 405 | st1 {v12.1d-v15.1d}, [x3], #32 406 | ubfiz x3, x2, #2, #32 407 | mov x2, x0 408 | adr x0, accum32 409 | bl countneon 410 | ldr x30, [sp], #16 411 | ld1 {v8.1d-v11.1d}, [sp], #32 412 | ld1 {v12.1d-v15.1d}, [sp], #32 413 | ret 414 | 415 | .globl count64neon 416 | .type count64neon, %function 417 | // extern void count64neon(flags_type *flags, const uint64_t *data, uint32_t len); 418 | count64neon: 419 | sub x3, sp, #4*16 420 | str x30, [sp, #-5*16]!
421 | st1 {v8.1d-v11.1d}, [x3], #32 422 | st1 {v12.1d-v15.1d}, [x3], #32 423 | ubfiz x3, x2, #3, #32 424 | mov x2, x0 425 | adr x0, accum64 426 | bl countneon 427 | ldr x30, [sp], #16 428 | ld1 {v8.1d-v11.1d}, [sp], #32 429 | ld1 {v12.1d-v15.1d}, [sp], #32 430 | ret 431 | -------------------------------------------------------------------------------- /benchmark/golike/benchmark.c: -------------------------------------------------------------------------------- 1 | #define _GNU_SOURCE 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | #include 12 | #include 13 | #include 14 | 15 | #if FLAGSIZE == 64 16 | typedef uint64_t flags_type; 17 | #else 18 | typedef uint32_t flags_type; 19 | #endif 20 | 21 | #ifndef TIME_GOAL 22 | /* run each benchmark for at least this many seconds */ 23 | #define TIME_GOAL 2.0 24 | #endif 25 | 26 | #ifdef ALIGN 27 | #define memory_allocate(size) aligned_alloc(64, (size) + 63 & ~63) 28 | #else 29 | #define memory_allocate(size) malloc(size) 30 | #endif 31 | 32 | #ifdef __x86_64__ 33 | #include "pospopcnt_avx512bw.h" 34 | #endif 35 | #include "pospopcnt.h" 36 | 37 | typedef void pospopcnt_u16(const uint16_t *data, uint32_t len, flags_type *flags); 38 | static pospopcnt_u16 pospopcnt_u16_dummy; 39 | 40 | /* a "do nothing" implementation to measure the memory overhead */ 41 | static void pospopcnt_u16_dummy(const uint16_t *data, uint32_t len, flags_type *flags) 42 | { 43 | volatile uint16_t sink; 44 | uint16_t sum = 0; 45 | 46 | for (uint32_t i = 0; i < len; i++) 47 | sum += data[i]; 48 | 49 | sink = sum; 50 | 51 | for (uint32_t i = 0; i < 16; i++) 52 | flags[i] += sum; 53 | } 54 | 55 | #ifdef __x86_64__ 56 | extern void count16avx512(flags_type *flags, const uint16_t *data, uint32_t len); 57 | extern void count16avx2(flags_type *flags, const uint16_t *data, uint32_t len); 58 | 59 | static void 60 | pospopcnt_count16avx512(const uint16_t *data, uint32_t len, flags_type *flags) 61 | 
{ 62 | count16avx512(flags, data, len); 63 | } 64 | 65 | static void 66 | pospopcnt_count16avx2(const uint16_t *data, uint32_t len, flags_type *flags) 67 | { 68 | count16avx2(flags, data, len); 69 | } 70 | #endif 71 | 72 | #ifdef __aarch64__ 73 | extern void count16neon(flags_type *flags, const uint16_t *data, uint32_t len); 74 | static void 75 | pospopcnt_count16neon(const uint16_t *data, uint32_t len, flags_type *flags) 76 | { 77 | count16neon(flags, data, len); 78 | } 79 | #endif 80 | 81 | static const struct pospopcnt_u16_method { 82 | const char *const name; 83 | pospopcnt_u16 *const method; 84 | } methods[] = { 85 | { "overhead", pospopcnt_u16_dummy }, 86 | { "scalar", pospopcnt_u16_scalar }, 87 | #ifdef __x86_64__ 88 | { "avx512bw_harvey_seal_1KB", pospopcnt_u16_avx512bw_harvey_seal_1KB }, 89 | { "avx512bw_harvey_seal_512B", pospopcnt_u16_avx512bw_harvey_seal_512B }, 90 | { "avx512bw_harvey_seal_256B", pospopcnt_u16_avx512bw_harvey_seal_256B }, 91 | { "count16avx512", pospopcnt_count16avx512 }, 92 | { "count16avx2", pospopcnt_count16avx2 }, 93 | #endif 94 | #ifdef __aarch64__ 95 | { "count16neon", pospopcnt_count16neon }, 96 | #endif 97 | { NULL, NULL }, 98 | }; 99 | 100 | static int event_group = -1, num_counters = 0; 101 | enum { EVENT_COUNT = 2 }; 102 | static struct event { 103 | uint32_t type; 104 | int fd; 105 | uint64_t conf; 106 | } events[EVENT_COUNT] = { 107 | { PERF_TYPE_HARDWARE, -1, PERF_COUNT_HW_CPU_CYCLES }, 108 | { PERF_TYPE_HARDWARE, -1, PERF_COUNT_HW_INSTRUCTIONS }, 109 | }; 110 | 111 | /* set up performance counters so performance measurements can be taken */ 112 | static void init_counters() 113 | { 114 | int i; 115 | struct perf_event_attr attribs; 116 | 117 | memset(&attribs, 0, sizeof attribs); 118 | attribs.exclude_kernel = 1; 119 | attribs.exclude_hv = 1; 120 | attribs.sample_period = 0; 121 | attribs.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID; 122 | 123 | for (i = 0; i < EVENT_COUNT; i++) { 124 | attribs.type = events[i].type; 
125 | attribs.config = events[i].conf; 126 | events[i].fd = syscall(SYS_perf_event_open, &attribs, 0, -1, event_group, 0); 127 | if (events[i].fd == -1) { 128 | perror("perf_event_open"); 129 | continue; 130 | } 131 | 132 | num_counters++; 133 | 134 | if (event_group == -1) 135 | event_group = events[i].fd; 136 | } 137 | } 138 | 139 | /* state of performance counters at one point in time */ 140 | struct counters { 141 | struct timespec ts; 142 | uint64_t counters[2 * EVENT_COUNT + 1]; 143 | }; 144 | 145 | typedef int testfunc(struct counters *c, void *payload, size_t n, size_t m); 146 | 147 | /* initialise hardware perf counters */ 148 | 149 | /* set counters to their current value */ 150 | static int 151 | reset_counters(struct counters *c) 152 | { 153 | ssize_t count; 154 | int res; 155 | 156 | res = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c->ts); 157 | if (res == -1) { 158 | perror("clock_gettime"); 159 | return (-1); 160 | } 161 | 162 | if (event_group != -1) { 163 | count = read(event_group, c->counters, (2 * num_counters + 1) * sizeof c->counters[0]); 164 | if (count < 0) { 165 | perror("read(event_group, ...)"); 166 | return (-1); 167 | } 168 | } 169 | 170 | return (0); 171 | } 172 | 173 | /* compute the difference between two timespec as a double */ 174 | static double 175 | tsdiff(struct timespec start, struct timespec end) 176 | { 177 | time_t sec; 178 | long nsec; 179 | 180 | sec = end.tv_sec - start.tv_sec; 181 | nsec = end.tv_nsec - start.tv_nsec; 182 | if (nsec < 0) { 183 | sec--; 184 | nsec += 1000000000L; 185 | } 186 | 187 | return (sec + nsec * 1.0e-9); 188 | } 189 | 190 | /* compute the difference between the two counter vectors */ 191 | static void 192 | counterdiff(uint64_t out[EVENT_COUNT], uint64_t start[], uint64_t end[]) 193 | { 194 | int i, j; 195 | 196 | for (i = 0, j = 0; i < EVENT_COUNT; i++) { 197 | if (events[i].fd == -1) 198 | out[i] = 0; 199 | else { 200 | out[i] = end[2*j+1] - start[2*j+1]; 201 | j++; 202 | } 203 | } 204 | } 205 | 
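The index arithmetic `2*j+1` in `counterdiff` follows from the buffer layout that a `read()` on a perf event group returns when it is opened with `PERF_FORMAT_GROUP | PERF_FORMAT_ID`: one leading count, then one (value, id) pair per event. The following is a minimal sketch of that layout only; the names `group_value` and `group_diff` are hypothetical helpers and not part of this repository, and unlike `counterdiff` the sketch does not skip events that failed to open.

```c
#include <stdint.h>

/*
 * Assumed layout of one group read with PERF_FORMAT_GROUP | PERF_FORMAT_ID:
 *   buf[0]       number of events in the group
 *   buf[1 + 2*j] value of event j
 *   buf[2 + 2*j] id of event j
 * group_value is a hypothetical helper that picks out the value of event j.
 */
static uint64_t group_value(const uint64_t *buf, int j)
{
    return buf[1 + 2 * j];
}

/* Hypothetical helper: per-event difference between two group snapshots. */
static void group_diff(uint64_t *out, const uint64_t *start,
                       const uint64_t *end, int nr)
{
    for (int j = 0; j < nr; j++)
        out[j] = group_value(end, j) - group_value(start, j);
}
```

With two events in the group, each snapshot holds five 64-bit words, which matches the `2 * EVENT_COUNT + 1` sizing of `struct counters`.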
206 | /* print test results */ 207 | /* https://golang.org/design/14313-benchmark-format */ 208 | static void 209 | print_results( 210 | const char *name, 211 | struct counters *start, struct counters *end, 212 | size_t n, size_t m) { 213 | uint64_t counts[EVENT_COUNT]; 214 | double elapsed; 215 | 216 | elapsed = tsdiff(start->ts, end->ts); 217 | counterdiff(counts, start->counters, end->counters); 218 | 219 | if (name == NULL || name[0] == '\0') 220 | name = " "; 221 | 222 | printf("Benchmark%c%s/%zuB\t%10zu\t" 223 | "%.8g ns/op\t%.8g MB/s", 224 | toupper(name[0]), name+1, n, m, 225 | (elapsed * 1e9) / m, (1e-6 * n * m) / elapsed); 226 | 227 | if (num_counters == EVENT_COUNT) { 228 | printf("\t%.8g cy/B\t%.8g ins/B\t%.8g ipc\n", 229 | counts[0]/((double)n * m), counts[1]/((double)n * m), 230 | (double)counts[1] / counts[0]); 231 | } else 232 | putchar('\n'); 233 | } 234 | 235 | /* run one test case for the specified n and print the result */ 236 | static void 237 | run_test(const char *name, testfunc *test, void *payload, size_t n) 238 | { 239 | struct counters start, end; 240 | size_t m = 1; 241 | int first_run = 1; 242 | 243 | /* repeatedly run benchmark and adjust m until result is meaningful */ 244 | for (;; first_run = 0) { 245 | double elapsed; 246 | size_t newm; 247 | int res; 248 | 249 | /* printf("RUN %s: n = %zu m = %zu\n", name, n, m); */ 250 | 251 | res = reset_counters(&start); 252 | if (res != 0) { 253 | printf("FAIL\t%s\n", name); 254 | return; 255 | } 256 | 257 | res = test(&start, payload, n, m); 258 | if (res != 0) { 259 | printf("FAIL\t%s\n", name); 260 | return; 261 | } 262 | 263 | res = reset_counters(&end); 264 | if (res != 0) { 265 | printf("FAIL\t%s\n", name); 266 | return; 267 | } 268 | 269 | elapsed = tsdiff(start.ts, end.ts); 270 | // printf("m = %zu, elapsed = %f s\n", m, elapsed); 271 | if (elapsed < TIME_GOAL) { 272 | if (elapsed < TIME_GOAL * 0.5) 273 | m *= 2; 274 | else { 275 | /* try to overshoot the TIME_GOAL time goal slightly */ 276 |
newm = ceil(m * TIME_GOAL * 1.05 / elapsed); 277 | m = newm > m ? newm : m + 1; 278 | } 279 | 280 | continue; 281 | } 282 | 283 | /* make sure to perform at least one warm-up iteration */ 284 | if (!first_run) 285 | break; 286 | } 287 | 288 | print_results(name, &start, &end, n, m); 289 | } 290 | 291 | /* test positional population count function payload with an n-byte array */ 292 | static int 293 | test_pospop_u16(struct counters *c, void *payload, size_t n, size_t m) 294 | { 295 | pospopcnt_u16 *pospop = (pospopcnt_u16 *)payload; 296 | uint16_t *data; 297 | size_t len = n / sizeof *data, i; 298 | long int num; 299 | int res; 300 | flags_type flags[16], accum = 0; 301 | volatile flags_type sum; 302 | 303 | memset(flags, 0, sizeof flags); 304 | data = memory_allocate(len * sizeof *data); 305 | if (data == NULL) { 306 | perror("memory_allocate"); 307 | return (-1); 308 | } 309 | 310 | srand48(42); 311 | for (i = 0; i < len; i += 2) { 312 | num = mrand48(); 313 | data[i] = num & 0xffff; 314 | data[i+1] = num >> 16 & 0xffff; 315 | } 316 | 317 | /* skip initialisation step in benchmark measurements */ 318 | res = reset_counters(c); 319 | if (res != 0) 320 | return (-1); 321 | 322 | for (i = 0; i < m; i++) 323 | pospop(data, len, flags); 324 | 325 | /* make sure the result is used */ 326 | accum = 0; 327 | for (i = 0; i < 16; i++) 328 | accum += flags[i]; 329 | sum = accum; 330 | 331 | free(data); 332 | 333 | return (0); 334 | } 335 | 336 | extern int 337 | main(int argc, char *argv[]) 338 | { 339 | size_t n; 340 | int i, j; 341 | char *fake_argv[] = { argv[0], "100000", NULL }; 342 | 343 | setlinebuf(stdout); 344 | init_counters(); 345 | 346 | if (argc < 2) { 347 | argv = fake_argv; 348 | argc = 2; 349 | } 350 | 351 | for (j = 1; j < argc; j++) { 352 | n = atoll(argv[j]); 353 | for (i = 0; methods[i].method != NULL; i++) { 354 | run_test(methods[i].name, test_pospop_u16, methods[i].method, n); 355 | } 356 | } 357 | } 358 |
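Every method in the benchmark above shares the `pospopcnt_u16` signature and accumulates into a 16-entry `flags` array. As a rough illustration of what such a function is expected to compute, here is a minimal scalar sketch, assuming that `flags[i]` is incremented once for every input word whose bit `i` is set. The name `pospopcnt_u16_reference` is hypothetical; the repository's own scalar version lives in `include/pospopcnt.h`, which is not shown here, and may be written differently.

```c
#include <stdint.h>

/*
 * Hypothetical reference implementation of a positional population count:
 * for each of the 16 bit positions, add the number of words in
 * data[0..len) that have that bit set to the matching flags entry.
 * The counters are accumulated, not overwritten, so the caller zeroes them.
 */
static void pospopcnt_u16_reference(const uint16_t *data, uint32_t len,
                                    uint32_t flags[16])
{
    for (uint32_t i = 0; i < len; i++) {
        for (int bit = 0; bit < 16; bit++)
            flags[bit] += (data[i] >> bit) & 1;
    }
}
```

For the three words `0x0001`, `0x0003` and `0x8001`, bit 0 is set in all three, bit 1 in one and bit 15 in one, so a zeroed `flags` array ends up with 3, 1 and 1 at those positions and 0 elsewhere.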
-------------------------------------------------------------------------------- /benchmark/linux/aligned_alloc.h: -------------------------------------------------------------------------------- 1 | // functions borrowed from 2 | // https://github.com/lemire/simdjson/blob/master/include/simdjson/portability.h 3 | #pragma once 4 | 5 | #ifdef __cplusplus 6 | #include 7 | #include 8 | #else 9 | #include 10 | #include 11 | #endif 12 | 13 | template <typename T> static inline int get_alignment(T *data) { 14 | uintptr_t addr = reinterpret_cast<uintptr_t>(data); 15 | return addr & ~(addr - 1); 16 | } 17 | 18 | // portable version of posix_memalign 19 | static inline void *aligned_malloc(size_t alignment, size_t size) { 20 | void *p; 21 | #ifdef _MSC_VER 22 | p = _aligned_malloc(size, alignment); 23 | #elif defined(__MINGW32__) || defined(__MINGW64__) 24 | p = __mingw_aligned_malloc(size, alignment); 25 | #else 26 | // somehow, if this is used before including "x86intrin.h", it creates an 27 | // implicit defined warning.
28 | if (posix_memalign(&p, alignment, size) != 0) 29 | return NULL; 30 | #endif 31 | return p; 32 | } 33 | 34 | static inline void aligned_free(void *memblock) { 35 | if (memblock == NULL) 36 | return; 37 | #ifdef _MSC_VER 38 | _aligned_free(memblock); 39 | #elif defined(__MINGW32__) || defined(__MINGW64__) 40 | __mingw_aligned_free(memblock); 41 | #else 42 | free(memblock); 43 | #endif 44 | } 45 | -------------------------------------------------------------------------------- /benchmark/linux/instrumented_benchmark.cpp: -------------------------------------------------------------------------------- 1 | #include 2 | #if FLAGSIZE == 64 3 | typedef uint64_t flags_type; 4 | #else 5 | typedef uint32_t flags_type; 6 | #endif 7 | 8 | #ifdef __linux__ 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | #include 17 | #include 18 | #include 19 | #include 20 | #include 21 | #include 22 | 23 | #include "linux-perf-events.h" 24 | #include "aligned_alloc.h" 25 | #ifdef __x86_64__ 26 | #include "pospopcnt_avx512bw.h" 27 | #endif 28 | #include "pospopcnt.h" 29 | #ifdef ALIGN 30 | #include "memalloc.h" 31 | #define memory_allocate(size) aligned_alloc(64, (size)) 32 | #else 33 | #define memory_allocate(size) malloc(size) 34 | #endif 35 | 36 | // command line options 37 | enum { 38 | OPT_VERBOSE = 1 << 0, 39 | OPT_TEST = 1 << 1, 40 | OPT_COMPENSATE = 1 << 2, 41 | OPT_TOUCH = 1 << 3, 42 | OPT_FORCE = 1 << 4, 43 | }; 44 | 45 | // Function pointer definition. 
46 | typedef void (*pospopcnt_u16_method_type)(const uint16_t *data, uint32_t len, 47 | flags_type *flags); 48 | 49 | #ifdef __x86_64__ 50 | extern "C" void count16avx512(flags_type flags[16], const uint16_t *buf, size_t len); 51 | extern "C" void count16avx2(flags_type flags[16], const uint16_t *buf, size_t len); 52 | 53 | static void pospopcnt_count16avx512(const uint16_t *data, uint32_t len, flags_type *flags) 54 | { 55 | count16avx512(flags, data, len); 56 | } 57 | 58 | static void pospopcnt_count16avx2(const uint16_t *data, uint32_t len, flags_type *flags) 59 | { 60 | count16avx2(flags, data, len); 61 | } 62 | #endif 63 | 64 | #ifdef __aarch64__ 65 | extern "C" void count16neon(flags_type flags[16], const uint16_t *buf, size_t len); 66 | 67 | static void pospopcnt_count16neon(const uint16_t *data, uint32_t len, flags_type *flags) 68 | { 69 | count16neon(flags, data, len); 70 | } 71 | #endif 72 | 73 | // dummy for taking the overhead 74 | static void pospopcnt_dummy(const uint16_t *data, uint32_t len, flags_type *flags) 75 | { 76 | (void)data; 77 | (void)len; 78 | (void)flags; 79 | } 80 | 81 | static const struct { 82 | pospopcnt_u16_method_type method; 83 | const char *name; 84 | } methods[] = { 85 | { pospopcnt_u16_scalar, "pospopcnt_u16_scalar" }, 86 | #ifdef __x86_64__ 87 | { pospopcnt_u16_avx512bw_harvey_seal_1KB, "pospopcnt_u16_avx512bw_harvey_seal_1KB" }, 88 | { pospopcnt_u16_avx512bw_harvey_seal_512B, "pospopcnt_u16_avx512bw_harvey_seal_512B" }, 89 | { pospopcnt_u16_avx512bw_harvey_seal_256B, "pospopcnt_u16_avx512bw_harvey_seal_256B" }, 90 | { pospopcnt_count16avx512, "pospopcnt_count16avx512" }, 91 | { pospopcnt_count16avx2, "pospopcnt_count16avx2" }, 92 | #endif 93 | #ifdef __aarch64__ 94 | { pospopcnt_count16neon, "pospopcnt_count16neon" }, 95 | #endif 96 | { NULL, NULL }, 97 | }; 98 | 99 | void print16(flags_type *flags) { 100 | for (int k = 0; k < 16; k++) 101 | printf(" %8ju ", (uintmax_t)flags[k]); 102 | printf("\n"); 103 | } 104 | 105 | 
std::vector 106 | compute_mins(std::vector > allresults) { 107 | if (allresults.size() == 0) 108 | return std::vector(); 109 | 110 | std::vector answer = allresults[0]; 111 | 112 | for (size_t k = 1; k < allresults.size(); k++) { 113 | assert(allresults[k].size() == answer.size()); 114 | for (size_t z = 0; z < answer.size(); z++) { 115 | if (allresults[k][z] < answer[z]) 116 | answer[z] = allresults[k][z]; 117 | } 118 | } 119 | return answer; 120 | } 121 | 122 | std::vector 123 | compute_averages(std::vector > allresults) { 124 | if (allresults.size() == 0) 125 | return std::vector(); 126 | 127 | std::vector answer(allresults[0].size()); 128 | 129 | for (size_t k = 0; k < allresults.size(); k++) { 130 | assert(allresults[k].size() == answer.size()); 131 | for (size_t z = 0; z < answer.size(); z++) { 132 | answer[z] += allresults[k][z]; 133 | } 134 | } 135 | 136 | for (size_t z = 0; z < answer.size(); z++) { 137 | answer[z] /= allresults.size(); 138 | } 139 | return answer; 140 | } 141 | 142 | class BenchmarkState { 143 | std::vector evts = { 144 | PERF_COUNT_HW_CPU_CYCLES, 145 | PERF_COUNT_HW_INSTRUCTIONS, 146 | PERF_COUNT_HW_BRANCH_MISSES, 147 | PERF_COUNT_HW_CACHE_REFERENCES, 148 | PERF_COUNT_HW_CACHE_MISSES, 149 | PERF_COUNT_HW_REF_CPU_CYCLES 150 | }; 151 | 152 | BenchmarkState *overhead = nullptr; 153 | LinuxEvents unified; 154 | std::vector results; // tmp buffer 155 | std::chrono::time_point start; 156 | std::vector > allresults; 157 | std::vector timings; 158 | std::vector freqs; 159 | bool in_progress = false; 160 | 161 | public: 162 | BenchmarkState() : unified(evts) { 163 | results.resize(evts.size()); 164 | } 165 | 166 | BenchmarkState(BenchmarkState *oh) : unified(evts) { 167 | overhead = oh; 168 | results.resize(evts.size()); 169 | } 170 | 171 | // begin measurement of a benchmark iteration 172 | void begin() { 173 | assert(!in_progress); 174 | in_progress = true; 175 | 176 | start = std::chrono::steady_clock::now(); 177 | unified.start(); 178 | } 179 
| 180 | // end measurement of a benchmark iteration 181 | void end() { 182 | assert(in_progress); 183 | 184 | unified.end(results); 185 | auto end = std::chrono::steady_clock::now(); 186 | std::chrono::duration secs = end - start; 187 | double time_in_s = secs.count(); 188 | timings.push_back(time_in_s); 189 | freqs.push_back(results[0]/(1e9*time_in_s)); 190 | allresults.push_back(results); 191 | 192 | in_progress = false; 193 | } 194 | 195 | void printResults(bool verbose, uint32_t n, uint32_t m) 196 | { 197 | std::vector mins = compute_mins(allresults); 198 | std::vector avg = compute_averages(allresults); 199 | double min_timing = *min_element(timings.begin(), timings.end()); 200 | double min_freq = *min_element(freqs.begin(), freqs.end()); 201 | double max_freq = *max_element(freqs.begin(), freqs.end()); 202 | 203 | // compensate for measuring overhead by subtracting the overhead 204 | if (overhead != nullptr) { 205 | std::vector oh_mins = compute_mins(overhead->allresults); 206 | std::vector oh_avg = compute_averages(overhead->allresults); 207 | min_timing -= *min_element(overhead->timings.begin(), overhead->timings.end()); 208 | 209 | assert(mins.size() == oh_mins.size()); 210 | for (size_t i = 0; i < mins.size(); i++) { 211 | mins[i] -= oh_mins[i]; 212 | avg[i] -= oh_avg[i]; 213 | } 214 | } 215 | 216 | double speedinGBs = (m * n * sizeof(uint16_t)) / (min_timing * 1e9); 217 | 218 | if (verbose) { 219 | printf("instructions per cycle %4.2f, cycles per 16-bit word: %4.3f, " 220 | "instructions per 16-bit word %4.3f \n", 221 | double(mins[1]) / mins[0], double(mins[0]) / (m * n), 222 | double(mins[1]) / (n * m)); 223 | // first we display mins 224 | printf("min: %8llu cycles, %8llu instructions, \t%8llu branch mis., %8llu " 225 | "cache ref., %8llu cache mis.\n", 226 | mins[0], mins[1], mins[2], mins[3], mins[4]); 227 | printf("avg: %8.1f cycles, %8.1f instructions, \t%8.1f branch mis., %8.1f " 228 | "cache ref., %8.1f cache mis.\n", 229 | avg[0], avg[1], 
avg[2], avg[3], avg[4]); 230 | printf(" %4.3f GB/s \n", speedinGBs); 231 | printf("estimated clock in range %4.3f GHz to %4.3f GHz\n", min_freq, max_freq); 232 | } else { 233 | printf("cycles per 16-bit word: %4.3f; ref cycles per 16-bit word: %4.3f; speed in GB/s %4.3f \n", 234 | double(mins[0]) / (n * m), double(mins[5]) / (n * m), speedinGBs); 235 | } 236 | } 237 | }; 238 | 239 | // initialise all subarrays of the vdata array 240 | template <typename C> 241 | void init_vdata(C &vdata) 242 | { 243 | std::mt19937 gen; 244 | 245 | for (size_t k = 0; k < vdata.size(); k++) { 246 | for (size_t k2 = 0; k2 < vdata[k].size(); k2++) { 247 | vdata[k][k2] = gen() & 0xffff; // initialise to random integer 248 | } 249 | } 250 | } 251 | 252 | // Read all array entries and sum them up. Discard the sum. 253 | // The purpose of this is to ensure that the array has been 254 | // recently accessed. 255 | template <typename C> 256 | void touch(C &vdata) 257 | { 258 | int sum = 0; 259 | volatile int total; 260 | 261 | for (size_t k = 0; k < vdata.size(); k++) { 262 | sum += vdata[k]; 263 | } 264 | 265 | total = sum; 266 | } 267 | 268 | /** 269 | * @brief 270 | * 271 | * @param n Number of integers. 272 | * @param m Number of arrays. 273 | * @param iterations Number of iterations. 274 | * @param fn Target function pointer. 275 | * @param options Command line options 276 | * @return Benchmark results.
277 | */ 278 | template 279 | BenchmarkState benchmarkMany(C & vdata, BenchmarkState *overhead, uint32_t n, uint32_t m, 280 | uint32_t iterations, 281 | pospopcnt_u16_method_type fn, int options) { 282 | #ifdef ALIGN 283 | for (auto &x : vdata) { 284 | assert(get_alignment(x.data()) == 64); 285 | } 286 | #endif 287 | BenchmarkState bench(overhead); 288 | 289 | init_vdata(vdata); 290 | 291 | uint32_t test_iterations = 1; // we run one test iteration 292 | for (uint32_t i = 0; i < test_iterations; i++) { 293 | std::vector > correctflags(m, 294 | std::vector(16)); 295 | for (size_t k = 0; k < m; k++) { 296 | pospopcnt_u16_scalar(vdata[k].data(), vdata[k].size(), 297 | correctflags[k].data()); // this is our gold standard 298 | } 299 | std::vector > flags(m, std::vector(16)); 300 | for (size_t k = 0; k < m; k++) { 301 | fn(vdata[k].data(), vdata[k].size(), flags[k].data()); 302 | } 303 | 304 | uint64_t tot_obs = 0; 305 | for (size_t km = 0; km < m; ++km) 306 | for (size_t k = 0; k < 16; ++k) 307 | tot_obs += flags[km][k]; 308 | if (tot_obs == 0 && options & OPT_TEST) { // when a method is not supported it returns all zero 309 | printf("method not supported\n"); 310 | } 311 | for (size_t km = 0; km < m; ++km) { 312 | for (size_t k = 0; k < 16; k++) { 313 | if (correctflags[km][k] != flags[km][k]) { 314 | if (options & OPT_TEST) { 315 | printf("bug:\n"); 316 | printf("expected : "); 317 | print16(correctflags[km].data()); 318 | printf("got : "); 319 | print16(flags[km].data()); 320 | } 321 | } 322 | } 323 | } 324 | } 325 | 326 | for (uint32_t i = 0; i < iterations; i++) { 327 | std::vector > flags(m, std::vector(16)); 328 | bench.begin(); 329 | for (size_t k = 0; k < m; k++) { 330 | if (options & OPT_TOUCH) { 331 | touch(vdata[k]); 332 | } 333 | fn(vdata[k].data(), vdata[k].size(), flags[k].data()); 334 | } 335 | bench.end(); 336 | } 337 | 338 | bench.printResults(options & OPT_VERBOSE, n, m); 339 | 340 | return bench; 341 | } 342 | 343 | template 344 | void 
benchmarkCopy(C & vdata, BenchmarkState *overhead, uint32_t n, uint32_t m, 345 | uint32_t iterations, int options) { 346 | size_t maxsize = 0; 347 | #ifdef ALIGN 348 | for (auto &x : vdata) { 349 | if(maxsize < x.size()) maxsize = x.size(); 350 | assert(get_alignment(x.data()) == 64); 351 | } 352 | #endif 353 | for (auto &x : vdata) { 354 | if(maxsize < x.size()) maxsize = x.size(); 355 | } 356 | 357 | init_vdata(vdata); 358 | 359 | BenchmarkState bench(overhead); 360 | std::vector copybuf(maxsize); 361 | 362 | for (uint32_t i = 0; i < iterations; i++) { 363 | std::vector > flags(m, std::vector(16)); 364 | bench.begin(); 365 | for (size_t k = 0; k < m; k++) { 366 | if (options & OPT_TOUCH) { 367 | touch(vdata[k]); 368 | } 369 | ::memcpy(copybuf.data(),vdata[k].data(),vdata[k].size()); 370 | } 371 | bench.end(); 372 | } 373 | 374 | bench.printResults(options & OPT_VERBOSE, n, m); 375 | } 376 | 377 | static void print_usage(char *command) { 378 | printf(" Try %s -n 100000 -i 15 -v\n", command); 379 | printf("-c compensate overhead in measurements\n"); 380 | printf("-f force use of suboptimal benchmark parameters\n"); 381 | printf("-m number of arrays\n"); 382 | printf("-n number of 16-bit words per array\n"); 383 | printf("-i number of iterations\n"); 384 | printf("-t load arrays into cache before benchmarking\n"); 385 | printf("-v enable verbose (perf counter) output\n"); 386 | } 387 | 388 | int main(int argc, char **argv) { 389 | size_t n = 10000000; 390 | size_t m = 1; 391 | size_t iterations = 0; 392 | int options = OPT_TEST; 393 | int c; 394 | 395 | while ((c = getopt(argc, argv, "cfi:hm:n:tv")) != -1) { 396 | switch (c) { 397 | case 'c': 398 | options |= OPT_COMPENSATE; 399 | break; 400 | case 'f': 401 | options |= OPT_FORCE; 402 | break; 403 | case 't': 404 | options |= OPT_TOUCH; 405 | break; 406 | case 'n': 407 | n = atoll(optarg); 408 | break; 409 | case 'm': 410 | m = atoll(optarg); 411 | break; 412 | case 'v': 413 | options |= OPT_VERBOSE; 414 | break; 
415 | case 'h': 416 | print_usage(argv[0]); 417 | return EXIT_SUCCESS; 418 | case 'i': 419 | iterations = atoi(optarg); 420 | break; 421 | default: 422 | print_usage(argv[0]); 423 | return EXIT_FAILURE; 424 | } 425 | } 426 | 427 | if (n > UINT32_MAX) { 428 | printf("setting n to %u \n", UINT32_MAX); 429 | n = UINT32_MAX; 430 | } 431 | 432 | if (iterations > UINT32_MAX) { 433 | printf("setting iterations to %u \n", UINT32_MAX); 434 | iterations = UINT32_MAX; 435 | } 436 | 437 | if (iterations == 0) { 438 | iterations = 100; 439 | } 440 | 441 | size_t min_volume = 1000000; 442 | if(~options & OPT_FORCE && m * n < min_volume) { 443 | printf("The benchmark is designed to measure the time in units of m*n inputs.\n"); 444 | printf("But your choices make m*n too small, so increasing m.\n"); 445 | while(m * n < min_volume) { 446 | m++; 447 | } 448 | } 449 | 450 | printf("n = %zu m = %zu \n", n, m); 451 | printf("iterations = %zu \n", iterations); 452 | if (n == 0) { 453 | printf("n cannot be zero.\n"); 454 | return EXIT_FAILURE; 455 | } 456 | 457 | size_t array_in_bytes = sizeof(uint16_t) * n * m; 458 | if (array_in_bytes < 1024) { 459 | printf("array size: %zu B\n", array_in_bytes); 460 | } else if (array_in_bytes < 1024 * 1024) { 461 | printf("array size: %.3f kB\n", array_in_bytes / 1024.); 462 | } else { 463 | printf("array size: %.3f MB\n", array_in_bytes / (1024 * 1024.)); 464 | } 465 | 466 | int maxtrial = 3; 467 | #ifdef ALIGN 468 | std::vector > > vdata( 469 | m, std::vector >(n)); 470 | #else 471 | std::vector > vdata(m, std::vector(n)); 472 | #endif 473 | 474 | printf("%-40s\t", "overhead"); 475 | auto ohbench = benchmarkMany(vdata, nullptr, n, m, iterations, pospopcnt_dummy, options & ~OPT_TEST); 476 | BenchmarkState *overhead = options & OPT_COMPENSATE ? 
&ohbench : nullptr; 477 | 478 | printf("%-40s\t", "memcpy"); 479 | benchmarkCopy(vdata, overhead, n, m, iterations, options); 480 | printf("\n"); 481 | 482 | for (int t = 0; t < maxtrial; t++) { 483 | printf("\n== Trial %d out of %d \n", t + 1, maxtrial); 484 | for (size_t k = 0; methods[k].name != NULL; k++) { 485 | printf("\n"); 486 | printf("%-40s\t", methods[k].name); 487 | fflush(NULL); 488 | benchmarkMany(vdata, overhead, n, m, iterations, methods[k].method, options); 489 | if (options & OPT_VERBOSE) 490 | printf("\n"); 491 | } 492 | } 493 | if (~options & OPT_VERBOSE) 494 | printf("Try -v to get more details.\n"); 495 | 496 | return EXIT_SUCCESS; 497 | } 498 | #else // __linux__ 499 | 500 | #include 501 | #include 502 | 503 | int main() { 504 | printf("This is a linux-specific benchmark\n"); 505 | return EXIT_SUCCESS; 506 | } 507 | 508 | #endif 509 | -------------------------------------------------------------------------------- /benchmark/linux/linux-perf-events.h: -------------------------------------------------------------------------------- 1 | // https://github.com/WojciechMula/toys/blob/master/000helpers/linux-perf-events.h 2 | #pragma once 3 | #ifdef __linux__ 4 | 5 | #include // for __NR_perf_event_open 6 | #include // for perf event constants 7 | #include // for ioctl 8 | #include // for syscall 9 | #include 10 | #include // for errno 11 | #include // for memset 12 | #include 13 | 14 | #include 15 | 16 | template class LinuxEvents { 17 | int group; 18 | bool working; 19 | perf_event_attr attribs; 20 | int num_events; 21 | std::vector temp_result_vec; 22 | std::vector fds; 23 | 24 | public: 25 | explicit LinuxEvents(std::vector config_vec) : group(-1), working(true) { 26 | memset(&attribs, 0, sizeof(attribs)); 27 | attribs.type = TYPE; 28 | attribs.size = sizeof(attribs); 29 | attribs.disabled = 1; 30 | attribs.exclude_kernel = 1; 31 | attribs.exclude_hv = 1; 32 | 33 | attribs.sample_period = 0; 34 | attribs.read_format = PERF_FORMAT_GROUP | 
PERF_FORMAT_ID; 35 | const int pid = 0; // the current process 36 | const int cpu = -1; // all CPUs 37 | const unsigned long flags = 0; 38 | 39 | num_events = config_vec.size(); 40 | uint32_t i = 0; 41 | for (auto config : config_vec) { 42 | attribs.config = config; 43 | int fd = syscall(__NR_perf_event_open, &attribs, pid, cpu, group, flags); 44 | if (fd == -1) { 45 | report_error("perf_event_open"); 46 | } 47 | 48 | fds.push_back(fd); 49 | if (group == -1) { 50 | group = fd; 51 | } 52 | } 53 | 54 | temp_result_vec.resize(num_events * 2 + 1); 55 | } 56 | 57 | ~LinuxEvents() { 58 | for (auto fd : fds) { 59 | close(fd); 60 | } 61 | } 62 | 63 | inline void start() { 64 | if (ioctl(group, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) { 65 | report_error("ioctl(PERF_EVENT_IOC_RESET)"); 66 | } 67 | 68 | if (ioctl(group, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) { 69 | report_error("ioctl(PERF_EVENT_IOC_ENABLE)"); 70 | } 71 | } 72 | 73 | inline void end(std::vector &results) { 74 | if (ioctl(group, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP) == -1) { 75 | report_error("ioctl(PERF_EVENT_IOC_DISABLE)"); 76 | } 77 | 78 | if (read(group, &temp_result_vec[0], temp_result_vec.size() * 8) == -1) { 79 | report_error("read"); 80 | } 81 | // our actual results are in slots 1,3,5, ... 
of this structure 82 | // we really should be checking our ids obtained earlier to be safe 83 | for (uint32_t i = 1; i < temp_result_vec.size(); i += 2) { 84 | results[i / 2] = temp_result_vec[i]; 85 | } 86 | } 87 | 88 | private: 89 | void report_error(const std::string &context) { 90 | if (working) 91 | std::cerr << (context + ": " + std::string(strerror(errno))) << std::endl; 92 | working = false; 93 | } 94 | }; 95 | #endif 96 | -------------------------------------------------------------------------------- /benchmark/linux/memalloc.h: -------------------------------------------------------------------------------- 1 | #pragma once 2 | 3 | template T *moveToBoundary(T *inbyte) { 4 | return reinterpret_cast( 5 | (reinterpret_cast(inbyte) + (alignment - 1)) & 6 | ~(alignment - 1)); 7 | } 8 | 9 | // use this when calling STL object if you want 10 | // their memory to be aligned on cache lines 11 | template class AlignedSTLAllocator { 12 | public: 13 | // type definitions 14 | typedef T value_type; 15 | typedef T *pointer; 16 | typedef const T *const_pointer; 17 | typedef T &reference; 18 | typedef const T &const_reference; 19 | typedef std::size_t size_type; 20 | typedef std::ptrdiff_t difference_type; 21 | 22 | // rebind allocator to type U 23 | template struct rebind { 24 | typedef AlignedSTLAllocator other; 25 | }; 26 | 27 | pointer address(reference value) const { return &value; } 28 | const_pointer address(const_reference value) const { return &value; } 29 | 30 | /* constructors and destructor 31 | * - nothing to do because the allocator has no state 32 | */ 33 | AlignedSTLAllocator() {} 34 | AlignedSTLAllocator(const AlignedSTLAllocator &) {} 35 | template 36 | AlignedSTLAllocator(const AlignedSTLAllocator &) {} 37 | ~AlignedSTLAllocator() {} 38 | 39 | // return maximum number of elements that can be allocated 40 | size_type max_size() const throw() { 41 | return (std::numeric_limits::max)() / sizeof(T); 42 | } 43 | 44 | /* 45 | * allocate but don't 
initialize num elements of type T 46 | * 47 | * This implementation is potentially unsafe on some compilers. 48 | */ 49 | pointer allocate(size_type num, const void * = 0) { 50 | /** 51 | * The nasty trick here is to store, in the word just before the aligned 52 | * pointer, its offset from the start of the newly allocated memory. The alternative is to use 53 | * some kind of data structure like a hash table, which might be slow. 54 | */ 55 | size_t *buffer = reinterpret_cast( 56 | ::operator new(sizeof(uintptr_t) + (num + alignment) * sizeof(T))); 57 | size_t *answer = moveToBoundary(buffer + 1); 58 | *(answer - 1) = reinterpret_cast(answer) - 59 | reinterpret_cast(buffer); 60 | return reinterpret_cast(answer); 61 | } 62 | 63 | void construct(pointer p, const T &value) { 64 | // initialize memory with placement new 65 | new (p) T(value); 66 | } 67 | 68 | // destroy elements of initialized storage p 69 | void destroy(pointer p) { p->~T(); } 70 | 71 | // deallocate storage p, using the offset that allocate() stored just before it 72 | void deallocate(pointer p, size_type /*num*/) { 73 | const size_t *assize_t = reinterpret_cast(p); 74 | const size_t offset = assize_t[-1]; 75 | ::operator delete( 76 | reinterpret_cast(reinterpret_cast(p) - offset)); 77 | } 78 | }; 79 | 80 | // for our purposes, we don't want to distinguish between allocators. 
81 | template 82 | bool operator==(const AlignedSTLAllocator &, const T2 &) throw() { 83 | return true; 84 | } 85 | 86 | template 87 | bool operator!=(const AlignedSTLAllocator &, const T2 &) throw() { 88 | return false; 89 | } 90 | // typical cache line 91 | typedef AlignedSTLAllocator cacheallocator; 92 | -------------------------------------------------------------------------------- /benchmark/linux/stream_benchmark.cpp: -------------------------------------------------------------------------------- 1 | #ifdef __linux__ 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | #include 11 | #include 12 | #include 13 | #include 14 | #include 15 | #include 16 | 17 | #include "linux-perf-events.h" 18 | #include "aligned_alloc.h" 19 | #include "pospopcnt_avx512bw.h" 20 | #include "pospopcnt.h" 21 | #ifdef ALIGN 22 | #include "memalloc.h" 23 | #define memory_allocate(size) aligned_alloc(64, (size)) 24 | #else 25 | #define memory_allocate(size) malloc(size) 26 | #endif 27 | 28 | // Function pointer definition. 
29 | typedef void (*pospopcnt_u16_method_type)(const uint16_t *data, uint32_t len, 30 | uint32_t *flags); 31 | #define PPOPCNT_NUMBER_METHODS 4 32 | pospopcnt_u16_method_type pospopcnt_u16_methods[] = { 33 | pospopcnt_u16_scalar, pospopcnt_u16_avx512bw_harvey_seal_1KB, pospopcnt_u16_avx512bw_harvey_seal_512B, pospopcnt_u16_avx512bw_harvey_seal_256B 34 | }; 35 | 36 | static const char *const pospopcnt_u16_method_names[] = { 37 | "pospopcnt_u16_scalar", "pospopcnt_u16_avx512bw_harvey_seal_1KB", "pospopcnt_u16_avx512bw_harvey_seal_512B", "pospopcnt_u16_avx512bw_harvey_seal_256B" 38 | }; 39 | 40 | 41 | template 42 | double benchmark(C & vdata, uint32_t n, pospopcnt_u16_method_type fn) { 43 | std::vector timings; 44 | std::vector correctflags(16); 45 | pospopcnt_u16_scalar(vdata.data(), n, correctflags.data()); // this is our gold standard 46 | for(size_t i = 0; i < 100; i++) { 47 | std::vector flags(16); 48 | auto start = std::chrono::steady_clock::now(); 49 | fn(vdata.data(), n, flags.data()); 50 | auto end = std::chrono::steady_clock::now(); 51 | if(correctflags != flags) { throw std::runtime_error("bug\n"); } 52 | std::chrono::duration secs = end - start; 53 | double time_in_s = secs.count(); 54 | timings.push_back(time_in_s); 55 | } 56 | double min_timing = *min_element(timings.begin(), timings.end()); 57 | double speedinGBs = (n * sizeof(uint16_t)) / (min_timing * 1000000000.0); 58 | return speedinGBs; 59 | } 60 | 61 | 62 | int main(int argc, char **argv) { 63 | size_t max_val = 536870912; 64 | std::vector vdata(max_val); 65 | std::random_device rd; 66 | std::mt19937 gen(rd()); 67 | std::uniform_int_distribution<> dis(0, 0xFFFF); 68 | for (size_t k2 = 0; k2 < vdata.size(); k2++) { 69 | vdata[k2] = dis(gen); // random init. 
70 | } 71 | printf("#"); 72 | for (size_t k = 0; k < PPOPCNT_NUMBER_METHODS; k++) { 73 | printf("\t"); 74 | printf("%-40s\t", pospopcnt_u16_method_names[k]); 75 | fflush(NULL); 76 | } 77 | printf("\n"); 78 | 79 | for (int n = 1024; n <= max_val; n*=2) { 80 | printf("%d ", n); 81 | for (size_t k = 0; k < PPOPCNT_NUMBER_METHODS; k++) { 82 | printf("\t"); 83 | fflush(NULL); 84 | double speed = benchmark(vdata,n, pospopcnt_u16_methods[k]); 85 | printf("%f\t",speed); 86 | } 87 | printf("\n"); 88 | } 89 | return EXIT_SUCCESS; 90 | } 91 | #else // __linux__ 92 | 93 | #include 94 | #include 95 | 96 | int main() { 97 | printf("This is a linux-specific benchmark\n"); 98 | return EXIT_SUCCESS; 99 | } 100 | 101 | #endif 102 | -------------------------------------------------------------------------------- /goscript.sh: -------------------------------------------------------------------------------- 1 | max=30 2 | n=10 3 | #output=golike 4 | 5 | #for i in `seq 0 $max` 6 | #do 7 | # for j in `seq $n` 8 | # do 9 | # ./golike_benchmark $((2**i)) 10 | # done 11 | #done >${output}_ij.out 12 | 13 | for j in `seq $n` 14 | do 15 | ./golike_benchmark 1 16 | for i in `seq 1 $max` 17 | do 18 | ./golike_benchmark $((2**i)) 19 | ./golike_benchmark $((2**(i-1)*3)) 20 | done 21 | done 22 | -------------------------------------------------------------------------------- /include/pospopcnt.h: -------------------------------------------------------------------------------- 1 | #ifndef POSPOPCNT_H 2 | #define POSPOPCNT_H 3 | 4 | #include 5 | 6 | #if defined(__GNUC__) && !defined(__clang__) 7 | __attribute__((optimize("no-tree-vectorize"))) 8 | #endif 9 | // given a stream of len 16-bit words (in data), generates 10 | // a histogram of 16 counts stored in flags, corresponding 11 | // to the number of set bits at the corresponding indexes (0,1,...,15). 
12 | static void pospopcnt_u16_scalar(const uint16_t *data, uint32_t len, 13 | flags_type *flags) { 14 | #if defined(__clang__) 15 | #pragma clang loop vectorize(disable) 16 | #endif 17 | for (uint32_t i = 0; i < len; ++i) { 18 | uint64_t w = data[i]; 19 | flags[0] += ((w >> 0) & 1); 20 | flags[1] += ((w >> 1) & 1); 21 | flags[2] += ((w >> 2) & 1); 22 | flags[3] += ((w >> 3) & 1); 23 | flags[4] += ((w >> 4) & 1); 24 | flags[5] += ((w >> 5) & 1); 25 | flags[6] += ((w >> 6) & 1); 26 | flags[7] += ((w >> 7) & 1); 27 | flags[8] += ((w >> 8) & 1); 28 | flags[9] += ((w >> 9) & 1); 29 | flags[10] += ((w >> 10) & 1); 30 | flags[11] += ((w >> 11) & 1); 31 | flags[12] += ((w >> 12) & 1); 32 | flags[13] += ((w >> 13) & 1); 33 | flags[14] += ((w >> 14) & 1); 34 | flags[15] += ((w >> 15) & 1); 35 | } 36 | } 37 | 38 | #endif // POSPOPCNT_H 39 | -------------------------------------------------------------------------------- /include/pospopcnt_avx512bw.h: -------------------------------------------------------------------------------- 1 | #ifndef POSPOPCNT_AVX512BW_H 2 | #define POSPOPCNT_AVX512BW_H 3 | 4 | #include 5 | #include 6 | 7 | #if defined(__AVX512BW__) && __AVX512BW__ == 1 8 | 9 | // carry-save adder used by the pospopcnt_u16_avx512bw_harvey_seal_* functions 10 | static inline void pospopcnt_csa_avx512(__m512i *__restrict__ h, 11 | __m512i *__restrict__ l, __m512i b, 12 | __m512i c) { 13 | *h = _mm512_ternarylogic_epi32(c, b, *l, 0xE8); // 11101000: majority (carry) 14 | *l = _mm512_ternarylogic_epi32(c, b, *l, 0x96); // 10010110: three-way XOR (sum) 15 | } 16 | 17 | // given a stream of len 16-bit words (in array), generates 18 | // a histogram of 16 counts stored in flags, corresponding 19 | // to the number of set bits at the corresponding indexes (0,1,...,15). 
20 | // 21 | // Uses 1KB blocks 22 | static void pospopcnt_u16_avx512bw_harvey_seal_1KB(const uint16_t *array, 23 | uint32_t len, flags_type *flags) { 24 | for (uint32_t i = len - (len % (32 * 16)); i < len; ++i) { 25 | for (int j = 0; j < 16; ++j) { 26 | flags[j] += (((array[i]) >> j) & 1); 27 | } 28 | } 29 | 30 | const __m512i *data = (const __m512i *)array; 31 | __m512i v1 = _mm512_setzero_si512(); 32 | __m512i v2 = _mm512_setzero_si512(); 33 | __m512i v4 = _mm512_setzero_si512(); 34 | __m512i v8 = _mm512_setzero_si512(); 35 | __m512i v16 = _mm512_setzero_si512(); 36 | __m512i twosA, twosB, foursA, foursB, eightsA, eightsB; 37 | __m512i one = _mm512_set1_epi16(1); 38 | __m512i counter[16]; 39 | 40 | const size_t size = len / 32; 41 | const uint64_t limit = size - size % 16; 42 | 43 | uint16_t buffer[32]; 44 | 45 | uint64_t i = 0; 46 | while (i < limit) { 47 | for (size_t i = 0; i < 16; ++i) 48 | counter[i] = _mm512_setzero_si512(); 49 | 50 | size_t thislimit = limit; 51 | if (thislimit - i >= (1 << 16)) 52 | thislimit = i + (1 << 16) - 1; 53 | 54 | for (/**/; i < thislimit; i += 16) { 55 | #define U(pos) \ 56 | { \ 57 | counter[pos] = _mm512_add_epi16( \ 58 | counter[pos], _mm512_and_si512(v16, _mm512_set1_epi16(1))); \ 59 | v16 = _mm512_srli_epi16(v16, 1); \ 60 | } 61 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0), 62 | _mm512_loadu_si512(data + i + 1)); 63 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2), 64 | _mm512_loadu_si512(data + i + 3)); 65 | pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB); 66 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 4), 67 | _mm512_loadu_si512(data + i + 5)); 68 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 6), 69 | _mm512_loadu_si512(data + i + 7)); 70 | pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB); 71 | pospopcnt_csa_avx512(&eightsA, &v4, foursA, foursB); 72 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 8), 73 | 
_mm512_loadu_si512(data + i + 9)); 74 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 10), 75 | _mm512_loadu_si512(data + i + 11)); 76 | pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB); 77 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 12), 78 | _mm512_loadu_si512(data + i + 13)); 79 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 14), 80 | _mm512_loadu_si512(data + i + 15)); 81 | pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB); 82 | pospopcnt_csa_avx512(&eightsB, &v4, foursA, foursB); 83 | U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13) 84 | U(14) U(15) // Updates 85 | pospopcnt_csa_avx512(&v16, &v8, eightsA, eightsB); 86 | } 87 | // Update the counters after the last iteration. 88 | for (size_t i = 0; i < 16; ++i) 89 | U(i) 90 | #undef U 91 | 92 | for (size_t i = 0; i < 16; ++i) { 93 | _mm512_storeu_si512((__m512i *)buffer, counter[i]); 94 | for (size_t z = 0; z < 32; z++) { 95 | flags[i] += 16 * (flags_type)buffer[z]; 96 | } 97 | } 98 | } 99 | 100 | _mm512_storeu_si512((__m512i *)buffer, v1); 101 | for (size_t i = 0; i < 32; i++) { 102 | for (int j = 0; j < 16; j++) { 103 | flags[j] += 1 * ((buffer[i] >> j) & 1); 104 | } 105 | } 106 | 107 | _mm512_storeu_si512((__m512i *)buffer, v2); 108 | for (size_t i = 0; i < 32; i++) { 109 | for (int j = 0; j < 16; j++) { 110 | flags[j] += 2 * ((buffer[i] >> j) & 1); 111 | } 112 | } 113 | 114 | _mm512_storeu_si512((__m512i *)buffer, v4); 115 | for (size_t i = 0; i < 32; i++) { 116 | for (int j = 0; j < 16; j++) { 117 | flags[j] += 4 * ((buffer[i] >> j) & 1); 118 | } 119 | } 120 | 121 | _mm512_storeu_si512((__m512i *)buffer, v8); 122 | for (size_t i = 0; i < 32; i++) { 123 | for (int j = 0; j < 16; j++) { 124 | flags[j] += 8 * ((buffer[i] >> j) & 1); 125 | } 126 | } 127 | } 128 | 129 | // given a stream of len 16-bit words (in array), generates 130 | // a histogram of 16 counts stored in flags, corresponding 131 | // to the number of set bits 
at the corresponding indexes (0,1,...,15). 132 | // 133 | // Uses 512B blocks 134 | static void pospopcnt_u16_avx512bw_harvey_seal_512B(const uint16_t *array, 135 | uint32_t len, 136 | flags_type *flags) { 137 | for (uint32_t i = len - (len % (32 * 8)); i < len; ++i) { 138 | for (int j = 0; j < 16; ++j) { 139 | flags[j] += (((array[i]) >> j) & 1); 140 | } 141 | } 142 | 143 | const __m512i *data = (const __m512i *)array; 144 | __m512i v1 = _mm512_setzero_si512(); 145 | __m512i v2 = _mm512_setzero_si512(); 146 | __m512i v4 = _mm512_setzero_si512(); 147 | __m512i v8 = _mm512_setzero_si512(); 148 | __m512i twosA, twosB, foursA, foursB; 149 | __m512i one = _mm512_set1_epi16(1); 150 | __m512i counter[16]; 151 | 152 | const size_t size = len / 32; 153 | const uint64_t limit = size - size % 8; 154 | 155 | uint16_t buffer[32]; 156 | 157 | uint64_t i = 0; 158 | while (i < limit) { 159 | for (size_t i = 0; i < 16; ++i) 160 | counter[i] = _mm512_setzero_si512(); 161 | 162 | size_t thislimit = limit; 163 | if (thislimit - i >= (1 << 16)) 164 | thislimit = i + (1 << 16) - 1; 165 | 166 | for (/**/; i < thislimit; i += 8) { 167 | #define U(pos) \ 168 | { \ 169 | counter[pos] = _mm512_add_epi16( \ 170 | counter[pos], _mm512_and_si512(v8, _mm512_set1_epi16(1))); \ 171 | v8 = _mm512_srli_epi16(v8, 1); \ 172 | } 173 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0), 174 | _mm512_loadu_si512(data + i + 1)); 175 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2), 176 | _mm512_loadu_si512(data + i + 3)); 177 | pospopcnt_csa_avx512(&foursA, &v2, twosA, twosB); 178 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 4), 179 | _mm512_loadu_si512(data + i + 5)); 180 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 6), 181 | _mm512_loadu_si512(data + i + 7)); 182 | pospopcnt_csa_avx512(&foursB, &v2, twosA, twosB); 183 | U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13) 184 | U(14) U(15) // Updates 185 
| pospopcnt_csa_avx512(&v8, &v4, foursA, foursB); 186 | } 187 | // Update the counters after the last iteration. 188 | for (size_t i = 0; i < 16; ++i) 189 | U(i) 190 | #undef U 191 | 192 | for (size_t i = 0; i < 16; ++i) { 193 | _mm512_storeu_si512((__m512i *)buffer, counter[i]); 194 | for (size_t z = 0; z < 32; z++) { 195 | flags[i] += 8 * (flags_type)buffer[z]; 196 | } 197 | } 198 | } 199 | 200 | _mm512_storeu_si512((__m512i *)buffer, v1); 201 | for (size_t i = 0; i < 32; i++) { 202 | for (int j = 0; j < 16; j++) { 203 | flags[j] += 1 * ((buffer[i] >> j) & 1); 204 | } 205 | } 206 | 207 | _mm512_storeu_si512((__m512i *)buffer, v2); 208 | for (size_t i = 0; i < 32; i++) { 209 | for (int j = 0; j < 16; j++) { 210 | flags[j] += 2 * ((buffer[i] >> j) & 1); 211 | } 212 | } 213 | 214 | _mm512_storeu_si512((__m512i *)buffer, v4); 215 | for (size_t i = 0; i < 32; i++) { 216 | for (int j = 0; j < 16; j++) { 217 | flags[j] += 4 * ((buffer[i] >> j) & 1); 218 | } 219 | } 220 | 221 | _mm512_storeu_si512((__m512i *)buffer, v8); 222 | for (size_t i = 0; i < 32; i++) { 223 | for (int j = 0; j < 16; j++) { 224 | flags[j] += 8 * ((buffer[i] >> j) & 1); 225 | } 226 | } 227 | } 228 | 229 | 230 | 231 | // given a stream of len 16-bit words (in array), generates 232 | // a histogram of 16 counts stored in flags, corresponding 233 | // to the number of set bits at the corresponding indexes (0,1,...,15). 
234 | // 235 | // Uses 256B blocks 236 | static void pospopcnt_u16_avx512bw_harvey_seal_256B(const uint16_t *array, 237 | uint32_t len, 238 | flags_type *flags) { 239 | for (uint32_t i = len - (len % (32 * 4)); i < len; ++i) { 240 | for (int j = 0; j < 16; ++j) { 241 | flags[j] += (((array[i]) >> j) & 1); 242 | } 243 | } 244 | 245 | const __m512i *data = (const __m512i *)array; 246 | __m512i v1 = _mm512_setzero_si512(); 247 | __m512i v2 = _mm512_setzero_si512(); 248 | __m512i v4 = _mm512_setzero_si512(); 249 | __m512i twosA, twosB; 250 | __m512i one = _mm512_set1_epi16(1); 251 | __m512i counter[16]; 252 | 253 | const size_t size = len / 32; 254 | const uint64_t limit = size - size % 4; 255 | 256 | uint16_t buffer[32]; 257 | 258 | uint64_t i = 0; 259 | while (i < limit) { 260 | for (size_t i = 0; i < 16; ++i) 261 | counter[i] = _mm512_setzero_si512(); 262 | 263 | size_t thislimit = limit; 264 | if (thislimit - i >= (1 << 16)) 265 | thislimit = i + (1 << 16) - 1; 266 | 267 | for (/**/; i < thislimit; i += 4) { 268 | #define U(pos) \ 269 | { \ 270 | counter[pos] = _mm512_add_epi16( \ 271 | counter[pos], _mm512_and_si512(v4, _mm512_set1_epi16(1))); \ 272 | v4 = _mm512_srli_epi16(v4, 1); \ 273 | } 274 | pospopcnt_csa_avx512(&twosA, &v1, _mm512_loadu_si512(data + i + 0), 275 | _mm512_loadu_si512(data + i + 1)); 276 | pospopcnt_csa_avx512(&twosB, &v1, _mm512_loadu_si512(data + i + 2), 277 | _mm512_loadu_si512(data + i + 3)); 278 | U(0) U(1) U(2) U(3) U(4) U(5) U(6) U(7) U(8) U(9) U(10) U(11) U(12) U(13) 279 | U(14) U(15) // Updates 280 | pospopcnt_csa_avx512(&v4, &v2, twosA, twosB); 281 | } 282 | // Update the counters after the last iteration. 
283 | for (size_t i = 0; i < 16; ++i) 284 | U(i) 285 | #undef U 286 | 287 | for (size_t i = 0; i < 16; ++i) { 288 | _mm512_storeu_si512((__m512i *)buffer, counter[i]); 289 | for (size_t z = 0; z < 32; z++) { 290 | flags[i] += 4 * (flags_type)buffer[z]; 291 | } 292 | } 293 | } 294 | 295 | _mm512_storeu_si512((__m512i *)buffer, v1); 296 | for (size_t i = 0; i < 32; i++) { 297 | for (int j = 0; j < 16; j++) { 298 | flags[j] += 1 * ((buffer[i] >> j) & 1); 299 | } 300 | } 301 | 302 | _mm512_storeu_si512((__m512i *)buffer, v2); 303 | for (size_t i = 0; i < 32; i++) { 304 | for (int j = 0; j < 16; j++) { 305 | flags[j] += 2 * ((buffer[i] >> j) & 1); 306 | } 307 | } 308 | 309 | _mm512_storeu_si512((__m512i *)buffer, v4); 310 | for (size_t i = 0; i < 32; i++) { 311 | for (int j = 0; j < 16; j++) { 312 | flags[j] += 4 * ((buffer[i] >> j) & 1); 313 | } 314 | } 315 | } 316 | 317 | 318 | 319 | #endif // __AVX512BW__ 320 | 321 | #endif // POSPOPCNT_AVX512BW_H 322 | -------------------------------------------------------------------------------- /results.txt: -------------------------------------------------------------------------------- 1 | $ ./script.sh 2 | n = 8388608 m = 1 3 | iterations = 100 4 | array size: 16.000 MB 5 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000 6 | min: 65 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis. 7 | avg: 75.6 cycles, 13.0 instructions, 1.0 branch mis., 0.1 cache ref., 0.1 cache mis. 8 | 9 | == Trial 1 out of 3 10 | 11 | pospopcnt_u16_scalar alignments: 16 12 | instructions per cycle 3.75, cycles per 16-bit word: 17.321, instructions per 16-bit word 65.000 13 | min: 145300780 cycles, 545259670 instructions, 3 branch mis., 342794 cache ref., 227818 cache mis. 14 | avg: 145437051.2 cycles, 545259670.8 instructions, 8.1 branch mis., 343725.7 cache ref., 240702.4 cache mis. 
15 | 0.367 GB/s 16 | estimated clock in range 3.114 GHz to 3.183 GHz 17 | 18 | 19 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 20 | instructions per cycle 0.44, cycles per 16-bit word: 0.510, instructions per 16-bit word 0.226 21 | min: 4275703 cycles, 1896353 instructions, 89 branch mis., 456562 cache ref., 210984 cache mis. 22 | avg: 4435881.5 cycles, 1896353.4 instructions, 114.7 branch mis., 462022.3 cache ref., 224756.1 cache mis. 23 | 12.242 GB/s 24 | estimated clock in range 2.995 GHz to 3.130 GHz 25 | 26 | 27 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 28 | instructions per cycle 0.59, cycles per 16-bit word: 0.553, instructions per 16-bit word 0.323 29 | min: 4636354 cycles, 2713472 instructions, 73 branch mis., 405240 cache ref., 212428 cache mis. 30 | avg: 4740082.5 cycles, 2713472.5 instructions, 99.2 branch mis., 410034.9 cache ref., 221150.8 cache mis. 31 | 11.334 GB/s 32 | estimated clock in range 2.960 GHz to 3.138 GHz 33 | 34 | 35 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 36 | instructions per cycle 0.80, cycles per 16-bit word: 0.650, instructions per 16-bit word 0.518 37 | min: 5455458 cycles, 4348093 instructions, 93 branch mis., 369665 cache ref., 214089 cache mis. 38 | avg: 5685385.1 cycles, 4348093.8 instructions, 100.7 branch mis., 372731.5 cache ref., 227855.8 cache mis. 39 | 9.658 GB/s 40 | estimated clock in range 3.016 GHz to 3.145 GHz 41 | 42 | 43 | == Trial 2 out of 3 44 | 45 | pospopcnt_u16_scalar alignments: 16 46 | instructions per cycle 3.75, cycles per 16-bit word: 17.319, instructions per 16-bit word 65.000 47 | min: 145278186 cycles, 545259670 instructions, 3 branch mis., 342538 cache ref., 228564 cache mis. 48 | avg: 145437966.3 cycles, 545259671.1 instructions, 9.4 branch mis., 343842.9 cache ref., 240921.6 cache mis. 
49 | 0.368 GB/s 50 | estimated clock in range 3.103 GHz to 3.186 GHz 51 | 52 | 53 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 54 | instructions per cycle 0.44, cycles per 16-bit word: 0.510, instructions per 16-bit word 0.226 55 | min: 4280028 cycles, 1896353 instructions, 93 branch mis., 456611 cache ref., 212101 cache mis. 56 | avg: 4427292.3 cycles, 1896353.4 instructions, 118.8 branch mis., 461757.7 cache ref., 223975.2 cache mis. 57 | 12.248 GB/s 58 | estimated clock in range 2.954 GHz to 3.125 GHz 59 | 60 | 61 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 62 | instructions per cycle 0.58, cycles per 16-bit word: 0.554, instructions per 16-bit word 0.323 63 | min: 4646521 cycles, 2713472 instructions, 69 branch mis., 406160 cache ref., 213902 cache mis. 64 | avg: 4782781.6 cycles, 2713472.5 instructions, 99.2 branch mis., 412982.4 cache ref., 226466.0 cache mis. 65 | 11.284 GB/s 66 | estimated clock in range 2.757 GHz to 3.134 GHz 67 | 68 | 69 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 70 | instructions per cycle 0.79, cycles per 16-bit word: 0.653, instructions per 16-bit word 0.518 71 | min: 5476340 cycles, 4348093 instructions, 89 branch mis., 369385 cache ref., 215783 cache mis. 72 | avg: 5675215.1 cycles, 4348093.8 instructions, 99.7 branch mis., 372441.2 cache ref., 227343.4 cache mis. 73 | 9.616 GB/s 74 | estimated clock in range 2.982 GHz to 3.141 GHz 75 | 76 | 77 | == Trial 3 out of 3 78 | 79 | pospopcnt_u16_scalar alignments: 16 80 | instructions per cycle 3.75, cycles per 16-bit word: 17.323, instructions per 16-bit word 65.000 81 | min: 145315376 cycles, 545259670 instructions, 3 branch mis., 342409 cache ref., 225642 cache mis. 82 | avg: 145436719.9 cycles, 545259670.8 instructions, 9.5 branch mis., 343947.5 cache ref., 238966.5 cache mis. 
83 | 0.367 GB/s 84 | estimated clock in range 3.106 GHz to 3.183 GHz 85 | 86 | 87 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 88 | instructions per cycle 0.44, cycles per 16-bit word: 0.508, instructions per 16-bit word 0.226 89 | min: 4264660 cycles, 1896353 instructions, 89 branch mis., 456718 cache ref., 211363 cache mis. 90 | avg: 4409605.7 cycles, 1896353.5 instructions, 110.8 branch mis., 461534.5 cache ref., 222450.5 cache mis. 91 | 12.274 GB/s 92 | estimated clock in range 2.983 GHz to 3.126 GHz 93 | 94 | 95 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 96 | instructions per cycle 0.58, cycles per 16-bit word: 0.555, instructions per 16-bit word 0.323 97 | min: 4654504 cycles, 2713472 instructions, 46 branch mis., 405671 cache ref., 213485 cache mis. 98 | avg: 4764287.9 cycles, 2713472.5 instructions, 96.6 branch mis., 411845.7 cache ref., 224323.1 cache mis. 99 | 11.267 GB/s 100 | estimated clock in range 2.857 GHz to 3.134 GHz 101 | 102 | 103 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 104 | instructions per cycle 0.80, cycles per 16-bit word: 0.649, instructions per 16-bit word 0.518 105 | min: 5441872 cycles, 4348093 instructions, 88 branch mis., 369764 cache ref., 217127 cache mis. 106 | avg: 5697221.4 cycles, 4348093.8 instructions, 100.9 branch mis., 372666.7 cache ref., 228013.5 cache mis. 107 | 9.679 GB/s 108 | estimated clock in range 2.980 GHz to 3.139 GHz 109 | 110 | n = 16777216 m = 1 111 | iterations = 100 112 | array size: 32.000 MB 113 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000 114 | min: 64 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis. 115 | avg: 71.7 cycles, 13.0 instructions, 1.0 branch mis., 0.3 cache ref., 0.0 cache mis. 
116 | 117 | == Trial 1 out of 3 118 | 119 | pospopcnt_u16_scalar alignments: 16 120 | instructions per cycle 3.75, cycles per 16-bit word: 17.327, instructions per 16-bit word 65.000 121 | min: 290698808 cycles, 1090519236 instructions, 5 branch mis., 686676 cache ref., 500793 cache mis. 122 | avg: 290938868.1 cycles, 1090519236.9 instructions, 15.7 branch mis., 688354.2 cache ref., 510683.9 cache mis. 123 | 0.367 GB/s 124 | estimated clock in range 3.129 GHz to 3.183 GHz 125 | 126 | 127 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 128 | instructions per cycle 0.50, cycles per 16-bit word: 0.451, instructions per 16-bit word 0.225 129 | min: 7570267 cycles, 3778030 instructions, 172 branch mis., 911913 cache ref., 466824 cache mis. 130 | avg: 8082887.5 cycles, 3778030.8 instructions, 208.4 branch mis., 918840.2 cache ref., 478513.7 cache mis. 131 | 13.966 GB/s 132 | estimated clock in range 2.991 GHz to 3.156 GHz 133 | 134 | 135 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 136 | instructions per cycle 0.67, cycles per 16-bit word: 0.480, instructions per 16-bit word 0.323 137 | min: 8059538 cycles, 5412301 instructions, 117 branch mis., 832885 cache ref., 472215 cache mis. 138 | avg: 8571650.0 cycles, 5412301.8 instructions, 194.3 branch mis., 844191.3 cache ref., 483240.1 cache mis. 139 | 13.091 GB/s 140 | estimated clock in range 3.005 GHz to 3.155 GHz 141 | 142 | 143 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 144 | instructions per cycle 0.92, cycles per 16-bit word: 0.565, instructions per 16-bit word 0.518 145 | min: 9471909 cycles, 8685323 instructions, 179 branch mis., 750308 cache ref., 476978 cache mis. 146 | avg: 9987220.5 cycles, 8685323.0 instructions, 193.4 branch mis., 753474.4 cache ref., 491033.0 cache mis. 
147 | 11.186 GB/s 148 | estimated clock in range 2.987 GHz to 3.161 GHz 149 | 150 | 151 | == Trial 2 out of 3 152 | 153 | pospopcnt_u16_scalar alignments: 16 154 | instructions per cycle 3.75, cycles per 16-bit word: 17.329, instructions per 16-bit word 65.000 155 | min: 290733066 cycles, 1090519236 instructions, 4 branch mis., 686103 cache ref., 499278 cache mis. 156 | avg: 290946935.9 cycles, 1090519236.8 instructions, 13.0 branch mis., 687911.4 cache ref., 509925.4 cache mis. 157 | 0.368 GB/s 158 | estimated clock in range 3.127 GHz to 3.185 GHz 159 | 160 | 161 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 162 | instructions per cycle 0.50, cycles per 16-bit word: 0.453, instructions per 16-bit word 0.225 163 | min: 7600302 cycles, 3778030 instructions, 180 branch mis., 912316 cache ref., 465633 cache mis. 164 | avg: 8088476.5 cycles, 3778030.7 instructions, 202.9 branch mis., 918700.2 cache ref., 477476.4 cache mis. 165 | 13.892 GB/s 166 | estimated clock in range 3.018 GHz to 3.155 GHz 167 | 168 | 169 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 170 | instructions per cycle 0.72, cycles per 16-bit word: 0.449, instructions per 16-bit word 0.323 171 | min: 7531365 cycles, 5412301 instructions, 155 branch mis., 827062 cache ref., 467937 cache mis. 172 | avg: 8587747.1 cycles, 5412301.8 instructions, 194.7 branch mis., 842218.7 cache ref., 480786.8 cache mis. 173 | 13.457 GB/s 174 | estimated clock in range 3.021 GHz to 3.157 GHz 175 | 176 | 177 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 178 | instructions per cycle 0.92, cycles per 16-bit word: 0.561, instructions per 16-bit word 0.518 179 | min: 9403750 cycles, 8685323 instructions, 177 branch mis., 749685 cache ref., 476669 cache mis. 180 | avg: 10080203.3 cycles, 8685323.0 instructions, 190.7 branch mis., 752866.7 cache ref., 489090.3 cache mis. 
181 | 11.265 GB/s 182 | estimated clock in range 3.103 GHz to 3.160 GHz 183 | 184 | 185 | == Trial 3 out of 3 186 | 187 | pospopcnt_u16_scalar alignments: 16 188 | instructions per cycle 3.75, cycles per 16-bit word: 17.329, instructions per 16-bit word 65.000 189 | min: 290740327 cycles, 1090519236 instructions, 3 branch mis., 684978 cache ref., 496434 cache mis. 190 | avg: 290947846.2 cycles, 1090519236.7 instructions, 12.6 branch mis., 687683.9 cache ref., 509124.1 cache mis. 191 | 0.367 GB/s 192 | estimated clock in range 3.142 GHz to 3.184 GHz 193 | 194 | 195 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 196 | instructions per cycle 0.50, cycles per 16-bit word: 0.452, instructions per 16-bit word 0.225 197 | min: 7577793 cycles, 3778030 instructions, 175 branch mis., 911261 cache ref., 466200 cache mis. 198 | avg: 8120387.0 cycles, 3778030.7 instructions, 215.3 branch mis., 919429.1 cache ref., 480036.4 cache mis. 199 | 13.928 GB/s 200 | estimated clock in range 3.003 GHz to 3.153 GHz 201 | 202 | 203 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 204 | instructions per cycle 0.67, cycles per 16-bit word: 0.481, instructions per 16-bit word 0.323 205 | min: 8068715 cycles, 5412301 instructions, 160 branch mis., 835367 cache ref., 469650 cache mis. 206 | avg: 8566613.3 cycles, 5412301.9 instructions, 197.6 branch mis., 843327.4 cache ref., 482458.9 cache mis. 207 | 13.104 GB/s 208 | estimated clock in range 3.062 GHz to 3.156 GHz 209 | 210 | 211 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 212 | instructions per cycle 0.92, cycles per 16-bit word: 0.565, instructions per 16-bit word 0.518 213 | min: 9487169 cycles, 8685323 instructions, 175 branch mis., 750225 cache ref., 477681 cache mis. 214 | avg: 10032864.4 cycles, 8685323.1 instructions, 192.5 branch mis., 753271.8 cache ref., 490210.8 cache mis. 
215 | 11.161 GB/s 216 | estimated clock in range 2.993 GHz to 3.159 GHz 217 | 218 | n = 33554432 m = 1 219 | iterations = 100 220 | array size: 64.000 MB 221 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000 222 | min: 65 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis. 223 | avg: 70.7 cycles, 13.0 instructions, 1.0 branch mis., 0.4 cache ref., 0.0 cache mis. 224 | 225 | == Trial 1 out of 3 226 | 227 | pospopcnt_u16_scalar alignments: 16 228 | instructions per cycle 3.75, cycles per 16-bit word: 17.329, instructions per 16-bit word 65.000 229 | min: 581477149 cycles, 2181038367 instructions, 3 branch mis., 1371669 cache ref., 1033291 cache mis. 230 | avg: 581933574.2 cycles, 2181038369.1 instructions, 19.5 branch mis., 1375421.4 cache ref., 1042079.4 cache mis. 231 | 0.367 GB/s 232 | estimated clock in range 3.142 GHz to 3.181 GHz 233 | 234 | 235 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 236 | instructions per cycle 0.55, cycles per 16-bit word: 0.407, instructions per 16-bit word 0.225 237 | min: 13653935 cycles, 7541384 instructions, 134 branch mis., 1816669 cache ref., 969367 cache mis. 238 | avg: 14379394.1 cycles, 7541384.5 instructions, 348.5 branch mis., 1825994.9 cache ref., 980377.7 cache mis. 239 | 15.407 GB/s 240 | estimated clock in range 3.037 GHz to 3.167 GHz 241 | 242 | 243 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 244 | instructions per cycle 0.73, cycles per 16-bit word: 0.439, instructions per 16-bit word 0.322 245 | min: 14742562 cycles, 10809959 instructions, 138 branch mis., 1695219 cache ref., 976523 cache mis. 246 | avg: 15366151.7 cycles, 10809960.0 instructions, 358.6 branch mis., 1702361.3 cache ref., 988884.0 cache mis. 
247 | 14.399 GB/s 248 | estimated clock in range 3.069 GHz to 3.167 GHz 249 | 250 | 251 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 252 | instructions per cycle 1.00, cycles per 16-bit word: 0.517, instructions per 16-bit word 0.517 253 | min: 17349696 cycles, 17359781 instructions, 340 branch mis., 1507483 cache ref., 994974 cache mis. 254 | avg: 17986706.1 cycles, 17359781.6 instructions, 366.0 branch mis., 1512351.4 cache ref., 1007333.2 cache mis. 255 | 12.251 GB/s 256 | estimated clock in range 3.099 GHz to 3.171 GHz 257 | 258 | 259 | == Trial 2 out of 3 260 | 261 | pospopcnt_u16_scalar alignments: 16 262 | instructions per cycle 3.75, cycles per 16-bit word: 17.331, instructions per 16-bit word 65.000 263 | min: 581521999 cycles, 2181038367 instructions, 3 branch mis., 1372754 cache ref., 1031789 cache mis. 264 | avg: 581909568.5 cycles, 2181038370.3 instructions, 5.0 branch mis., 1376427.2 cache ref., 1040231.2 cache mis. 265 | 0.367 GB/s 266 | estimated clock in range 3.154 GHz to 3.182 GHz 267 | 268 | 269 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 270 | instructions per cycle 0.55, cycles per 16-bit word: 0.407, instructions per 16-bit word 0.225 271 | min: 13656404 cycles, 7541384 instructions, 339 branch mis., 1811137 cache ref., 973011 cache mis. 272 | avg: 14385482.8 cycles, 7541384.5 instructions, 382.7 branch mis., 1825741.2 cache ref., 982132.8 cache mis. 273 | 15.450 GB/s 274 | estimated clock in range 2.982 GHz to 3.167 GHz 275 | 276 | 277 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 278 | instructions per cycle 0.73, cycles per 16-bit word: 0.439, instructions per 16-bit word 0.322 279 | min: 14740976 cycles, 10809959 instructions, 135 branch mis., 1690517 cache ref., 982609 cache mis. 280 | avg: 15344071.8 cycles, 10809959.9 instructions, 354.4 branch mis., 1703487.1 cache ref., 990562.1 cache mis. 
281 | 14.362 GB/s 282 | estimated clock in range 3.025 GHz to 3.167 GHz 283 | 284 | 285 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 286 | instructions per cycle 1.00, cycles per 16-bit word: 0.518, instructions per 16-bit word 0.517 287 | min: 17396176 cycles, 17359781 instructions, 341 branch mis., 1508967 cache ref., 997769 cache mis. 288 | avg: 17973735.3 cycles, 17359781.4 instructions, 367.4 branch mis., 1513286.9 cache ref., 1008995.7 cache mis. 289 | 12.215 GB/s 290 | estimated clock in range 3.034 GHz to 3.170 GHz 291 | 292 | 293 | == Trial 3 out of 3 294 | 295 | pospopcnt_u16_scalar alignments: 16 296 | instructions per cycle 3.75, cycles per 16-bit word: 17.331, instructions per 16-bit word 65.000 297 | min: 581529449 cycles, 2181038367 instructions, 3 branch mis., 1374422 cache ref., 1033591 cache mis. 298 | avg: 581922860.8 cycles, 2181038371.0 instructions, 4.9 branch mis., 1377283.2 cache ref., 1041910.3 cache mis. 299 | 0.367 GB/s 300 | estimated clock in range 3.140 GHz to 3.181 GHz 301 | 302 | 303 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 304 | instructions per cycle 0.56, cycles per 16-bit word: 0.404, instructions per 16-bit word 0.225 305 | min: 13543586 cycles, 7541384 instructions, 352 branch mis., 1817110 cache ref., 969972 cache mis. 306 | avg: 14392309.3 cycles, 7541384.4 instructions, 390.9 branch mis., 1825960.7 cache ref., 980936.2 cache mis. 307 | 15.453 GB/s 308 | estimated clock in range 3.000 GHz to 3.165 GHz 309 | 310 | 311 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 312 | instructions per cycle 0.73, cycles per 16-bit word: 0.441, instructions per 16-bit word 0.322 313 | min: 14789211 cycles, 10809959 instructions, 195 branch mis., 1695835 cache ref., 977227 cache mis. 314 | avg: 15385457.1 cycles, 10809959.7 instructions, 352.7 branch mis., 1702342.3 cache ref., 988574.6 cache mis. 
315 | 14.313 GB/s 316 | estimated clock in range 3.023 GHz to 3.166 GHz 317 | 318 | 319 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 320 | instructions per cycle 1.00, cycles per 16-bit word: 0.518, instructions per 16-bit word 0.517 321 | min: 17366428 cycles, 17359781 instructions, 342 branch mis., 1508192 cache ref., 997052 cache mis. 322 | avg: 17946271.6 cycles, 17359781.7 instructions, 368.0 branch mis., 1512860.9 cache ref., 1007786.9 cache mis. 323 | 12.235 GB/s 324 | estimated clock in range 3.060 GHz to 3.169 GHz 325 | -------------------------------------------------------------------------------- /script.sh: -------------------------------------------------------------------------------- 1 | for i in {1..26} 2 | do 3 | ./instrumented_benchmark -c -v -n $((2**i)) 4 | ./instrumented_benchmark -c -v -n $((2**(i-1)*3)) 5 | done 6 | -------------------------------------------------------------------------------- /smallresults.txt: -------------------------------------------------------------------------------- 1 | n = 1048576 m = 1 2 | iterations = 100 3 | array size: 2.000 MB 4 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000 5 | min: 65 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis. 6 | avg: 78.2 cycles, 13.0 instructions, 1.0 branch mis., 0.1 cache ref., 0.1 cache mis. 7 | 8 | == Trial 1 out of 3 9 | 10 | pospopcnt_u16_scalar alignments: 16 11 | instructions per cycle 3.76, cycles per 16-bit word: 17.282, instructions per 16-bit word 65.000 12 | min: 18121980 cycles, 68157535 instructions, 2 branch mis., 42216 cache ref., 0 cache mis. 13 | avg: 18193433.7 cycles, 68157535.8 instructions, 3.2 branch mis., 42874.5 cache ref., 2131.4 cache mis. 
14 | 0.368 GB/s 15 | estimated clock in range 3.014 GHz to 3.179 GHz 16 | 17 | 18 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 19 | instructions per cycle 1.89, cycles per 16-bit word: 0.127, instructions per 16-bit word 0.240 20 | min: 133206 cycles, 251608 instructions, 26 branch mis., 41303 cache ref., 0 cache mis. 21 | avg: 136291.4 cycles, 251608.0 instructions, 53.7 branch mis., 41590.3 cache ref., 37.4 cache mis. 22 | 35.961 GB/s 23 | estimated clock in range 1.248 GHz to 2.318 GHz 24 | 25 | 26 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 27 | instructions per cycle 2.49, cycles per 16-bit word: 0.135, instructions per 16-bit word 0.337 28 | min: 141993 cycles, 353463 instructions, 25 branch mis., 42463 cache ref., 0 cache mis. 29 | avg: 149097.1 cycles, 353463.1 instructions, 29.5 branch mis., 42750.8 cache ref., 57.8 cache mis. 30 | 34.328 GB/s 31 | estimated clock in range 1.748 GHz to 2.526 GHz 32 | 33 | 34 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 35 | instructions per cycle 2.41, cycles per 16-bit word: 0.220, instructions per 16-bit word 0.529 36 | min: 230490 cycles, 554484 instructions, 26 branch mis., 42979 cache ref., 0 cache mis. 37 | avg: 232864.8 cycles, 554484.1 instructions, 32.9 branch mis., 43112.4 cache ref., 25.7 cache mis. 38 | 23.554 GB/s 39 | estimated clock in range 1.833 GHz to 2.612 GHz 40 | 41 | 42 | == Trial 2 out of 3 43 | 44 | pospopcnt_u16_scalar alignments: 16 45 | instructions per cycle 3.76, cycles per 16-bit word: 17.284, instructions per 16-bit word 65.000 46 | min: 18123412 cycles, 68157535 instructions, 2 branch mis., 42640 cache ref., 0 cache mis. 47 | avg: 18176547.9 cycles, 68157535.8 instructions, 3.0 branch mis., 42931.6 cache ref., 2825.7 cache mis. 
48 | 0.368 GB/s 49 | estimated clock in range 3.023 GHz to 3.179 GHz 50 | 51 | 52 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 53 | instructions per cycle 1.89, cycles per 16-bit word: 0.127, instructions per 16-bit word 0.240 54 | min: 133219 cycles, 251608 instructions, 26 branch mis., 41361 cache ref., 0 cache mis. 55 | avg: 136222.5 cycles, 251608.0 instructions, 47.7 branch mis., 41622.8 cache ref., 34.3 cache mis. 56 | 35.817 GB/s 57 | estimated clock in range 1.532 GHz to 2.287 GHz 58 | 59 | 60 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 61 | instructions per cycle 2.49, cycles per 16-bit word: 0.135, instructions per 16-bit word 0.337 62 | min: 141952 cycles, 353463 instructions, 28 branch mis., 42541 cache ref., 0 cache mis. 63 | avg: 145220.6 cycles, 353463.0 instructions, 32.3 branch mis., 42705.6 cache ref., 42.5 cache mis. 64 | 34.098 GB/s 65 | estimated clock in range 1.477 GHz to 2.455 GHz 66 | 67 | 68 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 69 | instructions per cycle 2.40, cycles per 16-bit word: 0.220, instructions per 16-bit word 0.529 70 | min: 230683 cycles, 554484 instructions, 24 branch mis., 42957 cache ref., 0 cache mis. 71 | avg: 232969.5 cycles, 554484.1 instructions, 29.4 branch mis., 43117.6 cache ref., 20.7 cache mis. 72 | 23.439 GB/s 73 | estimated clock in range 1.700 GHz to 2.616 GHz 74 | 75 | 76 | == Trial 3 out of 3 77 | 78 | pospopcnt_u16_scalar alignments: 16 79 | instructions per cycle 3.76, cycles per 16-bit word: 17.300, instructions per 16-bit word 65.000 80 | min: 18140522 cycles, 68157535 instructions, 2 branch mis., 42586 cache ref., 0 cache mis. 81 | avg: 18176543.7 cycles, 68157535.8 instructions, 3.0 branch mis., 42930.7 cache ref., 1025.8 cache mis. 
82 | 0.367 GB/s 83 | estimated clock in range 3.041 GHz to 3.179 GHz 84 | 85 | 86 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 87 | instructions per cycle 1.89, cycles per 16-bit word: 0.127, instructions per 16-bit word 0.240 88 | min: 132920 cycles, 251608 instructions, 25 branch mis., 41335 cache ref., 0 cache mis. 89 | avg: 136227.6 cycles, 251608.0 instructions, 48.2 branch mis., 41611.7 cache ref., 30.9 cache mis. 90 | 35.520 GB/s 91 | estimated clock in range 1.256 GHz to 2.343 GHz 92 | 93 | 94 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 95 | instructions per cycle 2.49, cycles per 16-bit word: 0.136, instructions per 16-bit word 0.337 96 | min: 142113 cycles, 353463 instructions, 19 branch mis., 42518 cache ref., 0 cache mis. 97 | avg: 144527.0 cycles, 353463.1 instructions, 30.8 branch mis., 42726.6 cache ref., 27.5 cache mis. 98 | 33.835 GB/s 99 | estimated clock in range 1.605 GHz to 2.381 GHz 100 | 101 | 102 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 103 | instructions per cycle 2.40, cycles per 16-bit word: 0.220, instructions per 16-bit word 0.529 104 | min: 230802 cycles, 554484 instructions, 23 branch mis., 42936 cache ref., 0 cache mis. 105 | avg: 236280.8 cycles, 554484.1 instructions, 30.5 branch mis., 43137.8 cache ref., 59.7 cache mis. 106 | 23.320 GB/s 107 | estimated clock in range 1.746 GHz to 2.625 GHz 108 | 109 | -------------------------------------------------------------------------------- /verylargeresults.txt: -------------------------------------------------------------------------------- 1 | $ ./instrumented_benchmark -v -n $((2**29)) 2 | n = 536870912 m = 1 3 | iterations = 100 4 | array size: 1024.000 MB 5 | nothing instructions per cycle 0.20, cycles per 16-bit word: 0.000, instructions per 16-bit word 0.000 6 | min: 65 cycles, 13 instructions, 1 branch mis., 0 cache ref., 0 cache mis. 7 | avg: 67.5 cycles, 13.0 instructions, 1.0 branch mis., 0.1 cache ref., 0.1 cache mis. 
8 | 9 | == Trial 1 out of 3 10 | 11 | pospopcnt_u16_scalar alignments: 16 12 | instructions per cycle 3.75, cycles per 16-bit word: 17.334, instructions per 16-bit word 65.000 13 | min: 9306366083 cycles, 34896612300 instructions, 6 branch mis., 21927294 cache ref., 16810210 cache mis. 14 | avg: 9310655628.7 cycles, 34896612310.5 instructions, 19.2 branch mis., 22002041.1 cache ref., 16810510.8 cache mis. 15 | 0.367 GB/s 16 | estimated clock in range 3.166 GHz to 3.179 GHz 17 | 18 | 19 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 20 | instructions per cycle 0.62, cycles per 16-bit word: 0.361, instructions per 16-bit word 0.224 21 | min: 193647030 cycles, 120441986 instructions, 2400 branch mis., 28808476 cache ref., 15793002 cache mis. 22 | avg: 194467110.1 cycles, 120441986.5 instructions, 4906.4 branch mis., 28951104.1 cache ref., 15828987.5 cache mis. 23 | 17.629 GB/s 24 | estimated clock in range 3.107 GHz to 3.182 GHz 25 | 26 | 27 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 28 | instructions per cycle 0.82, cycles per 16-bit word: 0.391, instructions per 16-bit word 0.322 29 | min: 209898674 cycles, 172739686 instructions, 3081 branch mis., 27373286 cache ref., 15922807 cache mis. 30 | avg: 210382318.2 cycles, 172739686.4 instructions, 5058.5 branch mis., 27406810.1 cache ref., 15934282.6 cache mis. 31 | 16.270 GB/s 32 | estimated clock in range 3.127 GHz to 3.184 GHz 33 | 34 | 35 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 36 | instructions per cycle 1.11, cycles per 16-bit word: 0.467, instructions per 16-bit word 0.517 37 | min: 250858524 cycles, 277593519 instructions, 4980 branch mis., 24257628 cache ref., 16300967 cache mis. 38 | avg: 251285505.1 cycles, 277593520.3 instructions, 5401.8 branch mis., 24285567.7 cache ref., 16315059.9 cache mis. 
39 | 13.622 GB/s 40 | estimated clock in range 3.128 GHz to 3.184 GHz 41 | 42 | 43 | == Trial 2 out of 3 44 | 45 | pospopcnt_u16_scalar alignments: 16 46 | instructions per cycle 3.75, cycles per 16-bit word: 17.337, instructions per 16-bit word 65.000 47 | min: 9307710055 cycles, 34896612299 instructions, 7 branch mis., 21966489 cache ref., 16810382 cache mis. 48 | avg: 9311309436.1 cycles, 34896612309.0 instructions, 19.6 branch mis., 22003978.2 cache ref., 16810540.1 cache mis. 49 | 0.367 GB/s 50 | estimated clock in range 3.167 GHz to 3.179 GHz 51 | 52 | 53 | pospopcnt_u16_avx512bw_harvey_seal_1KB alignments: 16 54 | instructions per cycle 0.62, cycles per 16-bit word: 0.361, instructions per 16-bit word 0.224 55 | min: 193904414 cycles, 120441986 instructions, 3513 branch mis., 28916800 cache ref., 15819247 cache mis. 56 | avg: 194369533.6 cycles, 120441986.4 instructions, 5209.1 branch mis., 28952621.8 cache ref., 15829349.3 cache mis. 57 | 17.612 GB/s 58 | estimated clock in range 3.120 GHz to 3.185 GHz 59 | 60 | 61 | pospopcnt_u16_avx512bw_harvey_seal_512B alignments: 16 62 | instructions per cycle 0.82, cycles per 16-bit word: 0.391, instructions per 16-bit word 0.322 63 | min: 209768033 cycles, 172739686 instructions, 5176 branch mis., 27390175 cache ref., 15923841 cache mis. 64 | avg: 210364100.0 cycles, 172739686.4 instructions, 5619.1 branch mis., 27407305.2 cache ref., 15933567.4 cache mis. 65 | 16.280 GB/s 66 | estimated clock in range 3.124 GHz to 3.185 GHz 67 | 68 | 69 | pospopcnt_u16_avx512bw_harvey_seal_256B alignments: 16 70 | instructions per cycle 1.11, cycles per 16-bit word: 0.467, instructions per 16-bit word 0.517 71 | min: 250929542 cycles, 277593519 instructions, 4854 branch mis., 24167020 cache ref., 16304010 cache mis. 72 | avg: 251337691.8 cycles, 277593520.6 instructions, 5129.5 branch mis., 24283613.6 cache ref., 16315664.5 cache mis. 
73 | 13.628 GB/s 74 | estimated clock in range 3.125 GHz to 3.185 GHz 75 | --------------------------------------------------------------------------------