├── LICENSE
├── README.md
├── cyccnt
│   ├── Makefile
│   └── cyccnt.c
└── test_cycle.c

/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# CPU Cycle Counter

Most modern CPUs have an in-silicon cycle counter which, as the name
indicates, counts the number of clock cycles elapsed since some
arbitrary instant. It shall be noted that such counters are often
distinct from "time stamp counters". For instance, x86 CPUs have
featured a sort-of cycle counter, which can be read with the `rdtsc`
opcode, since the original Pentium (introduced in 1993). Such time stamp
counters originally matched the cycle counter, but this is no longer
true now that frequency scaling is a thing: CPUs will commonly lower or
raise their operating frequency depending on current load and
operational conditions (temperature, input voltage...), while
maintaining a fixed update frequency for the time stamp counter.
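For illustration, reading the time stamp counter requires no special
privilege on x86; here is a minimal sketch for GCC or Clang (the helper
name `time_stamp` is mine, not part of this repository):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Read the x86 time stamp counter. On modern CPUs this ticks at a
   fixed rate regardless of the current core frequency, so it is not
   a count of actual clock cycles. */
static inline uint64_t
time_stamp(void)
{
	return __rdtsc();
}
```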
"Normal" applications are expected to use the time stamp counter, which
is convenient for time synchronization tasks (e.g. ensuring that video
frames show up on the screen at the right time) since the time stamp
counter is expected to track the time experienced in the physical world.
Accessing the cycle counter is useful for debugging and optimization
purposes, specifically for working out small, tight loops; this is the
case, for instance, when implementing cryptographic primitives.

While the in-CPU cycle counter exists, access to it is normally forbidden
by the operating system for "security reasons"; namely, direct and
precise access to that counter helps in running timing attacks that
would allow an unprivileged process to extract secret values from other
processes running as different users. Of course, this attack model makes
sense only on multi-user machines; if you are trying to optimize a tight
loop on a test system which is sitting on your desk, then chances are
that there is a single user of that test system and that's you.
Nevertheless, operating system vendors are trying real hard to prevent
direct access to the cycle counter, and explain that if you want
performance counters then you should use the performance counter APIs,
which are a set of system calls and ad hoc structures by which you can
have the kernel read such counters for you, and report back. [In the
case of Linux](https://web.eece.maine.edu/~vweaver/projects/perf_events/),
this really entails using a specific system call to set up some special
file descriptors; the value of any performance counter can then either be
read with a `read()` system call, or possibly sampled at a periodic
frequency (e.g. 1000 Hz) and transmitted back to the userland process
via some shared memory. On other operating systems, a similar but of
course incompatible and possibly undocumented API can achieve about the
same result (e.g. on
[macOS](https://gist.github.com/ibireme/173517c208c7dc333ba962c1f0d67d12)).
Of course, access to these monitoring system calls normally requires
superuser privilege.
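To give an idea of what that looks like on Linux, here is a minimal
sketch (error handling elided; the helper names are illustrative, not
part of this repository) that opens a cycle counter through the
`perf_event_open()` system call and reads it back:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a file descriptor counting CPU cycles for the calling process
   (pid = 0) on any CPU (cpu = -1). There is no glibc wrapper for
   perf_event_open(), hence the raw syscall. */
static int
open_cycle_counter(void)
{
	struct perf_event_attr attr;
	memset(&attr, 0, sizeof attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof attr;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;
	return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Each read costs a full system call, which is exactly why this
   interface is inadequate for timing short instruction sequences. */
static uint64_t
read_cycle_counter(int fd)
{
	uint64_t x = 0;
	(void)read(fd, &x, sizeof x);
	return x;
}
```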
For tight loop optimization, which can be quite hairy in a world where
CPUs implement out-of-order execution, the performance monitoring system
calls are inadequate, since we really are interested in some short
sequences whose execution uses less time than an average system call,
let alone the million or so clock cycles that would elapse between two
regular sampling events. For these tasks, you really need to read the
cycle counter directly from userland, with the smallest possible
sequence of instructions. This repository contains notes and some extra
software (e.g. a Linux kernel module) to do just that.

Remember that:

  - Interpretation of a cycle count is delicate. You really need to
    consider both bandwidth and latency of instructions, and how they
    get executed on your CPU. For x86 machines, I recommend perusing
    Agner Fog's [optimization manuals](https://agner.org/optimize/),
    especially the [microarchitecture
    description](https://agner.org/optimize/microarchitecture.pdf).

  - Allowing access to performance counters *does* allow some precise
    timing attacks -- to be precise, these attacks were mostly feasible
    without the performance counters, but become quite a bit easier. So,
    don't do that on a multi-tenant machine (though the wisdom of
    multi-tenant machines on modern hardware is a bit questionable these
    days).

  - On some of the smallest embedded systems (microcontrollers),
    maintaining the cycle counter can incur a noticeable increase in
    power draw; you'd prefer to reserve that for development boards, not
    production hardware.

  - Cycle counts are per-core. If the operating system decides to
    migrate your process from one core to another, then cycle counts
    will apparently "jump". This is in general not much of a problem for
    development, because the OS tries not to migrate processes (it's
    expensive). You can instruct the OS to "tie" a thread to a specific
    CPU core with the
    [`sched_setaffinity()`](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html)
    system call (see the sketch after this list); you will want to do
    that if you have asymmetric hardware with "efficiency" and
    "performance" cores, so that you can bench performance on either
    type of core.
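A minimal pinning sketch for Linux (the helper name `pin_to_core` is
mine, not part of this repository):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread (pid = 0) to the given CPU core. Returns 0
   on success, -1 on error. */
static int
pin_to_core(int core)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return sched_setaffinity(0, sizeof set, &set);
}
```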
The [`test_cycle.c`](test_cycle.c) file is a demonstration application
that more-or-less expects to run on a Linux-like system; it uses the
cycle counter to benchmark the speed of integer multiplications. Compile
it and use it with a numerical parameter:

```
$ clang -W -Wextra -O2 -o test_cycle test_cycle.c
$ ./test_cycle 3
32x32->32 muls:   2.000
64x64->64 muls:   4.025
64x64->128 muls:   5.499
(225)
```

The parameter should be 0, 1 or 3; with 0 or 1, you mostly bench the
speed of multiplications by 0 or 1, and with 3, you mostly bench the
speed of multiplications by "large values". Here, on an ARM Cortex-A76
CPU, we see that 32-bit multiplications complete in 2 cycles, while
64-bit multiplications complete in 4 cycles for the low output word, and
5 cycles for the high output word (the extra ".499" is some loop
overhead). On that particular CPU, these multiplications are
constant-time; this is not the case on, for instance, an ARM Cortex-A55:

```
$ ./test_cycle 1
32x32->32 muls:   3.001
64x64->64 muls:   4.001
64x64->128 muls:   6.127
(0)
$ ./test_cycle 3
32x32->32 muls:   3.001
64x64->64 muls:   5.001
64x64->128 muls:   6.127
(225)
```

We see here that on the A55, 64-bit multiplications return the low word
of the output earlier when the operands are mathematically small enough
(and this is a problem for cryptographic schemes; the early return may
allow secret-revealing timing attacks).

If you try this program on your machine, then chances are that it will
crash with an "illegal instruction" error, or something similar. This is
because access to the cycle counter must first be allowed, which requires
at least superuser access, as described below.

# x86

On x86 CPUs, the cycle counter can be read with the `rdpmc` opcode, using
the register `0x40000001`. A typical access would use this function:

```c
static inline uint64_t
core_cycles(void)
{
	_mm_lfence();
	return __rdpmc(0x40000001);
}
```

Note the use of a memory fence to ensure that the instruction executes in
a reasonably predictable position within the instruction sequence.

On a Linux system, to allow `rdpmc` to complete without crashing, the
access must first be authorized by root with:

```
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
```

Once this has been done, unprivileged userland processes can read the
cycle counter without crashing. The setting "sticks" until the next
reboot. On my test/research system (running Ubuntu), I do that
automatically at boot time with an `@reboot` crontab.
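For instance, a line like the following in root's crontab (edited with
`sudo crontab -e`) would re-apply the setting at each boot:

```
@reboot echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
```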
I don't have any matching solution for Windows (sorry). On macOS the
situation is hardly better: you would need a kernel extension (macOS's
notion of a kernel module), and rumour has it that there is one
somewhere in [Intel Performance Counter
Monitor](https://github.com/intel/pcm), but I have not tried it. Also,
macOS makes it difficult to load custom kernel extensions: you have to
reboot into some recovery OS and disable some checks. Also note that
virtual machines won't save you here; regardless of what you do in a VM,
you won't be able to access the cycle counter if the host does not
agree.

# ARMv8

On ARMv8 (ARMv8-A in 64-bit mode, aka "aarch64" or "arm64"), the cycle
counter is read with a bit of inline assembly:

```c
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("dsb sy\n\tmrs %0, pmccntr_el0" : "=r" (x) : : );
	return x;
}
```

The `dsb` opcode is a memory fence, just like in the x86 case. For the
cycle counter to be accessible, it must be enabled, and unprivileged
reads must be authorized; both operations must be done in supervisor
mode (i.e. in the kernel). On Linux, you can use the [cyccnt](cyccnt)
custom module from this repository. Namely (the commands are summarized
after this list):

  - You need to install the kernel headers that match your kernel.
    Possibly this is already done; otherwise, your distribution may
    provide a convenient package that is kept in sync with the kernel
    itself. On Ubuntu, try the `linux-headers-generic` package.

  - Go to the `cyccnt` directory and type `make`. If all goes well it
    should produce some files, including the module itself, which will
    be called `cyccnt.ko`.

  - Loading the module is done with `insmod cyccnt.ko`. This is where
    root is needed, so presumably type `sudo insmod cyccnt.ko`.
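In short, assuming the kernel headers are already installed:

```
$ cd cyccnt
$ make
$ sudo insmod cyccnt.ko
```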
Once the module is loaded, you can see in the kernel messages (`dmesg`
command) something like:

```
[  901.256454] enable pmccntr_el0 on CPU 0
[  901.256454] enable pmccntr_el0 on CPU 1
[  901.256454] enable pmccntr_el0 on CPU 2
[  901.256454] enable pmccntr_el0 on CPU 3
```

which means that the module was indeed loaded and enabled the cycle
counter on all four CPU cores. The access remains allowed until the
module is unloaded (with `rmmod`), or the next reboot.

**Idle states:** on my Raspberry Pi 5, running Ubuntu 24.04, this
is sufficient. On another system, an ODROID C4 running an older kernel
(version "4.9.337-13"), the CPU cores happen to "forget" their setting
whenever they go idle, which happens often. On that system, the CPU idle
state must be disabled on each core (the sequence is: disable the idle
state, *then* load the kernel module):

```
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state1/disable
```

I don't know if that is a quirk of that particular hardware, or of the
older kernel version.

# RISC-V

On a 64-bit RISC-V system, the cycle counter is read with the `rdcycle`
instruction:

```c
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("rdcycle %0" : "=r" (x));
	return x;
}
```

Note that there is no memory fence here; the [RISC-V instruction set
manuals](https://lf-riscv.atlassian.net/wiki/spaces/HOME/pages/16154769/RISC-V+Technical+Specifications)
sort of explain that the CPU is supposed to enforce proper synchronization
of these instructions with regard to the sequence of instructions that the
core executes, so that no memory fence should be needed on cores with
out-of-order execution. I am not entirely sure that I read it correctly
and I do not have a RISC-V CPU with out-of-order execution to test it on.

For accessing the cycle counter, the situation is similar to that of
ARMv8: the `cyccnt` kernel module should be used. The process actually
needs the cooperation of both the supervisor (kernel) level and the
machine level (the hypervisor). The kernel can ask the hypervisor to
enable some performance counters through an API called
[SBI](https://lists.riscv.org/g/tech-brs/attachment/361/0/riscv-sbi.pdf).
In `cyccnt` this is done with the `sbi_ecall()` function; it has been
tested on a grand total of one (1) test board (a [StarFive VisionFive
2](https://www.starfivetech.com/en/site/boards) using Ubuntu 24.04).

Here again, the access remains allowed until the module is unloaded or
the system is rebooted.
--------------------------------------------------------------------------------
/cyccnt/Makefile:
--------------------------------------------------------------------------------
obj-m := cyccnt.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
--------------------------------------------------------------------------------
/cyccnt/cyccnt.c:
--------------------------------------------------------------------------------
/*
 * This module activates the in-CPU cycle counter and enables direct access
 * from unprivileged (userland) code, for benchmark purposes. It can be used
 * on ARMv8 (aarch64) and some RISC-V (riscv64) systems.
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>

#if defined(__aarch64__)

static void
enable_counter(void *data)
{
	/*
	 * ARMv8: the counter is PMCCNTR_EL0. Its behaviour and access is
	 * controlled by several other registers:
	 *
	 * PMINTENCLR_EL1
	 *    Controls generation of interrupts on cycle counter overflow.
	 *    We do not want such interrupts; to disable them, we write
	 *    1 in bit 31 of the register.
	 *
	 * PMCNTENSET_EL0
	 *    Controls whether the cycle counter is active. We enable the
	 *    cycle counter by writing 1 in bit 31.
	 *
	 * PMUSERENR_EL0
	 *    Controls whether reading from the cycle counter is allowed
	 *    from userland. Either bit 0 or bit 2 needs to be set. Here
	 *    we set both, which is probably overkill, but it works.
	 *
	 * PMCR_EL0
	 *    We need to conserve most of the bits here, but we set
	 *    some low bits to specific values:
	 *      LC (bit 6)   1 for ensuring a 64-bit counter (not 32-bit)
	 *      DP (bit 5)   0 to leave the cycle counter accessible
	 *      D (bit 3)    0 to increment the counter every cycle
	 *      C (bit 2)    1 to reset the counter to zero
	 *      E (bit 0)    1 to allow enabling through PMCNTENSET_EL0
	 *    Since we read that register first then write it back, we use
	 *    a fence to ensure no compiler or CPU shenanigans with stale
	 *    information.
	 *
	 * PMCCFILTR_EL0
	 *    Controls whether the cycle counter is incremented in various
	 *    security levels ("exception levels"). For our benchmarking
	 *    purposes we only really need incrementation in userland (EL0)
	 *    but here we just enable counting at all levels. This is done
	 *    by setting bit 27 to 1, and all other bits to 0.
	 *
	 * Reference: Arm Architecture Reference Manual -- Armv8, for
	 * Armv8-A architecture profile, DDI 0487F.c (ID072120) (2020),
	 * section D13.4.
	 */
	(void)data;
	printk(KERN_INFO "enable pmccntr_el0 on CPU %d\n", smp_processor_id());
	asm volatile("msr pmintenclr_el1, %0" : : "r" (BIT(31)));
	asm volatile("msr pmcntenset_el0, %0" : : "r" (BIT(31)));
	asm volatile("msr pmuserenr_el0, %0" : : "r" (BIT(0)|BIT(2)));
	unsigned long x;
	asm volatile("mrs %0, pmcr_el0" : "=r" (x));
	x |= BIT(0) | BIT(2) | BIT(6);
	x &= ~(BIT(3) | BIT(5));
	isb();
	asm volatile("msr pmcr_el0, %0" : : "r" (x));
	asm volatile("msr pmccfiltr_el0, %0" : : "r" (BIT(27)));
}

static void
disable_counter(void *data)
{
	(void)data;
	printk(KERN_INFO "disable pmccntr_el0 on CPU %d\n", smp_processor_id());
	/* PMCNTENSET_EL0 is a set-only register (writing a 0 bit has no
	   effect); disabling goes through its PMCNTENCLR_EL0 counterpart. */
	asm volatile("msr pmcntenclr_el0, %0" : : "r" (BIT(31)));
	asm volatile("msr pmuserenr_el0, %0" : : "r" (0));
}

#elif defined(__riscv)

#include <asm/sbi.h>

static void
enable_counter(void *data)
{
	/*
	 * On RISC-V, we need to authorize read access to the cycle counter
	 * from userland, and also to enable incrementation of the counter
	 * at each cycle. Authorization is done by setting bit 0 of register
	 * scounteren to 1; here, we set all bits of that register, which has
	 * the side-effect of enabling access to all performance counters:
	 * bit 1 should always be set (it's for access to the real-time
	 * clock, the fixed-frequency counter read by rdtime); bit 2 is
	 * for the counter of retired instructions, i.e. the total number of
	 * executed instructions; bits 3 to 31 are for other performance
	 * counters.
	 *
	 * Setting scounteren is not enough. That bit only allows access
	 * to the cycle counter from userland if the supervisor (i.e.
	 * the kernel) can read it. Access to the counter by the
	 * supervisor must also be enabled, which must be done by the
	 * machine-level code (i.e. the hypervisor). Similarly, the
	 * machine-level code should also start the counter, because it
	 * is usually inhibited by default to save some power (i.e. even
	 * with allowed access, the counter seems "stuck" to a fixed
	 * value).
	 * Neither operation can be done by the supervisor itself;
	 * the kernel code must therefore request them from the
	 * machine-level code, which is done through a specific interface
	 * called SBI (Supervisor Binary Interface) (my test board is a
	 * StarFive VisionFive 2 and the hypervisor implementation in it
	 * appears to be OpenSBI).
	 */
	(void)data;
	printk(KERN_INFO "enable_rdcycle on CPU %d\n", smp_processor_id());
	struct sbiret r = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START,
		0, 1, SBI_PMU_START_FLAG_SET_INIT_VALUE, 0, 0, 0);
	printk(KERN_INFO "CPU %d: sbi_ecall() returned %ld, %ld\n",
		smp_processor_id(), r.error, r.value);
	csr_write(CSR_SCOUNTEREN, GENMASK(31, 0));
}

static void
disable_counter(void *data)
{
	/*
	 * Ideally we should also ask the SBI to stop the counter. Here
	 * we just reset allowed access to only the timer (which is accessed
	 * through rdtime from userland).
	 */
	(void)data;
	printk(KERN_INFO "disable_rdcycle on CPU %d\n", smp_processor_id());
	csr_write(CSR_SCOUNTEREN, 0x2);
}

#else

#error This module is for ARMv8 and RISC-V only.

#endif

static int __init
init(void)
{
	on_each_cpu(enable_counter, NULL, 1);
	return 0;
}

static void __exit
fini(void)
{
	on_each_cpu(disable_counter, NULL, 1);
}

MODULE_DESCRIPTION("Enables user-mode access to in-CPU cycle counter");
MODULE_LICENSE("GPL");
module_init(init);
module_exit(fini);
--------------------------------------------------------------------------------
/test_cycle.c:
--------------------------------------------------------------------------------
/*
 * This test program demonstrates access to the cycle counter. It should
 * work on x86 (32-bit and 64-bit), aarch64 and riscv64, though in all
 * three cases some superuser-level operations must first be done to allow
 * access to the counter.
 *
 * The core_cycles() function returns the current value of the cycle
 * counter. The test program uses core_cycles() to perform a measurement
 * of the cost (latency) of integer multiplications; a base integer
 * value should be provided as starting point, then the program
 * multiplies it with itself repeatedly. The starting point is obtained
 * as a program argument to prevent the compiler from optimizing the
 * computations away. Relevant argument values are 0, 1 and 3; 0 and 1
 * exercise "special cases" (i.e. values for which a variable-time
 * multiplier is likely to return "early") while 3 will use more-or-less
 * pseudorandom values and should thus exercise the "general case".
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#if defined __x86_64__ || defined _M_X64 || defined __i386__ || defined _M_IX86
/*
 * x86: the cycle counter is accessible with the rdpmc instruction. Access
 * must first be allowed, which, on Linux, is done with:
 *    echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
 * (only root can perform this write; the setting is "volatile" in that
 * it lasts only until the next reboot).
 */
#include <immintrin.h>
#ifdef _MSC_VER
/* On Windows, the intrinsic is called __readpmc(), not __rdpmc(). But it
   will usually crash, since Windows does not enable access to the
   performance counters. */
#ifndef __rdpmc
#define __rdpmc   __readpmc
#endif
#else
#include <x86intrin.h>
#endif
#if defined __GNUC__ || defined __clang__
__attribute__((target("sse2")))
#endif
static inline uint64_t
core_cycles(void)
{
	_mm_lfence();
	return __rdpmc(0x40000001);
}

#elif defined __aarch64__ && (defined __GNUC__ || defined __clang__)
/*
 * ARMv8, 64-bit (aarch64): the cycle counter is pmccntr_el0; it must be
 * enabled through dedicated kernel code.
 */
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("dsb sy\n\tmrs %0, pmccntr_el0" : "=r" (x) : : );
	return x;
}

#elif defined __riscv && defined __riscv_xlen && __riscv_xlen >= 64
/*
 * RISC-V, 64-bit (rv64gc): the cycle counter is read with the
 * pseudo-instruction rdcycle (which is just an alias for csrrs with
 * the appropriate register identifier). The cycle counter must be enabled
 * and its userland access allowed, which requires machine-level actions
 * that can be triggered from dedicated kernel code.
 */
static inline uint64_t
core_cycles(void)
{
	/* We don't use a memory fence here because the RISC-V ISA
	   already requires the CPU to enforce appropriate ordering for
	   this access. */
	uint64_t x;
	__asm__ __volatile__ ("rdcycle %0" : "=r" (x));
	return x;
}

#else
#error Architecture is not supported.
#endif
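/*
 * Measurement methodology (applies to all three tests below): each test
 * runs 120 timed batches of multiplications; the first 20 batches are
 * dropped as warm-up, and the remaining 100 batch timings are sorted so
 * that the median (tt[50]) can be reported, scaled down by the number of
 * multiplications per batch (20000 or 8000).
 */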
static int
cmp_u64(const void *v1, const void *v2)
{
	uint64_t x1 = *(const uint64_t *)v1;
	uint64_t x2 = *(const uint64_t *)v2;
	if (x1 < x2) {
		return -1;
	} else if (x1 == x2) {
		return 0;
	} else {
		return 1;
	}
}

int
main(int argc, char *argv[])
{
	if (argc != 2) {
		fprintf(stderr, "usage: test_cycle [ 0 | 1 | 3 ]\n");
		exit(EXIT_FAILURE);
	}
	int st = atoi(argv[1]);

	uint64_t tt[100];

	/* 32-bit multiplications. */
	uint32_t x32 = (uint32_t)st;
	uint32_t y32 = x32;
	for (int i = 0; i < 100; i ++) {
		y32 *= x32;
	}
	x32 = y32;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("32x32->32 muls: %7.3f\n", (double)tt[50] / 20000.0);

	/* 64-bit multiplications. */
	uint64_t x64 = (uint64_t)x32;
	x64 *= x64 * x64;
	uint64_t y64 = x64;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("64x64->64 muls: %7.3f\n", (double)tt[50] / 20000.0);
#if (defined __GNUC__ || defined __clang__) && defined __SIZEOF_INT128__
	/* 64x64->128 multiplications.
	   We really measure the latency of access to the upper half of
	   the result. To prevent the values from drifting toward small
	   numbers, the inputs are refreshed at each inner-loop iteration
	   by XORing back in the original values, whose top bit is set
	   (unless the source was 0 or 1). */
	uint64_t t64 = (uint64_t)((y64 >> 1) != 0) << 63;
	x64 |= t64;
	y64 |= t64;
	uint64_t x64orig = x64;
	uint64_t y64orig = y64;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x64 ^= x64orig;
			y64 ^= y64orig;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("64x64->128 muls: %7.3f\n", (double)tt[50] / 8000.0);
#endif

	/* Get some bytes from the final value and print them out; this
	   should prevent the compiler from optimizing away the
	   multiplications. */
	unsigned x = 0;
	for (int i = 0; i < 8; i ++) {
		x ^= (unsigned)x64;
		x64 >>= 8;
	}
	printf("(%u)\n", x & 0xFF);
	return 0;
}
--------------------------------------------------------------------------------