├── LICENSE
├── README.md
├── cyccnt
│   ├── Makefile
│   └── cyccnt.c
└── test_cycle.c

/LICENSE:
--------------------------------------------------------------------------------
This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <https://unlicense.org>
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# CPU Cycle Counter

Most modern CPUs have an in-silicon cycle counter which, as the name
indicates, counts the number of clock cycles elapsed since some
arbitrary instant. It shall be noted that such counters are often
distinct from "time stamp counters". For instance, x86 CPUs have
featured a sort-of cycle counter, which can be read with the `rdtsc`
opcode, since the original Pentium (introduced in 1993). Such time stamp
counters originally matched the cycle counter, but this is no longer
true now that frequency scaling is a thing: CPUs will commonly lower or
raise their operating frequency depending on current load and
operational conditions (temperature, input voltage...), while
maintaining a fixed update frequency for the time stamp counter.
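For illustration, reading the time stamp counter requires no special
privilege on x86; here is a minimal sketch for GCC or Clang (the helper
name `time_stamp` is mine, not part of this repository):

```c
#include <stdint.h>
#include <x86intrin.h>

/* Read the x86 time stamp counter. On modern CPUs this ticks at a
   fixed rate regardless of the current core frequency, so it is not
   a count of actual clock cycles. */
static inline uint64_t
time_stamp(void)
{
	return __rdtsc();
}
```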
"Normal" applications are expected to use the time stamp counter, which
is convenient for time synchronization tasks (e.g. ensuring that video
frames show up on the screen at the right time) since the time stamp
counter is expected to track the time experienced in the physical world.
Accessing the cycle counter is useful for debugging and optimization
purposes, specifically for working out small, tight loops; this is the
case, for instance, when implementing cryptographic primitives.

While the in-CPU cycle counter exists, access to it is normally forbidden
by the operating system for "security reasons"; namely, direct and
precise access to that counter helps in running timing attacks that
would allow an unprivileged process to extract secret values from other
processes running as different users. Of course, this attack model makes
sense only on multi-user machines; if you are trying to optimize a tight
loop on a test system which is sitting on your desk, then chances are
that there is a single user of that test system and that's you.
Nevertheless, operating system vendors are trying real hard to prevent
direct access to the cycle counter, and explain that if you want
performance counters then you should use the performance counter APIs,
which are a set of system calls and ad hoc structures by which you can
have the kernel read such counters for you, and report back. [In the
case of Linux](https://web.eece.maine.edu/~vweaver/projects/perf_events/),
this really entails using a specific system call to set up some special
file descriptors; the value of any performance counter can then either be
read with a `read()` system call, or possibly sampled at a periodic
frequency (e.g. 1000 Hz) and transmitted back to the userland process
via some shared memory. On other operating systems, a similar but of
course incompatible and possibly undocumented API can achieve about the
same result (e.g. on
[macOS](https://gist.github.com/ibireme/173517c208c7dc333ba962c1f0d67d12)).
Of course, access to these monitoring system calls normally requires
superuser privilege.
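To give an idea of what that looks like on Linux, here is a minimal
sketch (error handling elided; the helper names are illustrative, not
part of this repository) that opens a cycle counter through the
`perf_event_open()` system call and reads it back:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a file descriptor counting CPU cycles for the calling process
   (pid = 0) on any CPU (cpu = -1). There is no glibc wrapper for
   perf_event_open(), hence the raw syscall. */
static int
open_cycle_counter(void)
{
	struct perf_event_attr attr;
	memset(&attr, 0, sizeof attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof attr;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;
	return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Each read costs a full system call, which is exactly why this
   interface is inadequate for timing short instruction sequences. */
static uint64_t
read_cycle_counter(int fd)
{
	uint64_t x = 0;
	(void)read(fd, &x, sizeof x);
	return x;
}
```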
For tight loop optimization, which can be quite hairy in a world where
CPUs implement out-of-order execution, the performance monitoring system
calls are inadequate, since we really are interested in some short
sequences whose execution uses less time than an average system call,
let alone the million or so clock cycles that would elapse between two
regular sampling events. For these tasks, you really need to read the
cycle counter directly from userland, with the smallest possible
sequence of instructions. This repository contains notes and some extra
software (e.g. a Linux kernel module) to do just that.

Remember that:

  - Interpretation of a cycle count is delicate. You really need to
    consider both bandwidth and latency of instructions, and how they
    get executed on your CPU. For x86 machines, I recommend perusing
    Agner Fog's [optimization manuals](https://agner.org/optimize/),
    especially the [microarchitecture
    description](https://agner.org/optimize/microarchitecture.pdf).

  - Allowing access to performance counters *does* allow some precise
    timing attacks -- to be precise, these attacks were mostly feasible
    without the performance counters, but become quite a bit easier. So,
    don't do that on a multi-tenant machine (though the wisdom of
    multi-tenant machines on modern hardware is a bit questionable these
    days).

  - On some of the smallest embedded systems (microcontrollers),
    maintaining the cycle counter can incur a noticeable increase in
    power draw; you'd prefer to reserve that for development boards, not
    production hardware.

  - Cycle counts are per-core. If the operating system decides to
    migrate your process from one core to another, then cycle counts
    will apparently "jump". This is in general not much of a problem for
    development, because the OS tries not to migrate processes (it's
    expensive). You can instruct the OS to "tie" a thread to a specific
    CPU core with the
    [`sched_setaffinity()`](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html)
    system call (see the sketch after this list); you will want to do
    that if you have asymmetric hardware with "efficiency" and
    "performance" cores, so that you can bench performance on either
    type of core.
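A minimal pinning sketch for Linux (the helper name `pin_to_core` is
mine, not part of this repository):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread (pid = 0) to the given CPU core. Returns 0
   on success, -1 on error. */
static int
pin_to_core(int core)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return sched_setaffinity(0, sizeof set, &set);
}
```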
The [`test_cycle.c`](test_cycle.c) file is a demonstration application
that more-or-less expects to run on a Linux-like system; it uses the
cycle counter to benchmark the speed of integer multiplications. Compile
it and use it with a numerical parameter:

```
$ clang -W -Wextra -O2 -o test_cycle test_cycle.c
$ ./test_cycle 3
32x32->32 muls:   2.000
64x64->64 muls:   4.025
64x64->128 muls:   5.499
(225)
```

The parameter should be 0, 1 or 3; with 0 or 1, you mostly bench the
speed of multiplications by 0 or 1, and with 3, you mostly bench the
speed of multiplications by "large values". Here, on an ARM Cortex-A76
CPU, we see that 32-bit multiplications complete in 2 cycles, while
64-bit multiplications complete in 4 cycles for the low output word, and
5 cycles for the high output word (the extra ".499" is some loop
overhead). On that particular CPU, these multiplications are
constant-time; this is not the case on, for instance, an ARM Cortex-A55:

```
$ ./test_cycle 1
32x32->32 muls:   3.001
64x64->64 muls:   4.001
64x64->128 muls:   6.127
(0)
$ ./test_cycle 3
32x32->32 muls:   3.001
64x64->64 muls:   5.001
64x64->128 muls:   6.127
(225)
```

We see here that on the A55, 64-bit multiplications return the low word
of the output earlier when the operands are mathematically small enough
(and this is a problem for cryptographic schemes; the early return may
allow secret-revealing timing attacks).

If you try this program on your machine, then chances are that it will
crash with an "illegal instruction" error, or something similar. This is
because access to the cycle counter must first be allowed, which requires
at least superuser access, as described below.

# x86

On x86 CPUs, the cycle counter can be read with the `rdpmc` opcode, using
the register `0x40000001`. A typical access would use this function:

```c
static inline uint64_t
core_cycles(void)
{
	_mm_lfence();
	return __rdpmc(0x40000001);
}
```

Note the use of a memory fence to ensure that the instruction executes in
a reasonably predictable position within the instruction sequence.

On a Linux system, to allow `rdpmc` to complete without crashing, the
access must first be authorized by root with:

```
echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
```

Once this has been done, unprivileged userland processes can read the
cycle counter without crashing. The setting "sticks" until the next
reboot. On my test/research system (running Ubuntu), I do that
automatically at boot time with an `@reboot` crontab.
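For instance, a line like the following in root's crontab (edited with
`sudo crontab -e`) would re-apply the setting at each boot:

```
@reboot echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
```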
I don't have any matching solution for Windows (sorry). On macOS the
situation is hardly better: you would need a kernel extension (macOS's
notion of a kernel module), and rumour has it that there is one
somewhere in [Intel Performance Counter
Monitor](https://github.com/intel/pcm), but I have not tried it. Also,
macOS makes it difficult to load custom kernel extensions: you have to
reboot into some recovery OS and disable some checks. Also note that
virtual machines won't save you here; regardless of what you do in a VM,
you won't be able to access the cycle counter if the host does not
agree.

# ARMv8

On ARMv8 (ARMv8-A in 64-bit mode, aka "aarch64" or "arm64"), the cycle
counter is read with a bit of inline assembly:

```c
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("dsb sy\n\tmrs %0, pmccntr_el0" : "=r" (x) : : );
	return x;
}
```

The `dsb` opcode is a memory fence, just like in the x86 case. For the
cycle counter to be accessible, it must be enabled, and unprivileged
reads must be authorized; both operations must be done in supervisor
mode (i.e. in the kernel). On Linux, you can use the [cyccnt](cyccnt)
custom module from this repository. Namely (the commands are summarized
after this list):

  - You need to install the kernel headers that match your kernel.
    Possibly this is already done; otherwise, your distribution may
    provide a convenient package that is kept in sync with the kernel
    itself. On Ubuntu, try the `linux-headers-generic` package.

  - Go to the `cyccnt` directory and type `make`. If all goes well it
    should produce some files, including the module itself, which will
    be called `cyccnt.ko`.

  - Loading the module is done with `insmod cyccnt.ko`. This is where
    root is needed, so presumably type `sudo insmod cyccnt.ko`.
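In short, assuming the kernel headers are already installed:

```
$ cd cyccnt
$ make
$ sudo insmod cyccnt.ko
```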
Once the module is loaded, you can see in the kernel messages (`dmesg`
command) something like:

```
[  901.256454] enable pmccntr_el0 on CPU 0
[  901.256454] enable pmccntr_el0 on CPU 1
[  901.256454] enable pmccntr_el0 on CPU 2
[  901.256454] enable pmccntr_el0 on CPU 3
```

which means that the module was indeed loaded and enabled the cycle
counter on all four CPU cores. The access remains allowed until the
module is unloaded (with `rmmod`), or the next reboot.

**Idle states:** on my Raspberry Pi 5, running Ubuntu 24.04, this
is sufficient. On another system, an ODROID C4 running an older kernel
(version "4.9.337-13"), the CPU cores happen to "forget" their setting
whenever they go idle, which happens often. On that system, the CPU idle
state must be disabled on each core (the sequence is: disable the idle
state, *then* load the kernel module):

```
echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu1/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state1/disable
echo 1 > /sys/devices/system/cpu/cpu3/cpuidle/state1/disable
```

I don't know if that is a quirk of that particular hardware, or of the
older kernel version.

# RISC-V

On a 64-bit RISC-V system, the cycle counter is read with the `rdcycle`
instruction:

```c
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("rdcycle %0" : "=r" (x));
	return x;
}
```

Note that there is no memory fence here; the [RISC-V instruction set
manuals](https://lf-riscv.atlassian.net/wiki/spaces/HOME/pages/16154769/RISC-V+Technical+Specifications)
sort of explain that the CPU is supposed to enforce proper synchronization
of these instructions with regard to the sequence of instructions that the
core executes, so that no memory fence should be needed on cores with
out-of-order execution. I am not entirely sure that I read it correctly
and I do not have a RISC-V CPU with out-of-order execution to test it on.

For accessing the cycle counter, the situation is similar to that of
ARMv8: the `cyccnt` kernel module should be used. The process actually
needs the cooperation of both the supervisor (kernel) level and the
machine level (the hypervisor). The kernel can ask the hypervisor to
enable some performance counters through an API called
[SBI](https://lists.riscv.org/g/tech-brs/attachment/361/0/riscv-sbi.pdf).
In `cyccnt` this is done with the `sbi_ecall()` function; it has been
tested on a grand total of one (1) test board (a [StarFive VisionFive
2](https://www.starfivetech.com/en/site/boards) using Ubuntu 24.04).

Here again, the access remains allowed until the module is unloaded or
the system is rebooted.
--------------------------------------------------------------------------------
/cyccnt/Makefile:
--------------------------------------------------------------------------------
obj-m := cyccnt.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean
--------------------------------------------------------------------------------
/cyccnt/cyccnt.c:
--------------------------------------------------------------------------------
/*
 * This module activates the in-CPU cycle counter and enables direct access
 * from unprivileged (userland) code, for benchmark purposes. It can be used
 * on ARMv8 (aarch64) and some RISC-V (riscv64) systems.
 */

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>

#if defined(__aarch64__)

static void
enable_counter(void *data)
{
	/*
	 * ARMv8: the counter is PMCCNTR_EL0. Its behaviour and access is
	 * controlled by several other registers:
	 *
	 * PMINTENCLR_EL1
	 *    Controls generation of interrupts on cycle counter overflow.
	 *    We do not want such interrupts; to disable them, we write
	 *    1 in bit 31 of the register.
	 *
	 * PMCNTENSET_EL0
	 *    Controls whether the cycle counter is active. We enable the
	 *    cycle counter by writing 1 in bit 31.
	 *
	 * PMUSERENR_EL0
	 *    Controls whether reading from the cycle counter is allowed
	 *    from userland. Either bit 0 or bit 2 needs to be set. Here
	 *    we set both, which is probably overkill, but it works.
	 *
	 * PMCR_EL0
	 *    We need to conserve most of the bits here, but we set
	 *    some low bits to specific values:
	 *      LC (bit 6)   1 for ensuring a 64-bit counter (not 32-bit)
	 *      DP (bit 5)   0 to leave the cycle counter accessible
	 *      D (bit 3)    0 to increment the counter every cycle
	 *      C (bit 2)    1 to reset the counter to zero
	 *      E (bit 0)    1 to allow enabling through PMCNTENSET_EL0
	 *    Since we read that register first then write it back, we use
	 *    a fence to ensure no compiler or CPU shenanigans with stale
	 *    information.
	 *
	 * PMCCFILTR_EL0
	 *    Controls whether the cycle counter is incremented in various
	 *    security levels ("exception levels"). For our benchmarking
	 *    purposes we only really need incrementation in userland (EL0)
	 *    but here we just enable counting at all levels. This is done
	 *    by setting bit 27 to 1, and all other bits to 0.
	 *
	 * Reference: Arm Architecture Reference Manual -- Armv8, for
	 * Armv8-A architecture profile, DDI 0487F.c (ID072120) (2020),
	 * section D13.4.
	 */
	(void)data;
	printk(KERN_INFO "enable pmccntr_el0 on CPU %d\n", smp_processor_id());
	asm volatile("msr pmintenclr_el1, %0" : : "r" (BIT(31)));
	asm volatile("msr pmcntenset_el0, %0" : : "r" (BIT(31)));
	asm volatile("msr pmuserenr_el0, %0" : : "r" (BIT(0)|BIT(2)));
	unsigned long x;
	asm volatile("mrs %0, pmcr_el0" : "=r" (x));
	x |= BIT(0) | BIT(2) | BIT(6);
	x &= ~(BIT(3) | BIT(5));
	isb();
	asm volatile("msr pmcr_el0, %0" : : "r" (x));
	asm volatile("msr pmccfiltr_el0, %0" : : "r" (BIT(27)));
}

static void
disable_counter(void *data)
{
	(void)data;
	printk(KERN_INFO "disable pmccntr_el0 on CPU %d\n", smp_processor_id());
	/* PMCNTENSET_EL0 is a set-only register (writing a 0 bit has no
	   effect); disabling goes through its PMCNTENCLR_EL0 counterpart. */
	asm volatile("msr pmcntenclr_el0, %0" : : "r" (BIT(31)));
	asm volatile("msr pmuserenr_el0, %0" : : "r" (0));
}

#elif defined(__riscv)

#include <asm/sbi.h>

static void
enable_counter(void *data)
{
	/*
	 * On RISC-V, we need to authorize read access to the cycle counter
	 * from userland, and also to enable incrementation of the counter
	 * at each cycle. Authorization is done by setting bit 0 of register
	 * scounteren to 1; here, we set all bits of that register, which has
	 * the side-effect of enabling access to all performance counters:
	 * bit 1 should always be set (it's for access to the real-time
	 * clock, the fixed-frequency counter read by rdtime); bit 2 is
	 * for the counter of retired instructions, i.e. the total number of
	 * executed instructions; bits 3 to 31 are for other performance
	 * counters.
	 *
	 * Setting scounteren is not enough. That bit only allows access
	 * to the cycle counter from userland if the supervisor (i.e.
	 * the kernel) can read it. Access to the counter by the
	 * supervisor must also be enabled, which must be done by the
	 * machine-level code (i.e. the hypervisor). Similarly, the
	 * machine-level code should also start the counter, because it
	 * is usually inhibited by default to save some power (i.e. even
	 * with allowed access, the counter seems "stuck" to a fixed
	 * value).
	 * Neither operation can be done by the supervisor itself;
	 * the kernel code must therefore request them from the
	 * machine-level code, which is done through a specific interface
	 * called SBI (Supervisor Binary Interface) (my test board is a
	 * StarFive VisionFive 2 and the hypervisor implementation in it
	 * appears to be OpenSBI).
	 */
	(void)data;
	printk(KERN_INFO "enable_rdcycle on CPU %d\n", smp_processor_id());
	struct sbiret r = sbi_ecall(SBI_EXT_PMU, SBI_EXT_PMU_COUNTER_START,
		0, 1, SBI_PMU_START_FLAG_SET_INIT_VALUE, 0, 0, 0);
	printk(KERN_INFO "CPU %d: sbi_ecall() returned %ld, %ld\n",
		smp_processor_id(), r.error, r.value);
	csr_write(CSR_SCOUNTEREN, GENMASK(31, 0));
}

static void
disable_counter(void *data)
{
	/*
	 * Ideally we should also ask the SBI to stop the counter. Here
	 * we just reset allowed access to only the timer (which is accessed
	 * through rdtime from userland).
	 */
	(void)data;
	printk(KERN_INFO "disable_rdcycle on CPU %d\n", smp_processor_id());
	csr_write(CSR_SCOUNTEREN, 0x2);
}

#else

#error This module is for ARMv8 and RISC-V only.

#endif

static int __init
init(void)
{
	on_each_cpu(enable_counter, NULL, 1);
	return 0;
}

static void __exit
fini(void)
{
	on_each_cpu(disable_counter, NULL, 1);
}

MODULE_DESCRIPTION("Enables user-mode access to in-CPU cycle counter");
MODULE_LICENSE("GPL");
module_init(init);
module_exit(fini);
--------------------------------------------------------------------------------
/test_cycle.c:
--------------------------------------------------------------------------------
/*
 * This test program demonstrates access to the cycle counter. It should
 * work on x86 (32-bit and 64-bit), aarch64 and riscv64, though in all
 * three cases some superuser-level operations must first be done to allow
 * access to the counter.
 *
 * The core_cycles() function returns the current value of the cycle
 * counter. The test program uses core_cycles() to perform a measurement
 * of the cost (latency) of integer multiplications; a base integer
 * value should be provided as starting point, then the program
 * multiplies it with itself repeatedly. The starting point is obtained
 * as a program argument to prevent the compiler from optimizing the
 * computations away. Relevant argument values are 0, 1 and 3; 0 and 1
 * exercise "special cases" (i.e. values for which a variable-time
 * multiplier is likely to return "early") while 3 will use more-or-less
 * pseudorandom values and should thus exercise the "general case".
 */

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#if defined __x86_64__ || defined _M_X64 || defined __i386__ || defined _M_IX86
/*
 * x86: the cycle counter is accessible with the rdpmc instruction. Access
 * must first be allowed, which, on Linux, is done with:
 *    echo 2 > /sys/bus/event_source/devices/cpu/rdpmc
 * (only root can perform this write; the setting is "volatile" in that
 * it lasts only until the next reboot).
 */
#include <immintrin.h>
#ifdef _MSC_VER
/* On Windows, the intrinsic is called __readpmc(), not __rdpmc(). But it
   will usually crash, since Windows does not enable access to the
   performance counters. */
#ifndef __rdpmc
#define __rdpmc   __readpmc
#endif
#else
#include <x86intrin.h>
#endif
#if defined __GNUC__ || defined __clang__
__attribute__((target("sse2")))
#endif
static inline uint64_t
core_cycles(void)
{
	_mm_lfence();
	return __rdpmc(0x40000001);
}

#elif defined __aarch64__ && (defined __GNUC__ || defined __clang__)
/*
 * ARMv8, 64-bit (aarch64): the cycle counter is pmccntr_el0; it must be
 * enabled through dedicated kernel code.
 */
static inline uint64_t
core_cycles(void)
{
	uint64_t x;
	__asm__ __volatile__ ("dsb sy\n\tmrs %0, pmccntr_el0" : "=r" (x) : : );
	return x;
}

#elif defined __riscv && defined __riscv_xlen && __riscv_xlen >= 64
/*
 * RISC-V, 64-bit (rv64gc): the cycle counter is read with the
 * pseudo-instruction rdcycle (which is just an alias for csrrs with
 * the appropriate register identifier). The cycle counter must be enabled
 * and its userland access allowed, which requires machine-level actions
 * that can be triggered from dedicated kernel code.
 */
static inline uint64_t
core_cycles(void)
{
	/* We don't use a memory fence here because the RISC-V ISA
	   already requires the CPU to enforce appropriate ordering for
	   this access. */
	uint64_t x;
	__asm__ __volatile__ ("rdcycle %0" : "=r" (x));
	return x;
}

#else
#error Architecture is not supported.
#endif
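/*
 * Measurement methodology (applies to all three tests below): each test
 * runs 120 timed batches of multiplications; the first 20 batches are
 * dropped as warm-up, and the remaining 100 batch timings are sorted so
 * that the median (tt[50]) can be reported, scaled down by the number of
 * multiplications per batch (20000 or 8000).
 */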
static int
cmp_u64(const void *v1, const void *v2)
{
	uint64_t x1 = *(const uint64_t *)v1;
	uint64_t x2 = *(const uint64_t *)v2;
	if (x1 < x2) {
		return -1;
	} else if (x1 == x2) {
		return 0;
	} else {
		return 1;
	}
}

int
main(int argc, char *argv[])
{
	if (argc != 2) {
		fprintf(stderr, "usage: test_cycle [ 0 | 1 | 3 ]\n");
		exit(EXIT_FAILURE);
	}
	int st = atoi(argv[1]);

	uint64_t tt[100];

	/* 32-bit multiplications. */
	uint32_t x32 = (uint32_t)st;
	uint32_t y32 = x32;
	for (int i = 0; i < 100; i ++) {
		y32 *= x32;
	}
	x32 = y32;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
			x32 *= y32;
			y32 *= x32;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("32x32->32 muls: %7.3f\n", (double)tt[50] / 20000.0);

	/* 64-bit multiplications. */
	uint64_t x64 = (uint64_t)x32;
	x64 *= x64 * x64;
	uint64_t y64 = x64;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
			x64 *= y64;
			y64 *= x64;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("64x64->64 muls: %7.3f\n", (double)tt[50] / 20000.0);
#if (defined __GNUC__ || defined __clang__) && defined __SIZEOF_INT128__
	/* 64x64->128 multiplications.
	   We really measure the latency of access to the upper half of
	   the result. To prevent the values from drifting toward small
	   numbers, the inputs are refreshed at each inner-loop iteration
	   by XORing back in the original values, whose top bit is set
	   (unless the source was 0 or 1). */
	uint64_t t64 = (uint64_t)((y64 >> 1) != 0) << 63;
	x64 |= t64;
	y64 |= t64;
	uint64_t x64orig = x64;
	uint64_t y64orig = y64;
	for (size_t i = 0; i < 120; i ++) {
		uint64_t begin = core_cycles();
		for (int j = 0; j < 1000; j ++) {
			x64 ^= x64orig;
			y64 ^= y64orig;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
			x64 = ((unsigned __int128)x64 * y64) >> 64;
			y64 = ((unsigned __int128)y64 * x64) >> 64;
		}
		uint64_t end = core_cycles();
		if (i >= 20) {
			tt[i - 20] = end - begin;
		}
	}
	qsort(tt, 100, sizeof(uint64_t), &cmp_u64);
	printf("64x64->128 muls: %7.3f\n", (double)tt[50] / 8000.0);
#endif

	/* Get some bytes from the final value and print them out; this
	   should prevent the compiler from optimizing away the
	   multiplications. */
	unsigned x = 0;
	for (int i = 0; i < 8; i ++) {
		x ^= (unsigned)x64;
		x64 >>= 8;
	}
	printf("(%u)\n", x & 0xFF);
	return 0;
}
--------------------------------------------------------------------------------