├── .gitignore
├── LICENSE
├── LowOverheadTimersTests
│   ├── README.md
│   ├── SetupPowerLevelCounters.sh
│   ├── SetupUserKernelCounters.sh
│   ├── build_timer_tests.sh
│   ├── counter_test_epilog.c
│   ├── counter_test_epilog_32.c
│   ├── counter_test_prolog.c
│   ├── counter_test_prolog_32.c
│   ├── run_timer_test_ensemble.sh
│   ├── summarize.sh
│   └── test_timer_overhead.c
├── README.md
├── low_overhead_timers.c
└── low_overhead_timers.h

/.gitignore:
--------------------------------------------------------------------------------
1 | *.o
2 | *.s
3 | *.optrpt
4 | *.exe
5 | log.*
6 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | BSD 3-Clause License
 2 | 
 3 | Copyright (c) 2018, John D McCalpin and University of Texas at Austin
 4 | All rights reserved.
 5 | 
 6 | Redistribution and use in source and binary forms, with or without
 7 | modification, are permitted provided that the following conditions are met:
 8 | 
 9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 | 
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 | 
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 
--------------------------------------------------------------------------------
/LowOverheadTimersTests/README.md:
--------------------------------------------------------------------------------
 1 | Test driver and scripts for the low_overhead_timers.c timer library.
 2 | 
 3 | The program "test_timer_overhead.c" tests all of the interfaces in
 4 | "low_overhead_timers.c" with 64 repeated calls, then reports the minimum,
 5 | average, and maximum counter deltas (excluding the first iteration).
 6 | The program also includes tests of the same counters read directly using
 7 | inline assembly macros for comparison.  The test_timer_overhead code is set
 8 | up so that it can be compiled in either inline or separate compilation mode.
 9 | 
10 | The script "build_timer_tests.sh" will compile four versions of the
11 | test_timer_overhead program -- one with inlining and one with separate
12 | compilation, for each of the Intel (icc) and GNU (gcc) compilers.
13 | 
14 | The script "run_timer_test_ensemble.sh" will run each of these versions 10
15 | times and save the results.
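For reference, the measurement pattern is simply a series of back-to-back
timer reads followed by statistics on the consecutive differences.  A minimal
sketch of that pattern, using only the rdtscp() interface and compiled/linked
against low_overhead_timers.c (the array and variable names here are
illustrative -- they are not the ones used in test_timer_overhead.c):

    #include <stdio.h>
    #include "low_overhead_timers.h"

    #define NSAMPLES 64

    int main(void)
    {
        unsigned long samples[NSAMPLES];
        unsigned long delta, min = ~0UL, max = 0, sum = 0;
        int i;

        for (i = 0; i < NSAMPLES; i++)
            samples[i] = rdtscp();              /* back-to-back timer reads */

        for (i = 2; i < NSAMPLES; i++) {        /* skip the first delta, which is often slow */
            delta = samples[i] - samples[i-1];
            if (delta < min) min = delta;
            if (delta > max) max = delta;
            sum += delta;
        }
        printf("rdtscp overhead: min %lu  avg %.1f  max %lu (TSC cycles)\n",
               min, (double)sum / (NSAMPLES - 2), max);
        return 0;
    }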
16 | 
17 | The script "summarize.sh" will take a result file produced by
18 | "run_timer_test_ensemble.sh" and compute the average of the average
19 | values (after excluding the slowest result of the ensemble).
20 | 
21 | Building for icc
22 | IMPORTANT NOTES:
23 | 1. Choose a target architecture that does not include 256-bit or 512-bit SIMD support.
24 |    (Any SSE target will do -- the icc default for Linux is -msse2.)
25 | 2. The compiler flag -nolib-inline prevents the compiler from generating calls to
26 |    memset() or memcpy() that might contain 256-bit or 512-bit instructions.
27 | 
28 | Building for gcc
29 | 1. Choose a target architecture that does not include 256-bit or 512-bit SIMD support.
30 |    (Any SSE target will do -- gcc defaults to -msse or -msse2.)
31 | 2. The flag -fno-tree-loop-distribute-patterns prevents the compiler from
32 |    generating most calls to memset() or memcpy() (which might include 256-bit
33 |    or 512-bit instructions).
34 | 
35 | Other compilers should be similar to gcc, but have not been carefully tested.
36 | On Mac OS X, the clang compiler does not understand the gcc "-fno-tree-loop-distribute-patterns"
37 | flag, but otherwise appears to work.
38 | 
39 | 
--------------------------------------------------------------------------------
/LowOverheadTimersTests/SetupPowerLevelCounters.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | # You will need to put in the path to the msrtools binaries here
 4 | # if they are not in your default path.
 5 | # The msrtools commands almost always require root privileges.
 6 | WRMSR=wrmsr
 7 | 
 8 | # -------------------------------------------------------------------
 9 | # PMC0 on each core is set to record actual cycles not halted (in
10 | # either user or kernel mode).  This event is "architectural", so
11 | # it should work on almost all Intel processors.
12 | # -------------------------------------------------------------------
13 | # The next three events are specific to the Intel Xeon Scalable
14 | # Processor family (Skylake Xeon).
15 | # --> These event encodings might produce non-zero results on
16 | #     other Intel processors (Event 0x28 was used for L1D cache
17 | #     writebacks to L2 on Nehalem/Westmere/SandyBridge/IvyBridge),
18 | #     but these specific values are intended for SKX only.
19 | # -------------------------------------------------------------------
20 | # The desired result for the test_timer_overhead program using these
21 | # Skylake Xeon counters is that PMC0 ("actual cycles not halted")
22 | # matches PMC1 ("core power level 0 cycles"), and that PMC2
23 | # ("core power level 1 cycles") and PMC3 ("core power level 2 cycles")
24 | # counters are zero.
25 | # If PMC2 and PMC3 are *not* zero, it means the power control unit in
26 | # the processor halted the core while it adjusted the voltage
27 | # and activated the 256-bit or 512-bit pipelines.  The timing
28 | # of this halt is unpredictable, and the duration of the halt
29 | # can be 20,000 or more core cycles.
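#
# For reference, the values written below follow the standard IA32_PERFEVTSELx
# layout (these are the event-select MSRs at addresses 0x186-0x189):
#   bits  7:0   event select          bits 15:8  unit mask (umask)
#   bit  16     count in user mode (USR)
#   bit  17     count in kernel mode (OS)
#   bit  22     enable the counter (EN)
# For example, 0x0043003c = event 0x3c, umask 0x00, USR+OS+EN -- the
# architectural "unhalted core cycles" event counted in both user and
# kernel mode.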
30 | # ------------------------------------------------------------------- 31 | 32 | $WRMSR -a 0x186 0x0043003c # actual cycles not halted 33 | $WRMSR -a 0x187 0x00430728 # core power level 0 cycles 34 | $WRMSR -a 0x188 0x00431828 # core power level 1 cycles 35 | $WRMSR -a 0x189 0x00432028 # core power level 2 cycles 36 | -------------------------------------------------------------------------------- /LowOverheadTimersTests/SetupUserKernelCounters.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # You will need to put in the path to the msrtools binaries here 4 | # if they are not in your default path. 5 | # The msrtools commands almost always require root privileges. 6 | WRMSR=wrmsr 7 | 8 | # ------------------------------------------------------------------------- 9 | # This set of events is intended to help find instances in which the 10 | # timer overhead test was contaminated by OS activity. 11 | # 12 | # PMC0 is set to measure actual cycles not halted in user or kernel mode. 13 | # This is an "architectural" event that should work on all Intel processors. 14 | # PMC1 is set to measure actual cycles not halted in kernel mode. 15 | # This is an "architectural" event that should work on all Intel processors. 16 | # PMC2 is set to measure instructions retired in kernel mode. 17 | # This is an "architectural" event that should work on all Intel processors. 18 | # PMC3 is set to measure interrupts received (Skylake and later cores only) 19 | # PMC3 may report non-zero results on earlier Intel processors, but 20 | # those results will mean something unrelated to what I want to measure 21 | # here. 22 | # 23 | # A "clean" result has non-zero results in PMC0 and zero results in the 24 | # other three counters. 25 | # For processors earlier than Skylake, ignore the results for PMC3. 26 | # ------------------------------------------------------------------------- 27 | 28 | $WRMSR -a 0x186 0x0043003c # actual cycles not halted (user + kernel) 29 | $WRMSR -a 0x187 0x0042003c # actual cycles not halted (kernel only) 30 | $WRMSR -a 0x188 0x004200c0 # instructions retired (kernel only) 31 | $WRMSR -a 0x189 0x004301cb # interrupts received 32 | -------------------------------------------------------------------------------- /LowOverheadTimersTests/build_timer_tests.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | which icc >& /dev/null 4 | if [ $? -ne 0 ] 5 | then 6 | echo "Intel icc compiler not found, skipping...." 7 | else 8 | echo "compiling externally linked version with icc" 9 | icc --version 10 | icc -O2 -msse2 -nolib-inline -DUSE_PAUSE -c ../low_overhead_timers.c -o low_overhead_timers_icc.o 11 | icc -O2 -msse2 -nolib-inline -DUSE_PAUSE -I.. test_timer_overhead.c low_overhead_timers_icc.o -qopt-report=5 -o timer_ovhd_external.icc.exe 12 | mv test_timer_overhead.optrpt test_timer_overhead.external.optrpt 13 | icc -O2 -msse2 -nolib-inline -DUSE_PAUSE -I.. test_timer_overhead.c -S -o timer_ovhd_external.icc.s 14 | 15 | echo "compiling inlined version with icc" 16 | icc -O2 -msse2 -nolib-inline -DUSE_PAUSE -DINLINE_TIMERS -I.. test_timer_overhead.c -qopt-report=5 -o timer_ovhd_inline.icc.exe 17 | mv test_timer_overhead.optrpt test_timer_overhead.inline.optrpt 18 | icc -O2 -msse2 -nolib-inline -DUSE_PAUSE -DINLINE_TIMERS -I.. test_timer_overhead.c -S -o timer_ovhd_inline.icc.s 19 | fi 20 | 21 | which gcc >& /dev/null 22 | if [ $? 
-ne 0 ] 23 | then 24 | echo "GNU gcc compiler not found, skipping...." 25 | else 26 | echo "compiling externally linked version with gcc" 27 | gcc --version 28 | gcc -O2 -msse2 -fno-tree-loop-distribute-patterns -DUSE_PAUSE -c ../low_overhead_timers.c -o low_overhead_timers_gcc.o 29 | gcc -O2 -msse2 -fno-tree-loop-distribute-patterns -DUSE_PAUSE -I.. test_timer_overhead.c low_overhead_timers_gcc.o -o timer_ovhd_external.gcc.exe 30 | gcc -O2 -msse2 -fno-tree-loop-distribute-patterns -DUSE_PAUSE -I.. test_timer_overhead.c -fverbose-asm -S -o timer_ovhd_external.gcc.s 31 | 32 | echo "compiling inlined version with gcc" 33 | gcc -O2 -msse2 -fno-tree-loop-distribute-patterns -DUSE_PAUSE -DINLINE_TIMERS -I.. test_timer_overhead.c -o timer_ovhd_inline.gcc.exe 34 | gcc -O2 -msse2 -fno-tree-loop-distribute-patterns -DUSE_PAUSE -DINLINE_TIMERS -I.. test_timer_overhead.c -fverbose-asm -S -o timer_ovhd_inline.gcc.s 35 | fi 36 | -------------------------------------------------------------------------------- /LowOverheadTimersTests/counter_test_epilog.c: -------------------------------------------------------------------------------- 1 | // Boilerplate code that goes after the sample loop for 2 | // tests that store the full 64 bits of each result. 3 | // 4 | // 1. Collect final values of fixed-function core counters, 5 | // programmable core performance counters, and TSC 6 | // to monitor behavior over the entire sample loop. 7 | // 2. Compute min/max/avg deltas on the individual measurements 8 | // taken in the loop (excluding the first delta, which is 9 | // sometimes slow). 10 | // 3. Compute core utilization and average frequency for the 11 | // entire sample loop. 12 | // 4. Print out statistics on the deltas measured within 13 | // the loop, along with core utilization and avg frequency. 14 | // 5. Print out the deltas of the programmable counters 15 | // over the entire sample loop. 16 | // 17 | // Note that this code makes no attempt to detect or correct 18 | // for wraparound of the fixed or programmable performance 19 | // counters -- either for the entire loop or for the individual 20 | // measurements within the loop! 21 | // 22 | gen_cyc_end = rdpmc_actual_cycles(); 23 | gen_ref_end = rdpmc_reference_cycles(); 24 | tsc_end = rdtscp(); 25 | for (i=0; i>24) construct is used to inhibit replacement 7 | // of this loop by a call to "memset()", which may contain 8 | // unwanted SIMD instructions that change the processor 9 | // power level.) 10 | // 2. Collect initial values of fixed-function core counters, 11 | // programmable core performance counters, and TSC 12 | // to monitor behavior over the entire sample loop. 13 | 14 | for (j=0; j>24); 15 | for (i=0; i>24) construct is used to inhibit replacement 7 | // of this loop by a call to "memset()", which may contain 8 | // unwanted SIMD instructions that change the processor 9 | // power level.) 10 | // 2. Collect initial values of fixed-function core counters, 11 | // programmable core performance counters, and TSC 12 | // to monitor behavior over the entire sample loop. 13 | 14 | for (j=0; j>24); 15 | for (i=0; i> log.test_${MODE}_${COMPILER} 19 | done 20 | echo "" 21 | else 22 | echo "executable ./timer_ovhd_${MODE}.${COMPILER}.exe not found, skipping...." 
23 | fi 24 | done 25 | done 26 | -------------------------------------------------------------------------------- /LowOverheadTimersTests/summarize.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # compute the average of the average values for each of these timer events in a specified output file 4 | # extra cruft to sort the results and discard the slowest one 5 | 6 | for COUNTER in multiline_inline_rdtscp rdtsc inline_rdtsc_64 inline_rdtsc_32 rdtscp inline_rdtscp_64 inline_rdtscp_32 full_rdtscp rdpmc_instructions rdpmc_actual_cycles inline_rdpmc_actual_cycles_64 inline_rdpmc_actual_cycles_32 rdpmc inline_rdpmc_programmable_64 inline_rdpmc_programmable_32 rdpmc_reference_cycles inline_rdpmc_reference_cycles_64 inline_rdpmc_reference_cycles_32 7 | do 8 | echo -n "${COUNTER} " 9 | grep "^${COUNTER} " $1 | awk 'START {max=0} {s+=$5; if($5>max) max=$5} END {print (s-max)/(NR-1)}' 10 | done 11 | 12 | -------------------------------------------------------------------------------- /LowOverheadTimersTests/test_timer_overhead.c: -------------------------------------------------------------------------------- 1 | #define _GNU_SOURCE 2 | #include 3 | #include 4 | #include 5 | #include 6 | #include 7 | #include 8 | #include 9 | #include 10 | 11 | # ifndef MIN 12 | # define MIN(x,y) ((x)<(y)?(x):(y)) 13 | # endif 14 | # ifndef MAX 15 | # define MAX(x,y) ((x)>(y)?(x):(y)) 16 | # endif 17 | 18 | // ----------IMPORTANT ----------- 19 | // Use the INLINE_TIMERS preprocessor variable to determine 20 | // whether the source code to the timers is included here 21 | // or just the headers (for separate compilation and linking). 22 | // 23 | #ifdef INLINE_TIMERS 24 | #include "low_overhead_timers.c" 25 | #else 26 | #include "low_overhead_timers.h" 27 | #endif 28 | 29 | #define inline_rdpmc(hi,low,counter) \ 30 | __asm__ volatile("rdpmc" : "=a" (low), "=d" (hi) : "c" (counter)); 31 | 32 | #define inline_rdtsc(hi,low) \ 33 | __asm__ volatile("rdtsc": "=a" (low), "=d" (hi)); 34 | 35 | #define inline_rdtscp(hi,low,aux) \ 36 | __asm__ volatile("rdtscp": "=a" (low), "=d" (hi), "=c" (aux)); 37 | 38 | # define NUM_CORE_COUNTERS 2 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | #define FATAL(fmt,args...) do { \ 47 | ERROR(fmt, ##args); \ 48 | exit(1); \ 49 | } while (0) 50 | 51 | #define ERROR(fmt,args...) \ 52 | fprintf(stderr, fmt, ##args) 53 | 54 | #define NSAMPLES 64 55 | #define NHALFSAMPLES 256 56 | 57 | unsigned long values64[NSAMPLES]; 58 | unsigned int values32[NSAMPLES]; 59 | unsigned int highlow32[NHALFSAMPLES]; 60 | 61 | 62 | #define INNERTIMES (10000) 63 | #define MIDDLETIMES (1000) 64 | 65 | // This function can be extracted for standalone use.... 66 | // If USE_PAUSE is not defined, it will spin very quickly 67 | // (>1 billion increments per second). 68 | // If USE_PAUSE is defined, it will spin very slowly -- 69 | // between ~4 cycles and ~26 cycles per PAUSE instruction depending 70 | // on the processor model. 
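// A minimal sketch of a spin loop of the general shape just described, using
// the MIDDLETIMES/INNERTIMES bounds defined above (the actual spin_function
// follows below; the name spin_function_sketch and this body are illustrative
// only, not the code used by this test program):
unsigned long spin_function_sketch(unsigned long counter)
{
    int middle, inner;
    for (middle = 0; middle < MIDDLETIMES; middle++) {
        for (inner = 0; inner < INNERTIMES; inner++) {
#ifdef USE_PAUSE
            __asm__ volatile("pause");   // slows each iteration to roughly 4-26 core cycles
#endif
            counter++;                   // without PAUSE, ~1 increment per cycle
        }
    }
    return counter;
}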
71 | unsigned long spin_function(unsigned long initial_counter) 72 | { 73 | int middle, inner; 74 | #pragma nounroll 75 | for (middle=0; middle>12; 57 | *core = c & 0xFFFUL; 58 | 59 | return (a | (d << 32)); 60 | } 61 | 62 | 63 | extern inline __attribute__((always_inline)) int get_core_number() 64 | { 65 | unsigned long a, d, c; 66 | 67 | __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c)); 68 | 69 | return ( c & 0xFFFUL ); 70 | } 71 | 72 | extern inline __attribute__((always_inline)) int get_socket_number() 73 | { 74 | unsigned long a, d, c; 75 | 76 | __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c)); 77 | 78 | return ( (c & 0xF000UL)>>12 ); 79 | } 80 | 81 | 82 | extern inline __attribute__((always_inline)) unsigned long rdpmc_instructions() 83 | { 84 | unsigned long a, d, c; 85 | 86 | c = (1UL<<30); 87 | __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c)); 88 | 89 | return (a | (d << 32)); 90 | } 91 | 92 | extern inline __attribute__((always_inline)) unsigned long rdpmc_actual_cycles() 93 | { 94 | unsigned long a, d, c; 95 | 96 | c = (1UL<<30)+1; 97 | __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c)); 98 | 99 | return (a | (d << 32)); 100 | } 101 | 102 | extern inline __attribute__((always_inline)) unsigned long rdpmc_reference_cycles() 103 | { 104 | unsigned long a, d, c; 105 | 106 | c = (1UL<<30)+2; 107 | __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c)); 108 | 109 | return (a | (d << 32)); 110 | } 111 | 112 | extern inline __attribute__((always_inline)) unsigned long rdpmc(int c) 113 | { 114 | unsigned long a, d; 115 | 116 | __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c)); 117 | 118 | return (a | (d << 32)); 119 | } 120 | 121 | // number of core performance counters per logical processor 122 | // varies by model and mode of operation (HT often splits the 123 | // counters across threads). 124 | // The number of counters per logical processor is contained in 125 | // bits 15:8 of EAX after executing the CPUID instruction 126 | // with an initial EAX value of 0x0a (optional input in ECX is not used). 127 | int get_num_core_counters() 128 | { 129 | unsigned int eax, ebx, ecx, edx; 130 | unsigned int leaf, subleaf; 131 | int width; 132 | 133 | leaf = 0x0000000a; 134 | subleaf = 0x0; 135 | __asm__ __volatile__ ("cpuid" : \ 136 | "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (leaf), "c" (subleaf)); 137 | 138 | return((eax & 0x0000ff00) >> 8); 139 | } 140 | 141 | // core performance counter width varies by processor 142 | // the width is contained in bits 23:16 of the EAX register 143 | // after executing the CPUID instruction with an initial EAX 144 | // argument of 0x0a (subleaf 0x0 in ECX). 145 | int get_core_counter_width() 146 | { 147 | unsigned int eax, ebx, ecx, edx; 148 | unsigned int leaf, subleaf; 149 | int width; 150 | 151 | leaf = 0x0000000a; 152 | subleaf = 0x0; 153 | __asm__ __volatile__ ("cpuid" : \ 154 | "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (leaf), "c" (subleaf)); 155 | 156 | return((eax & 0x00ff0000) >> 16); 157 | } 158 | 159 | // fixed-function performance counter width varies by processor 160 | // the width is contained in bits 12:5 of the EDX register 161 | // after executing the CPUID instruction with an initial EAX 162 | // argument of 0x0a (subleaf 0x0 in ECX). 
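//
// A short usage sketch for the fixed-function counter readers defined above.
// It assumes the fixed counters have already been enabled (IA32_FIXED_CTR_CTRL
// and IA32_PERF_GLOBAL_CTRL) and that user-mode RDPMC is permitted; the helper
// name "measured_ipc" and the function-pointer interface are illustrative only.
static double measured_ipc(void (*work)(void))
{
    unsigned long inst0, cyc0, inst1, cyc1;

    inst0 = rdpmc_instructions();
    cyc0  = rdpmc_actual_cycles();
    work();                                     // region being measured
    inst1 = rdpmc_instructions();
    cyc1  = rdpmc_actual_cycles();

    return (double)(inst1 - inst0) / (double)(cyc1 - cyc0);
}
//
// The routine below implements the fixed-counter-width query described
// in the comment above.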
163 | int get_fixed_counter_width() 164 | { 165 | unsigned int eax, ebx, ecx, edx; 166 | unsigned int leaf, subleaf; 167 | int width; 168 | 169 | leaf = 0x0000000a; 170 | subleaf = 0x0; 171 | __asm__ __volatile__ ("cpuid" : \ 172 | "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx) : "a" (leaf), "c" (subleaf)); 173 | 174 | return((edx & 0x00001fe0) >> 5); 175 | } 176 | 177 | // assume that these functions will automatically do the right thing if they are 178 | // included more than once.... 179 | #include 180 | #include 181 | 182 | // Utility routine to compute counter differences taking into account rollover 183 | // when the performance counter width is not known at compile time. 184 | // Use the "get_counter_width()" function to get the counter width on the 185 | // current system, then use that as the third argument to this function. 186 | // 64-bit counters don't generally roll over, but I added a special case 187 | // for this 188 | unsigned long corrected_pmc_delta(unsigned long end, unsigned long start, int pmc_width) 189 | { 190 | unsigned long error_return=0xffffffffffffffff; 191 | unsigned long result; 192 | // sanity checks 193 | if ((pmc_width <= 0) || (pmc_width > 64)) { 194 | fprintf(stderr,"ERROR: corrected_pmc_delta() called with illegal performance counter width %d\n",pmc_width); 195 | return(error_return); 196 | } 197 | // Due to the specifics of unsigned arithmetic, for pmc_width == sizeof(unsigned long), 198 | // the simple calculation (end-start) gives the correct delta even if the counter has 199 | // rolled (leaving end < start). 200 | if (pmc_width == 64) { 201 | return (end - start); 202 | } else { 203 | // for pmc_width < sizeof(unsigned long), rollover must be detected and corrected explicitly 204 | if (end >= start) { 205 | result = end - start; 206 | } else { 207 | // I think this works independent of ordering, but this makes the most intuitive sense 208 | result = (end + (1UL<0; base--){ 264 | if (buffer[base] == 0x7a) { 265 | // printf("Found z at location %d\n",base); 266 | if (buffer[base-1] == 0x48) { 267 | // printf("Found H at location %d\n",base-1); 268 | if (buffer[base-2] == 0x47) { 269 | // printf("Found G at location %d\n",base-2); 270 | // printf(" -- need to extract string now\n"); 271 | i = base-3; 272 | stop = base-3; 273 | // printf("begin reverse search at stop character location %d\n",i); 274 | while(buffer[i] != 0x20) { 275 | // printf("found a non-blank character %c (%x) at location %d\n",buffer[i],buffer[i],i); 276 | i--; 277 | } 278 | start = i+1; 279 | length = stop - start + 1; 280 | k = length+1; 281 | // for (j=stop; j
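//
// Once the nominal frequency has been recovered from the "GHz" field being
// parsed above, a TSC delta converts to elapsed time by a simple division.
// A minimal sketch, assuming the nominal TSC frequency is available in Hz
// (the helper name and argument convention here are illustrative only):
static double elapsed_seconds(unsigned long tsc_start, unsigned long tsc_end, double tsc_hz)
{
    return (double)(tsc_end - tsc_start) / tsc_hz;
}
//
// Deltas of the programmable counters should be computed with
// corrected_pmc_delta(), using the width reported by get_core_counter_width(),
// e.g.:
//     int pmc_width = get_core_counter_width();
//     unsigned long delta = corrected_pmc_delta(end_count, start_count, pmc_width);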