├── .travis.yml
├── LICENSE
├── README.md
├── tsc.go
├── tsc_amd64.s
├── tsc_test.go
└── tsc_unsupported.go

/.travis.yml:
--------------------------------------------------------------------------------
language: go

sudo: false

go:
  - 1.4
  - 1.5
  - 1.6
  - tip

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2015, David Terei.

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.

3. Neither the name of the author nor the names of his contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE CONTRIBUTORS ``AS IS'' AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED.
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# gotsc

[![Build Status](https://travis-ci.org/dterei/gotsc.svg)](https://travis-ci.org/dterei/gotsc)
[![Go Report Card](https://goreportcard.com/badge/github.com/dterei/gotsc)](https://goreportcard.com/report/github.com/dterei/gotsc)
[![godoc](https://godoc.org/github.com/dterei/gotsc?status.svg)](http://godoc.org/github.com/dterei/gotsc)
[![BSD3 License](http://img.shields.io/badge/license-BSD3-brightgreen.svg?style=flat)][tl;dr Legal: BSD3]

[tl;dr Legal: BSD3]:
  https://tldrlegal.com/license/bsd-3-clause-license-(revised)
  "BSD3 License"

Golang library for accessing the CPU timestamp cycle counter (TSC) on x86-64.
If you are not familiar with using the TSC for benchmarking, refer to the
[Intel whitepaper][intel1]. The library is designed for benchmarking code, so
it takes steps to prevent the CPU from reordering instructions across
measurement boundaries.

Golang 1.4 or later on the x86-64 architecture is currently supported. The
package will build on other architectures, but all functions will simply return
0.

## Usage

``` .go
package main

import (
	"fmt"
	"github.com/dterei/gotsc"
)

const N = 100

func main() {
	tsc := gotsc.TSCOverhead()
	fmt.Println("TSC Overhead:", tsc)

	start := gotsc.BenchStart()
	for i := 0; i < N; i++ {
		// code to evaluate
	}
	end := gotsc.BenchEnd()
	avg := (end - start - tsc) / N

	fmt.Println("Cycles:", avg)
}
```

## Compared with time.Now()

There are two advantages over the standard golang `time.Now()` function:

1. Measurement is in cycles - for many situations cycle count is a more
   informative number than wall-clock time.
2. Careful use of CPU serializing instructions to ensure no code you are
   benchmarking is moved outside the timed region, and no code you aren't
   benchmarking is moved into it.

Claim (2) may be a little contentious, so see below. For benchmarking with the
TSC, we use the approach suggested by [Intel][intel1]:

``` .asm
cpuid
rdtsc
// code to benchmark
rdtscp
cpuid
```

## Reading the TSC

There appears to be a lot of confusion about the correct and best way to read
the TSC when benchmarking code. The most obvious and naive approach would be:

``` .asm
// code before
rdtsc
// code to benchmark
rdtsc
// code after
```

But `rdtsc` doesn't prevent the CPU from reordering instructions around it.
Thus, code before and after could move into the benchmarked region, while code
in the benchmarked region could move out.

The best Intel documentation on this suggests the following approach:

``` .asm
// code before
cpuid
rdtsc
// code to benchmark
rdtscp
cpuid
// code after
```

The `cpuid` instruction is a full barrier, preventing reordering in both
directions, while `rdtscp` prevents reordering from above. We use `rdtscp` at
the end rather than `cpuid; rdtsc` as `cpuid` is an expensive instruction with
high variance, so we want it outside the benchmarked region.

Ideally our benchmarking approach provides the following:

1. Low variance for instructions involved in retrieving start and end TSC so
   that we can subtract their overhead from the measurement with more
   confidence.
2. Low cost to read the TSC so that we can take benchmarks as often as possible
   without affecting application performance.
3. High resolution so that we can measure the cost of very small sets of
   instructions.

The recommended Intel approach provides (1) and (3) but falls short on (2), as
`cpuid` is fairly expensive. The Intel SDM suggests that the `lfence`
instruction can be used as an alternative to `cpuid`, while AMD suggests the
use of `mfence`.

Linux takes the fence-based approach. [Originally][lxr1] ([LKML][lkml1]), it
used an `lfence` on either side of `rdtsc`, on the reasoning that `lfence` only
prevents reordering from above:

``` .asm
// Linux kernel TSC usage (circa 2008)
lfence
rdtsc
lfence
```

This was [later][lxr2] ([LKML][lkml2]) 'optimized' to just use one `lfence`
before `rdtsc`:

``` .asm
// Linux kernel TSC usage (circa 2011+)
lfence
rdtsc
```

The kernel developer behind the older approach, with `lfence` on both sides,
appears to object to this optimization as 'unsafe' because it is a barrier in
only one direction.
The 'modern' thinking appears to be that while this is technically
true, a microprocessor would never take advantage of this reordering, since
there is no performance reason to do so.

The Akaros project [investigated][arakos] a number of alternative approaches
(including all of the issues above), eventually adopting the modern Linux
approach and suggesting the following:

``` .asm
// code before
lfence
rdtsc
// code to benchmark
lfence
rdtsc
// code after
```

Finally, for complete reference, an older [Intel guide][intel2] to using
`rdtsc` (pre `rdtscp` days) suggests that you 'warm up' the `cpuid` and `rdtsc`
instructions a few times before benchmarking the code:

``` .asm
// code before

// warmup
cpuid
rdtsc
cpuid
rdtsc
cpuid
rdtsc

cpuid
rdtsc
// code to benchmark
cpuid
rdtsc
// code after
```

It's not clear if this is valuable any more when we have `rdtscp` to avoid
including the highly variable `cpuid` in our measurement region.

This is a very confusing situation. We keep it simple and stick with the
recommendation from Intel. This works well, but is a little more expensive
than the alternatives due to the `cpuid` calls. However, it also appears to be
the 'safest', ensuring accurate measurements. For very frequent calls to the
TSC when benchmarking is not your goal, the standard Go `time.Now()` call is
very fast, essentially being `lfence; rdtsc`.

## Converting Cycles to Time

To convert from cycles to wall-clock time we need to know the TSC frequency.
Frequency scaling on modern Intel chips doesn't affect the TSC.

Sadly, the only way to determine the TSC frequency appears to be through an MSR
using the `rdmsr` instruction.
This instruction is privileged and can't be
executed from user-space.

If we could, we would want to read `MSR_PLATFORM_INFO`:

> Register Name: MSR_PLATFORM_INFO [15:8]
> Description: Package Maximum Non-Turbo Ratio (R/O)
> The is the ratio of the frequency that invariant TSC runs at.
> Frequency = ratio * 100 MHz.

The multiplicative factor of `100 MHz` varies across architectures. Luckily, it
appears to be `100 MHz` on all Intel architectures except Nehalem, for which it
is `133.3 MHz`.

If this method fails or is unavailable, Linux appears to determine the TSC
clock speed through a [calibration][lxr3] against hardware timers.

For now, we don't provide the ability to convert cycles to time.

## Licensing

This library is BSD-licensed.

## Get involved!

We are happy to receive bug reports, fixes, documentation enhancements,
and other improvements.

Please report bugs via the
[GitHub issue tracker](http://github.com/dterei/gotsc/issues).

Master [git repository](http://github.com/dterei/gotsc):

* `git clone git://github.com/dterei/gotsc.git`

## Authors

This library is written and maintained by David Terei.


[intel1]: http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
[intel2]: https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
[lxr1]: http://lxr.free-electrons.com/source/include/asm-x86/system.h?v=2.6.25#L403
[lkml1]: https://lkml.org/lkml/2008/1/7/276
[lxr2]: http://lxr.free-electrons.com/source/arch/x86/include/asm/msr.h#L168
[lkml2]: https://lkml.org/lkml/2011/5/10/297
[arakos]: http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c
[lxr3]: http://lxr.free-electrons.com/source/arch/x86/kernel/tsc.c#L670

--------------------------------------------------------------------------------
/tsc.go:
--------------------------------------------------------------------------------
// Copyright 2016 David Terei. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//
// +build amd64

// Package gotsc provides access to the timestamp cycle counter on x86-64 for
// performing close to cycle-accurate benchmarking. All recent (think since
// 2010) Intel CPUs provide a global, synchronized cycle counter that is great
// for benchmarking and time measurement across all cores.
//
// On non x86-64 platforms, all functions simply return 0 at this time.
//
package gotsc

// BenchStart obtains the cycle counter. It should be used at the start of
// benchmarking some code.
func BenchStart() uint64

// BenchEnd obtains the cycle counter. It should be used at the end of
// benchmarking some code. There is a subtle difference in the allowed
// reordering of operations between BenchEnd and BenchStart, hence the two
// functions rather than one.
func BenchEnd() uint64

// TSCOverhead measures the cycle overhead of calling the underlying `rdtsc`
// instruction to obtain the current CPU cycle count. You should subtract this
// value from all your cycle count measurements to accurately benchmark some
// code.
func TSCOverhead() uint64 {
	var t0, t1 uint64
	overhead := uint64(1000000000000000000)

	for i := 0; i < 100000; i++ {
		t0 = BenchStart()
		t1 = BenchEnd()
		if t1-t0 < overhead {
			overhead = t1 - t0
		}
	}

	return overhead
}

--------------------------------------------------------------------------------
/tsc_amd64.s:
--------------------------------------------------------------------------------
// Copyright 2016 David Terei. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

#include "textflag.h"

// func BenchStart() uint64
TEXT ·BenchStart(SB),NOSPLIT,$0-8
	CPUID              // serialize before reading the counter
	RDTSC              // counter is returned in DX:AX (high:low)
	SHLQ $32, DX       // move the high half into bits 63..32
	ADDQ DX, AX        // combine into one 64-bit value
	MOVQ AX, ret+0(FP)
	RET

// func BenchEnd() uint64
TEXT ·BenchEnd(SB),NOSPLIT,$0-8
	BYTE $0x0F // RDTSCP
	BYTE $0x01
	BYTE $0xF9
	SHLQ $32, DX
	ADDQ DX, AX
	MOVQ AX, ret+0(FP)
	CPUID              // serialize after reading the counter
	RET

--------------------------------------------------------------------------------
/tsc_test.go:
--------------------------------------------------------------------------------
// Copyright 2016 David Terei. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//
// +build amd64

package gotsc

import (
	"testing"
	"time"
)

const (
	TSC_LOW  = 10
	TSC_HIGH = 200
)

func TestTSCOverhead(t *testing.T) {
	tsc := TSCOverhead()
	if tsc < 10 || tsc > 100 {
		t.Errorf("TSC Overhead returned number outside expected range: %d\n", tsc)
	}
}

func TestBench(t *testing.T) {
	tsc := TSCOverhead()
	start := BenchStart()
	end := BenchEnd()

	if start > end {
		t.Fatal("BenchEnd() earlier than BenchStart()")
	}

	// Clamp at zero: this pair of reads can be cheaper than the measured
	// overhead, and an unsigned subtraction would wrap around.
	var delta uint64
	if elapsed := end - start; elapsed > tsc {
		delta = elapsed - tsc
	}
	if delta > TSC_HIGH {
		t.Errorf("BenchEnd() - BenchStart() far greater than TSC overhead: %d\n",
			delta)
	}
}

func BenchmarkTime(b *testing.B) {
	for i := 0; i < b.N; i++ {
		time.Now()
	}
}

func BenchmarkBenchStart(b *testing.B) {
	for i := 0; i < b.N; i++ {
		BenchStart()
	}
}

func BenchmarkBenchEnd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		BenchEnd()
	}
}

--------------------------------------------------------------------------------
/tsc_unsupported.go:
--------------------------------------------------------------------------------
// Copyright 2016 David Terei. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
//
// +build !amd64

// Package gotsc provides access to the timestamp cycle counter on x86-64 for
// performing close to cycle-accurate benchmarking. All recent (think since
// 2010) Intel CPUs provide a global, synchronized cycle counter that is great
// for benchmarking and time measurement across all cores.
//
// On non x86-64 platforms, all functions simply return 0 at this time.
//
package gotsc

// BenchStart obtains the cycle counter.
// It should be used at the start of
// benchmarking some code.
func BenchStart() uint64 {
	return 0
}

// BenchEnd obtains the cycle counter. It should be used at the end of
// benchmarking some code. There is a subtle difference in the allowed
// reordering of operations between BenchEnd and BenchStart, hence the two
// functions rather than one.
func BenchEnd() uint64 {
	return 0
}

// TSCOverhead measures the cycle overhead of calling the underlying `rdtsc`
// instruction to obtain the current CPU cycle count. You should subtract this
// value from all your cycle count measurements to accurately benchmark some
// code.
func TSCOverhead() uint64 {
	return 0
}

--------------------------------------------------------------------------------