├── .travis.yml
├── LICENSE
├── README.md
├── tsc.go
├── tsc_amd64.s
├── tsc_test.go
└── tsc_unsupported.go


/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: go
 2 | 
 3 | sudo: false
 4 | 
 5 | go:
 6 |   - 1.4
 7 |   - 1.5
 8 |   - 1.6
 9 |   - tip
10 | 
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | Copyright (c) 2015, David Terei.
 2 | 
 3 | All rights reserved.
 4 | 
 5 | Redistribution and use in source and binary forms, with or without
 6 | modification, are permitted provided that the following conditions
 7 | are met:
 8 | 
 9 | 1. Redistributions of source code must retain the above copyright
10 |    notice, this list of conditions and the following disclaimer.
11 | 
12 | 2. Redistributions in binary form must reproduce the above copyright
13 |    notice, this list of conditions and the following disclaimer in the
14 |    documentation and/or other materials provided with the distribution.
15 | 
16 | 3. Neither the name of the author nor the names of his contributors
17 |    may be used to endorse or promote products derived from this software
18 |    without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE CONTRIBUTORS ``AS IS'' AND ANY EXPRESS
21 | OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
22 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED.  IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR
24 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
26 | OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
27 | HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
28 | STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
29 | ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
30 | POSSIBILITY OF SUCH DAMAGE.
31 | 
32 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # gotsc
  2 | 
  3 | [![Build Status](https://travis-ci.org/dterei/gotsc.svg)](https://travis-ci.org/dterei/gotsc)
  4 | [![Go Report Card](https://goreportcard.com/badge/github.com/dterei/gotsc)](https://goreportcard.com/report/github.com/dterei/gotsc)
  5 | [![godoc](https://godoc.org/github.com/dterei/gotsc?status.svg)](http://godoc.org/github.com/dterei/gotsc)
  6 | [![BSD3 License](http://img.shields.io/badge/license-BSD3-brightgreen.svg?style=flat)][tl;dr Legal: BSD3]
  7 | 
  8 | [tl;dr Legal: BSD3]:
  9 |   https://tldrlegal.com/license/bsd-3-clause-license-(revised)
 10 |   "BSD3 License"
 11 | 
 12 | Golang library for accessing the CPU timestamp cycle counter (TSC) on x86-64. If
 13 | you are not familiar with using the TSC for benchmarking, refer to the
 14 | [Intel whitepaper][intel1]. The library is designed for benchmarking code, so it
 15 | takes steps to prevent the CPU from reordering instructions across measurement
 16 | boundaries.
 17 | 
 18 | Golang 1.4 or later on the x86-64 architecture is currently supported. The
 19 | package will build on other architectures but all functions will simply return
 20 | 0.
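
Because the functions return 0 on unsupported architectures, a caller may want to check the platform before trusting a measurement. The following is a minimal sketch of such a check; `tscSupported` is a hypothetical helper and not part of gotsc, and it assumes the only supported `GOARCH` value is `amd64`, as stated above.

```go
package main

import (
	"fmt"
	"runtime"
)

// tscSupported is a hypothetical helper (not part of gotsc). It reports
// whether the given GOARCH value is the one gotsc implements in assembly.
func tscSupported(goarch string) bool {
	return goarch == "amd64"
}

func main() {
	if !tscSupported(runtime.GOARCH) {
		// On these platforms BenchStart, BenchEnd and TSCOverhead return 0.
		fmt.Println("no TSC support on", runtime.GOARCH, "; cycle counts would all be 0")
		return
	}
	fmt.Println("TSC-backed measurements available")
}
```

Passing the architecture in as a string, rather than reading `runtime.GOARCH` inside the helper, keeps the check easy to exercise on any machine.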
 21 | 
 22 | ## Usage
 23 | 
 24 | ``` .go
 25 | package main
 26 | 
 27 | import (
 28 |   "fmt"
 29 |   "github.com/dterei/gotsc"
 30 | )
 31 | 
 32 | const N = 100
 33 | 
 34 | func main() {
 35 |   tsc := gotsc.TSCOverhead()
 36 |   fmt.Println("TSC Overhead:", tsc)
 37 | 
 38 |   start := gotsc.BenchStart()
 39 |   for i := 0; i < N; i++ {
 40 |     // code to evaluate
 41 |   }
 42 |   end := gotsc.BenchEnd()
 43 |   avg := (end - start - tsc) / N
 44 | 
 45 |   fmt.Println("Cycles:", avg)
 46 | }
 47 | ```
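
One detail worth guarding against in the arithmetic above: all values are `uint64`, so `end - start - tsc` wraps around to a huge number if the measured delta happens to be smaller than the overhead. Below is a minimal sketch of a clamped version. `cyclesPerIter` is a hypothetical helper, not part of gotsc, and the counter values in `main` are made up for illustration.

```go
package main

import "fmt"

// cyclesPerIter is a hypothetical helper (not part of gotsc). It removes the
// measured TSC overhead from a raw cycle delta and averages over n
// iterations, clamping at zero so the unsigned subtraction cannot wrap.
func cyclesPerIter(start, end, overhead, n uint64) uint64 {
	if n == 0 || end <= start {
		return 0
	}
	delta := end - start
	if delta <= overhead {
		return 0
	}
	return (delta - overhead) / n
}

func main() {
	// Made-up counter values standing in for BenchStart/BenchEnd results.
	fmt.Println(cyclesPerIter(1000, 1500, 40, 100)) // (500-40)/100 with integer division
	fmt.Println(cyclesPerIter(1000, 1020, 40, 100)) // delta below overhead, clamped to 0
}
```

Clamping to zero is a design choice for this sketch; returning an error instead may be preferable when a too-small delta should be treated as a failed measurement.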
 48 | 
 49 | ## Compared with time.Now()
 50 | 
 51 | There are two advantages over the standard golang `time.Now()` function:
 52 | 
 53 | 1. Measurement is in cycles - for many situations cycle count is a more
 54 |    informative number than wall-clock time.
 55 | 2. Careful use of CPU serializing instructions to ensure no code you are
 56 |    benchmarking is moved outside the timed region, and no code you aren't
 57 |    benchmarking is moved into it.
 58 | 
 59 | Claim (2) may be a little contentious, so see below. For benchmarking with the
 60 | TSC, we use the approach suggested by [Intel][intel1]:
 61 | 
 62 | ``` .asm
 63 | cpuid
 64 | rdtsc
 65 | // code to benchmark
 66 | rdtscp
 67 | cpuid
 68 | ```
 69 | 
 70 | ## Reading the TSC
 71 | 
 72 | There appears to be considerable confusion over the correct and best way to
 73 | read the TSC when benchmarking code. The most obvious and naive approach would
 74 | be:
 75 | 
 76 | ``` .asm
 77 | // code before
 78 | rdtsc
 79 | // code to benchmark
 80 | rdtsc
 81 | // code after
 82 | ```
 83 | 
 84 | But `rdtsc` doesn't prevent instructions being reordered by the CPU around it.
 85 | Thus, code before and after could move into the benchmarked region, while code
 86 | in the benchmarked region could move out.
 87 | 
 88 | The best Intel documentation on this suggests the following approach:
 89 | 
 90 | ``` .asm
 91 | // code before
 92 | cpuid
 93 | rdtsc
 94 | // code to benchmark
 95 | rdtscp
 96 | cpuid
 97 | // code after
 98 | ```
 99 | 
100 | The `cpuid` instruction is a full barrier, preventing reordering in both
101 | directions, while `rdtscp` prevents reordering from above. We use `rdtscp` at
102 | the end rather than `cpuid; rdtsc` as `cpuid` is an expensive instruction with
103 | high variance, so we want it outside the benchmarked region.
104 | 
105 | Ideally our benchmarking approach provides the following:
106 | 
107 | 1. Low variance for instructions involved in retrieving start and end TSC so
108 |    that we can subtract their overhead from the measurement with more
109 |    confidence.
110 | 2. Low cost to read the TSC so that we can take benchmarks as often as possible
111 |    without affecting application performance.
112 | 3. High resolution so that we can measure the cost of very small sets of
113 |    instructions.
114 | 
115 | The recommended Intel approach provides (1) and (3) but the use of `cpuid` is
116 | fairly expensive. The Intel SDM suggests that the `lfence` instruction can be
117 | used as an alternative to `cpuid`, while AMD suggests the use of `mfence`.
118 | 
119 | Linux takes this approach. [Originally][lxr1] ([LKML][lkml1]), it used an
120 | `lfence` on either side, the thinking being that `lfence` only prevents
121 | reordering from above:
122 | 
123 | ``` .asm
124 | // Linux kernel TSC usage (circa 2008)
125 | lfence
126 | rdtsc
127 | lfence
128 | ```
129 | 
130 | This was [later][lxr2] ([LKML][lkml2]) 'optimized' to just use one `lfence`
131 | before `rdtsc`:
132 | 
133 | ``` .asm
134 | // Linux kernel TSC usage (circa 2011+)
135 | lfence
136 | rdtsc
137 | ```
138 | 
139 | The kernel developer behind the older approach (`lfence` on both sides) appears
140 | to object to this optimization as 'unsafe' because it is a barrier in only one
141 | direction. The 'modern' thinking appears to be that while this is technically
142 | true, a microprocessor would never take advantage of this reordering---there is
143 | no performance reason to do so.
144 | 
145 | The Akaros project [investigated][arakos] a number of alternative approaches
146 | (including all the above issues), eventually taking the modern Linux approach
147 | and suggesting the following:
148 | 
149 | ``` .asm
150 | // code before
151 | lfence
152 | rdtsc
153 | // code to benchmark
154 | lfence
155 | rdtsc
156 | // code after
157 | ```
158 | 
159 | Finally, for complete reference, an older [Intel guide][intel2] to using
160 | `rdtsc` (pre `rdtscp` days) suggests that you 'warm up' the `cpuid` and `rdtsc`
161 | instructions a few times before benchmarking the code:
162 | 
163 | ``` .asm
164 | // code before
165 | 
166 | // warmup
167 | cpuid
168 | rdtsc
169 | cpuid
170 | rdtsc
171 | cpuid
172 | rdtsc
173 | 
174 | cpuid
175 | rdtsc
176 | // code to benchmark
177 | cpuid
178 | rdtsc
179 | // code after
180 | ```
181 | 
182 | It's not clear if this is valuable any more when we have `rdtscp` to avoid
183 | including the highly variable `cpuid` in our measurement region.
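
For callers who still want to experiment with warming up, the idea can be combined with taking the minimum of repeated samples, which is how `TSCOverhead` picks its value. The sketch below assumes a caller-supplied sampling function. `minSample` is a hypothetical helper and not part of gotsc, and the sample values are fabricated stand-ins for `BenchEnd() - BenchStart()` deltas.

```go
package main

import "fmt"

// minSample is a hypothetical helper (not part of gotsc). It calls sample
// warmup times and discards the results, then returns the smallest of the
// next n results. With n == 0 it returns the largest uint64.
func minSample(warmup, n int, sample func() uint64) uint64 {
	for i := 0; i < warmup; i++ {
		sample()
	}
	best := ^uint64(0) // largest uint64, so any real sample replaces it
	for i := 0; i < n; i++ {
		if v := sample(); v < best {
			best = v
		}
	}
	return best
}

func main() {
	// Fake measurements; the first two are inflated, as cold instructions
	// might be, and are consumed by the warm-up calls.
	fake := []uint64{900, 400, 60, 48, 52, 47, 55}
	i := 0
	sample := func() uint64 {
		v := fake[i%len(fake)]
		i++
		return v
	}
	fmt.Println(minSample(2, 5, sample)) // smallest of 60, 48, 52, 47, 55
}
```

Taking the minimum rather than the mean discards samples inflated by interrupts or cache misses, at the cost of reporting a best case rather than a typical one.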
184 | 
185 | This is a very confusing situation. We keep it simple and stick with the
186 | recommendation from Intel. This works well, but is a little more expensive due
187 | to the `cpuid` calls compared to alternatives. However, it also appears to be
188 | the 'safest', ensuring accurate measurements. For very frequent calls to the
189 | TSC when benchmarking is not your goal, the standard Go `time.Now()` call is
190 | very fast, essentially being `lfence; rdtsc`.
191 | 
192 | ## Converting Cycles to Time
193 | 
194 | To convert from cycles to wall-clock time we need to know TSC frequency.
195 | Frequency scaling on modern Intel chips doesn't affect the TSC.
196 | 
197 | Sadly, the only way to determine the TSC frequency appears to be through an MSR
198 | using the `rdmsr` instruction. This instruction is privileged and can't be
199 | executed from user-space.
200 | 
201 | If we could, we would want to read `MSR_PLATFORM_INFO`:
202 | 
203 | > Register Name: MSR_PLATFORM_INFO [15:8]
204 | > Description: Package Maximum Non-Turbo Ratio (R/O)
205 | >              The is the ratio of the frequency that invariant TSC runs at.
206 | >              Frequency = ratio * 100 MHz.
207 | 
208 | The multiplicative factor of `100 MHz` varies across architectures. Luckily, it
209 | appears to be `100 MHz` on all Intel architectures except Nehalem, for which it
210 | is `133.3 MHz`.
211 | 
212 | If this method fails or is unavailable, Linux appears to determine the TSC
213 | clock speed through a [calibration][lxr3] against hardware timers.
214 | 
215 | For now, we don't provide the ability to convert cycles to time.
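
If the raw `MSR_PLATFORM_INFO` value were obtained some other way (for example by a privileged tool), the conversion described above could be sketched as follows. None of these functions exist in gotsc; the MSR value and the 100 MHz base clock used in `main` are illustrative assumptions, and the base clock must be adjusted per architecture as noted above.

```go
package main

import (
	"fmt"
	"time"
)

// nonTurboRatio extracts bits [15:8] of a raw MSR_PLATFORM_INFO value.
// Hypothetical helper; reading the MSR itself is not shown.
func nonTurboRatio(msr uint64) uint64 {
	return (msr >> 8) & 0xFF
}

// tscHz applies Frequency = ratio * base clock, with baseHz in Hz.
func tscHz(ratio, baseHz uint64) uint64 {
	return ratio * baseHz
}

// cyclesToDuration converts a cycle count to wall-clock time at hz cycles
// per second. Whole seconds and the remainder are handled separately so the
// intermediate product stays within uint64 for realistic frequencies.
func cyclesToDuration(cycles, hz uint64) time.Duration {
	if hz == 0 {
		return 0
	}
	secs := cycles / hz
	rem := cycles % hz
	return time.Duration(secs)*time.Second +
		time.Duration(rem*uint64(time.Second)/hz)
}

func main() {
	const baseHz = 100 * 1000 * 1000 // assumed 100 MHz base clock
	msr := uint64(0x1A00)            // made-up value whose bits [15:8] hold ratio 26
	hz := tscHz(nonTurboRatio(msr), baseHz)
	fmt.Println("TSC frequency (Hz):", hz)
	fmt.Println("2600 cycles:", cyclesToDuration(2600, hz))
}
```

With the made-up ratio of 26 the frequency comes out at 2.6 GHz, so 2600 cycles correspond to one microsecond.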
216 | 
217 | ## Licensing
218 | 
219 | This library is BSD-licensed.
220 | 
221 | ## Get involved!
222 | 
223 | We are happy to receive bug reports, fixes, documentation enhancements,
224 | and other improvements.
225 | 
226 | Please report bugs via the
227 | [github issue tracker](http://github.com/dterei/gotsc/issues).
228 | 
229 | Master [git repository](http://github.com/dterei/gotsc):
230 | 
231 | * `git clone git://github.com/dterei/gotsc.git`
232 | 
233 | ## Authors
234 | 
235 | This library is written and maintained by David Terei, <code@davidterei.com>.
236 | 
237 | [intel1]: http://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
238 | [intel2]: https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf
239 | [lxr1]: http://lxr.free-electrons.com/source/include/asm-x86/system.h?v=2.6.25#L403
240 | [lkml1]: https://lkml.org/lkml/2008/1/7/276
241 | [lxr2]: http://lxr.free-electrons.com/source/arch/x86/include/asm/msr.h#L168
242 | [lkml2]: https://lkml.org/lkml/2011/5/10/297
243 | [arakos]: http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c
244 | [lxr3]: http://lxr.free-electrons.com/source/arch/x86/kernel/tsc.c#L670
245 | 
246 | 


--------------------------------------------------------------------------------
/tsc.go:
--------------------------------------------------------------------------------
 1 | // Copyright 2016 David Terei.  All rights reserved.
 2 | // Use of this source code is governed by a BSD-style
 3 | // license that can be found in the LICENSE file.
 4 | //
 5 | // +build amd64
 6 | 
 7 | // Package gotsc provides access to the timestamp cycle counter on x86-64 for
 8 | // performing close to cycle accurate benchmarking on x86-64. All recent (think
 9 | // since 2010) generation Intel CPUs provide a global, synchronized cycle
10 | // counter great for benchmarking and time measurement across all cores.
11 | //
12 | // On non x86-64 platforms, all functions simply return 0 at this time.
13 | //
14 | package gotsc
15 | 
16 | // BenchStart obtains the cycle counter. It should be used at the start of
17 | // benchmarking some code.
18 | func BenchStart() uint64
19 | 
20 | // BenchEnd obtains the cycle counter. It should be used at the end of
21 | // benchmarking some code. There is a subtle difference in the allowed
22 | // reordering of operations between BenchEnd and BenchStart, hence the two
23 | // functions rather than one.
24 | func BenchEnd() uint64
25 | 
26 | // TSCOverhead measures the cycle overhead of calling the underlying `rdtsc`
27 | // instruction to obtain the current CPU cycle count. You should subtract this
28 | // value from all your cycle count measurements to accurately benchmark some
29 | // code.
30 | func TSCOverhead() uint64 {
31 | 	var t0, t1 uint64
32 | 	overhead := uint64(1000000000000000000)
33 | 
34 | 	for i := 0; i < 100000; i++ {
35 | 		t0 = BenchStart()
36 | 		t1 = BenchEnd()
37 | 		if t1-t0 < overhead {
38 | 			overhead = t1 - t0
39 | 		}
40 | 	}
41 | 
42 | 	return overhead
43 | }
44 | 


--------------------------------------------------------------------------------
/tsc_amd64.s:
--------------------------------------------------------------------------------
 1 | // Copyright 2016 David Terei.  All rights reserved.
 2 | // Use of this source code is governed by a BSD-style
 3 | // license that can be found in the LICENSE file.
 4 | 
 5 | #include "textflag.h"
 6 | 
 7 | // func BenchStart() uint64
 8 | TEXT ·BenchStart(SB),NOSPLIT,$0-8
 9 | 	CPUID
10 | 	RDTSC
11 | 	SHLQ	$32, DX
12 | 	ADDQ	DX, AX
13 | 	MOVQ	AX, ret+0(FP)
14 | 	RET
15 | 
16 | // func BenchEnd() uint64
17 | TEXT ·BenchEnd(SB),NOSPLIT,$0-8
18 | 	BYTE	$0x0F // RDTSCP
19 | 	BYTE	$0x01
20 | 	BYTE	$0xF9
21 | 	SHLQ	$32, DX
22 | 	ADDQ	DX, AX
23 | 	MOVQ	AX, ret+0(FP)
24 | 	CPUID
25 | 	RET
26 | 


--------------------------------------------------------------------------------
/tsc_test.go:
--------------------------------------------------------------------------------
 1 | // Copyright 2016 David Terei.  All rights reserved.
 2 | // Use of this source code is governed by a BSD-style
 3 | // license that can be found in the LICENSE file.
 4 | //
 5 | // +build amd64
 6 | 
 7 | package gotsc
 8 | 
 9 | import (
10 | 	"testing"
11 | 	"time"
12 | )
13 | 
14 | const (
15 | 	TSC_LOW  = 10
16 | 	TSC_HIGH = 200
17 | )
18 | 
19 | func TestTSCOverhead(t *testing.T) {
20 | 	tsc := TSCOverhead()
21 | 	if tsc < 10 || tsc > 100 {
22 | 		t.Errorf("TSC Overhead returned number outside expected range: %d\n", tsc)
23 | 	}
24 | }
25 | 
26 | func TestBench(t *testing.T) {
27 | 	tsc := TSCOverhead()
28 | 	start := BenchStart()
29 | 	end := BenchEnd()
30 | 
31 | 	delta := end - start // raw delta; overhead is compared below to avoid uint64 wrap
32 | 	if start > end {
33 | 		t.Error("BenchEnd() earlier than BenchStart()")
34 | 	} else if delta > tsc+TSC_HIGH {
35 | 		t.Errorf("BenchEnd() - BenchStart() far greater than TSC overhead: %d\n",
36 | 			delta-tsc)
37 | 	}
38 | }
39 | 
40 | func BenchmarkTime(b *testing.B) {
41 | 	for i := 0; i < b.N; i++ {
42 | 		time.Now()
43 | 	}
44 | }
45 | 
46 | func BenchmarkBenchStart(b *testing.B) {
47 | 	for i := 0; i < b.N; i++ {
48 | 		BenchStart()
49 | 	}
50 | }
51 | 
52 | func BenchmarkBenchEnd(b *testing.B) {
53 | 	for i := 0; i < b.N; i++ {
54 | 		BenchEnd()
55 | 	}
56 | }
57 | 


--------------------------------------------------------------------------------
/tsc_unsupported.go:
--------------------------------------------------------------------------------
 1 | // Copyright 2016 David Terei.  All rights reserved.
 2 | // Use of this source code is governed by a BSD-style
 3 | // license that can be found in the LICENSE file.
 4 | //
 5 | // +build !amd64
 6 | 
 7 | // Package gotsc provides access to the timestamp cycle counter on x86-64 for
 8 | // performing close to cycle accurate benchmarking on x86-64. All recent (think
 9 | // since 2010) generation Intel CPUs provide a global, synchronized cycle
10 | // counter great for benchmarking and time measurement across all cores.
11 | //
12 | // On non x86-64 platforms, all functions simply return 0 at this time.
13 | //
14 | package gotsc
15 | 
16 | // BenchStart obtains the cycle counter. It should be used at the start of
17 | // benchmarking some code.
18 | func BenchStart() uint64 {
19 | 	return 0
20 | }
21 | 
22 | // BenchEnd obtains the cycle counter. It should be used at the end of
23 | // benchmarking some code. There is a subtle difference in the allowed
24 | // reordering of operations between BenchEnd and BenchStart, hence the two
25 | // functions rather than one.
26 | func BenchEnd() uint64 {
27 | 	return 0
28 | }
29 | 
30 | // TSCOverhead measures the cycle overhead of calling the underlying `rdtsc`
31 | // instruction to obtain the current CPU cycle count. You should subtract this
32 | // value from all your cycle count measurements to accurately benchmark some
33 | // code.
34 | func TSCOverhead() uint64 {
35 | 	return 0
36 | }
37 | 


--------------------------------------------------------------------------------