4 | consistent hashing
5 |
6 |
8 |
9 |
10 |
11 |
14 |
15 |
16 |
--------------------------------------------------------------------------------
/consistent-hashing/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## consistent hashing in go
4 |
5 | ---
6 |
7 | # a problem
8 |
9 | - I have a set of keys and values
10 |
11 | --
12 |
13 | - I have some servers for a k/v store (memcache, redis, mysql)
14 |
15 | --
16 |
17 | - I want to distribute the keys over the servers so I can find them again
18 |
19 | --
20 |
21 | - *without* having to store a global directory
22 |
23 | ---
24 |
25 | # mod-N hashing
26 |
27 | - hash(key) % N
28 |
29 | --
30 | - easy
31 |
32 | - fast (except for the modulo operator)
33 | --
34 |
35 | - but other things will be slower
36 | - and watch out for slow hash functions (sha1...)
37 |
38 | --
39 |
40 | - changing N means almost every key will map somewhere else
41 |
42 | --
43 |
44 | - what would be optimal?
45 | - when adding/removing servers, only 1/nth of the keys should move
46 | - don't move any keys that don't need to move
47 |
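A rough way to see the "almost every key moves" claim is to count it. The sketch below assumes FNV-1a as the hash and synthetic `key-N` names; both are illustrative choices, not something the talk prescribes.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a key to one of n servers with plain mod-N hashing.
// FNV-1a is an assumption of this sketch; any reasonable hash would do.
func bucket(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

// movedFraction reports the share of numKeys synthetic keys whose bucket
// changes when the server count goes from oldN to newN.
func movedFraction(numKeys int, oldN, newN uint32) float64 {
	moved := 0
	for i := 0; i < numKeys; i++ {
		key := fmt.Sprintf("key-%d", i)
		if bucket(key, oldN) != bucket(key, newN) {
			moved++
		}
	}
	return float64(moved) / float64(numKeys)
}

func main() {
	// Going from 10 to 11 servers: ideally about 1/11 of the keys would move.
	fmt.Printf("moved: %.1f%%\n", 100*movedFraction(100000, 10, 11))
}
```

For a uniform hash, `h % 10 == h % 11` holds for only about 1 key in 11, so roughly 10/11 of the keys should be reported as moved.
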
48 | ---
49 | # consistent hashing
50 |
51 | - 1997: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web
52 |
53 | --
54 |
55 | - 2007: Ketama (last.fm)
56 | - 2007: Amazon's Dynamo paper
57 |
58 | --
59 |
60 | - standard scaling technique
61 | - Cassandra
62 | - Riak
63 | - basically every distributed system that needs to distribute load over servers
64 |
65 | ---
66 |
67 | # ring-based hashing
68 |
69 | - points on a circle
70 |
71 | --
72 |
73 | - add point for each `hash(node_i)`
74 | - find the first `node_i` such that `hash(key) <= hash(node_i)`
75 |
76 | --
77 |
78 | - vnodes: each node appears multiple times
79 | - reduces variance
80 | - allows weighting
81 |
82 | ---
87 |
88 |
89 | # ring-based hashing -- insert (groupcache)
90 |
91 | ```go
92 | func (m *Map) Add(keys ...string) {
93 | 	for _, key := range keys {
94 | 		for i := 0; i < m.replicas; i++ {
95 | 			hash := int(m.hash([]byte(strconv.Itoa(i) + "_" + key)))
96 | 			m.keys = append(m.keys, hash)
97 | 			m.hashMap[hash] = key
98 | 		}
99 | 	}
100 | 	sort.Ints(m.keys)
101 | }
102 | ```
103 |
104 | ---
105 |
106 | # ring-based hashing -- lookup (groupcache)
107 |
108 | ```go
109 | func (m *Map) Get(key string) string {
110 | 	hash := int(m.hash([]byte(key)))
111 |
112 | 	idx := sort.Search(len(m.keys), func(i int) bool { return m.keys[i] >= hash })
113 |
114 | 	if idx == len(m.keys) {
115 | 		idx = 0
116 | 	}
117 |
118 | 	return m.hashMap[m.keys[idx]]
119 | }
120 | ```
121 |
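To see what the ring buys us, here is a minimal self-contained sketch in the same shape as the insert and lookup code above. The `ring` type, `countMoves`, the use of `crc32.ChecksumIEEE`, and the `key-N` names are all assumptions made for illustration; this is not the groupcache package itself.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
	"strconv"
)

// ring is a cut-down, hypothetical stand-in for a ring-based consistent hash.
type ring struct {
	replicas int
	points   []int          // sorted vnode hashes
	owner    map[int]string // vnode hash -> node name
}

func newRing(replicas int) *ring {
	return &ring{replicas: replicas, owner: make(map[int]string)}
}

// add places `replicas` points on the circle for every node.
func (r *ring) add(nodes ...string) {
	for _, n := range nodes {
		for i := 0; i < r.replicas; i++ {
			h := int(crc32.ChecksumIEEE([]byte(strconv.Itoa(i) + "_" + n)))
			r.points = append(r.points, h)
			r.owner[h] = n
		}
	}
	sort.Ints(r.points)
}

// get returns the owner of the first point at or after hash(key),
// wrapping around to the first point on the circle.
func (r *ring) get(key string) string {
	h := int(crc32.ChecksumIEEE([]byte(key)))
	idx := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if idx == len(r.points) {
		idx = 0
	}
	return r.owner[r.points[idx]]
}

// countMoves adds newNode and reports how many of numKeys synthetic keys
// changed owner, and how many of those went anywhere other than newNode.
func countMoves(r *ring, newNode string, numKeys int) (moved, elsewhere int) {
	before := make([]string, numKeys)
	for i := range before {
		before[i] = r.get("key-" + strconv.Itoa(i))
	}
	r.add(newNode)
	for i, old := range before {
		if now := r.get("key-" + strconv.Itoa(i)); now != old {
			moved++
			if now != newNode {
				elsewhere++
			}
		}
	}
	return moved, elsewhere
}

func main() {
	r := newRing(100)
	r.add("a", "b", "c")
	moved, elsewhere := countMoves(r, "d", 10000)
	fmt.Printf("moved=%d of 10000, moved between old nodes=%d\n", moved, elsewhere)
}
```

The second number should always be zero: a key either stays put or moves to the newly added node.
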
122 | ---
123 |
124 | # ketama (last.fm)
125 |
126 | - memcache client
127 | - needed for compatibility at $WORK
128 |
129 | --
130 |
131 | ```c
132 | unsigned int k_limit = floorf(pct * 40.0 * ketama->numbuckets);
133 | ```
134 |
135 | --
136 | ```c
137 | float floorf(float x);
138 | unsigned int numbuckets;
139 | float pct;
140 | ```
141 |
142 | --
143 |
144 | ```go
145 | limit := int(float32(float64(pct) * 40.0 * float64(numbuckets)))
146 | ```
147 |
148 | --
149 |
150 | - immediately wrote libchash which doesn't depend on floating point round-off error
151 |
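Why the odd double conversion in the Go line? In the C expression the product is computed as a `double`, then rounded to `float` on the way into `floorf`. A small sketch of the difference follows; the function names and the input values (`pct = 0.35`, two buckets) are hypothetical, picked only because they expose the rounding.

```go
package main

import "fmt"

// limitLikeC mirrors the C expression: multiply in double precision,
// round to float32 (the floorf argument), then truncate to an integer.
// Truncation equals floor here because the value is non-negative.
func limitLikeC(pct float32, numbuckets uint) int {
	return int(float32(float64(pct) * 40.0 * float64(numbuckets)))
}

// limitFloat64 skips the float32 rounding step.
func limitFloat64(pct float32, numbuckets uint) int {
	return int(float64(pct) * 40.0 * float64(numbuckets))
}

func main() {
	pct := float32(0.35) // hypothetical server weight
	// The exact product is 28 - 2^-21; rounding to float32 lands on 28.
	fmt.Println(limitLikeC(pct, 2), limitFloat64(pct, 2)) // prints "28 27"
}
```

An off-by-one in the number of points is enough to place keys on different servers than the C client would, which is the compatibility problem this slide is about.
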
152 | ---
153 |
154 | # "are we done?" OR "why is this still a research topic?"
155 |
156 | --
157 |
158 | - unequal load distribution
159 | - 100 points per server gives stddev of ~10%
160 | - 99% CI for bucket size: (0.76, 1.28)
161 | - 1000 points per server gives ~3.2% stddev
162 | - 99% CI for bucket size: (0.92, 1.09)
163 |
164 | --
165 |
166 | - memory requirements for 1000 shards
167 | - 4MB of data
168 | - O(log n) search times (for n=1e6)
169 | - all of which are cache misses
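
The numbers above can be sanity-checked with some back-of-the-envelope arithmetic. This sketch assumes 4-byte points (which is what makes 1000 shards with 1000 points each come out at 4MB) and uses a 1/sqrt(points) rule of thumb that happens to agree with the stddev figures on this slide; neither is taken from a specific implementation.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// approxStddev is a 1/sqrt(points) rule of thumb for the relative
// standard deviation of bucket sizes. It is an assumption of this sketch.
func approxStddev(pointsPerServer int) float64 {
	return 1 / math.Sqrt(float64(pointsPerServer))
}

// ringBytes is the size of the sorted point array alone.
func ringBytes(shards, pointsPerShard, bytesPerPoint int) int {
	return shards * pointsPerShard * bytesPerPoint
}

// searchSteps returns ceil(log2(n)), roughly the number of probes a
// binary search needs over n points.
func searchSteps(n uint) int {
	return bits.Len(n - 1)
}

func main() {
	fmt.Printf("stddev: %.1f%% (100 points), %.1f%% (1000 points)\n",
		100*approxStddev(100), 100*approxStddev(1000))
	fmt.Println("ring bytes:", ringBytes(1000, 1000, 4))
	fmt.Println("search steps for 1e6 points:", searchSteps(1000000))
}
```
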
170 | ---
171 |
172 | # jump hash
173 |
174 | - Google, 2014: https://arxiv.org/abs/1406.2294
175 |
176 | --
177 | - but actually, Google 2011 in Guava
178 |
179 | --
180 |
181 | - no memory overhead
182 | - even key distribution
183 | - stddev 0.000000764%, 99% CI for bucket size: (0.99999998, 1.00000002)
184 |
185 | - *fast*
186 |
187 | ---
188 | # jump hash
189 |
190 | ```go
191 | func Hash(key uint64, numBuckets int) int32 {
192 | 	var b int64 = -1
193 | 	var j int64
194 |
195 | 	for j < int64(numBuckets) {
196 | 		b = j
197 | 		key = key*2862933555777941757 + 1
198 | 		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
199 | 	}
200 |
201 | 	return int32(b)
202 | }
203 | ```
204 |
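A quick check of the consistency property: growing from n to n+1 buckets should only ever move a key into the new bucket n. The sketch below repeats `Hash` from the slide so it runs on its own; `movedOnGrow` and the sequential integer keys are made up for the demonstration.

```go
package main

import "fmt"

// Hash is the jump hash function shown on the slide above.
func Hash(key uint64, numBuckets int) int32 {
	var b int64 = -1
	var j int64

	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}

	return int32(b)
}

// movedOnGrow counts the keys in [0, numKeys) that change bucket when the
// bucket count grows from n to n+1, and how many of those land anywhere
// other than the new bucket n.
func movedOnGrow(numKeys uint64, n int) (moved, elsewhere int) {
	for k := uint64(0); k < numKeys; k++ {
		before, after := Hash(k, n), Hash(k, n+1)
		if before != after {
			moved++
			if after != int32(n) {
				elsewhere++
			}
		}
	}
	return moved, elsewhere
}

func main() {
	moved, elsewhere := movedOnGrow(100000, 10)
	fmt.Printf("moved=%d of 100000, moved between old buckets=%d\n", moved, elsewhere)
}
```

The second number is always zero, and the first should be close to 1/11 of the keys.
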
205 | ---
206 |
207 | # jump hash -- downsides
208 |
209 | - doesn't support arbitrary bucket names
210 |
211 | --
212 |
213 | - can only add/remove buckets at the end
214 |
215 | --
216 |
217 | - only applicable for data storage cases
218 |
219 | ---
220 |
221 | # "are we done?" OR "why is this still a research topic?" (2)
222 |
223 | --
224 |
225 | - can't use arbitrary bucket names
226 |
227 | - how to get low variance without the memory overhead
228 |
229 | ---
230 |
231 | # multi-probe consistent hashing
232 |
233 | - Google, 2015: https://arxiv.org/abs/1505.00062
234 |
235 | --
236 |
237 | - O(n) space but O(k) lookup time
238 | - sneaky data structure to prevent O(k log n)
239 |
240 | --
241 |
242 | - instead of multiple points on the circle, hash each node once but hash the key k times
243 |
244 | - for low variance ("1.05 peak-to-mean ratio")
245 | - k = 21 lookups
246 | - versus J = 700 ln n points on the circle for ring-based hashing
247 |
248 | ---
253 |
254 | # others
255 |
256 | - rendezvous hashing (1997, https://www.eecs.umich.edu/techreports/cse/96/CSE-TR-316-96.pdf )
257 | - SPOCA (2011, https://www.usenix.org/legacy/event/atc11/tech/final_files/Chawla.pdf )
258 | - maglev hashing (2016, http://research.google.com/pubs/pub44824.html )
259 |
260 | ---
261 |
262 | # source code
263 |
264 | - https://github.com/dgryski/go-ketama
265 | - https://github.com/dgryski/libchash / https://github.com/golang/groupcache
266 | - https://github.com/dgryski/go-jump
267 | - https://github.com/dgryski/go-mpchash
268 | - https://github.com/dgryski/go-maglev
269 |
270 | ---
271 | class: center, middle, inverse
272 |
273 | ## fin
274 |
275 | ???
276 |
277 | vim: ft=markdown
278 |
--------------------------------------------------------------------------------
/dmrgo/DroidSerif.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/dmrgo/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/dmrgo/remark-latest.min.js:
--------------------------------------------------------------------------------
1 | ../remarkjs/remark-latest.min.js
--------------------------------------------------------------------------------
/dmrgo/slides.css:
--------------------------------------------------------------------------------
1 | /* Slideshow styles */
2 |
3 | @font-face {
4 | font-family: 'Droid Serif';
5 | font-style: normal;
6 | font-weight: 400;
7 | src: local('Droid Serif'), local('DroidSerif'), url(DroidSerif.ttf) format('truetype');
8 | }
9 |
10 |
11 | @font-face {
12 | font-family: 'Yanone Kaffeesatz';
13 | font-style: normal;
14 | font-weight: 400;
15 | src: local('Yanone Kaffeesatz Regular'), local('YanoneKaffeesatz-Regular'), url(YanoneKaffeesatz-Regular.ttf) format('truetype');
16 | }
17 |
18 | body { font-family: 'Droid Serif'; font-size: 1.5em; }
19 | h1, h2, h3 {
20 | font-family: 'Yanone Kaffeesatz';
21 | font-weight: 400;
22 | margin-bottom: 0;
23 | }
24 | h1 { font-size: 3em; }
25 | h2 { font-size: 2em; }
26 | h3 { font-size: 1.5em; }
27 | .footnote {
28 | position: absolute;
29 | bottom: 3em;
30 | }
31 | li p { line-height: 1.25em; }
32 | .red { color: #fa0000; }
33 | .large { font-size: 2em; }
34 | a, a > code {
35 | color: rgb(249, 38, 114);
36 | text-decoration: none;
37 | }
38 | .inverse {
39 | background: #272822;
40 | color: #777872;
41 | text-shadow: 0 0 20px #333;
42 | }
43 | .inverse h1, .inverse h2 {
44 | color: #f3f3f3;
45 | line-height: 0.8em;
46 | }
47 |
--------------------------------------------------------------------------------
/dmrgo/slides.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | dmrgo
5 |
6 |
8 |
9 |
10 |
11 |
12 |
20 |
21 |
22 |
--------------------------------------------------------------------------------
/dmrgo/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## dmrgo
4 |
5 | ---
6 | ## Overview
7 |
8 | - MapReduce
9 |
10 | - dmrgo
11 |
12 | ---
13 |
14 | ## MapReduce
15 |
16 | - MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat, 2004)
17 |
18 | - Google has moved beyond it.
19 |
20 | - Hadoop hasn't.
21 |
22 |
23 | ---
24 | ## WordCount (input)
25 |
26 | ```
27 | hello howdy world everybody hello everybody howdy howdy
28 | ```
29 |
30 | ---
31 | ## WordCount (map)
32 |
33 | ```
34 | hello 1
35 | howdy 1
36 | world 1
37 | everybody 1
38 | hello 1
39 | everybody 1
40 | howdy 1
41 | howdy 1
42 | ```
43 |
44 | ---
45 | ## WordCount (shuffle)
46 |
47 | ```
48 | everybody 1
49 | everybody 1
50 | hello 1
51 | hello 1
52 | howdy 1
53 | howdy 1
54 | howdy 1
55 | world 1
56 | ```
57 |
58 | ---
59 | ## WordCount (reduce)
60 |
61 | ```
62 | everybody 2
63 | hello 2
64 | howdy 3
65 | world 1
66 | ```
67 |
68 | ---
69 | ## Streaming Hadoop
70 |
71 | - stdin/stdout
72 |
73 | - TSV
74 |
75 | ---
76 | ## dmrgo
77 |
78 | - statically typed Java API
79 |
80 | - dynamically typed Python library mrjob ( https://github.com/Yelp/mrjob )
81 |
82 | ---
83 |
84 | ```go
85 |
86 | type Emitter interface {
87 | 	Emit(key string, value string)
88 | 	Flush()
89 | }
90 |
91 | type MapReduceJob interface {
92 | 	Map(key string, value string, emitter Emitter)
93 |
94 | 	// Called at the end of the Map phase
95 | 	MapFinal(emitter Emitter)
96 |
97 | 	Reduce(key string, values []string, emitter Emitter)
98 | }
99 | ```
100 |
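To connect the interface back to the WordCount slides, here is a sketch of a job that implements it, driven by a tiny in-memory runner. The two interfaces are copied from the slide; `wordCount`, `memEmitter`, and `runLocal` are hypothetical names for this example and are not part of dmrgo, which is meant to run under Hadoop streaming.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

type Emitter interface {
	Emit(key string, value string)
	Flush()
}

type MapReduceJob interface {
	Map(key string, value string, emitter Emitter)
	MapFinal(emitter Emitter)
	Reduce(key string, values []string, emitter Emitter)
}

// wordCount emits (word, "1") per word and sums the ones per key.
type wordCount struct{}

func (wordCount) Map(key, value string, e Emitter) {
	for _, w := range strings.Fields(value) {
		e.Emit(w, "1")
	}
}

func (wordCount) MapFinal(e Emitter) {}

func (wordCount) Reduce(key string, values []string, e Emitter) {
	total := 0
	for _, v := range values {
		n, _ := strconv.Atoi(v) // values come from Map, so they parse
		total += n
	}
	e.Emit(key, strconv.Itoa(total))
}

// memEmitter collects emitted values per key in memory.
type memEmitter struct{ data map[string][]string }

func (m *memEmitter) Emit(k, v string) { m.data[k] = append(m.data[k], v) }
func (m *memEmitter) Flush()           {}

// runLocal runs map, shuffle (group and sort by key), and reduce in-process
// and returns tab-separated "key\tvalue" lines.
func runLocal(job MapReduceJob, lines []string) []string {
	mapped := &memEmitter{data: map[string][]string{}}
	for i, line := range lines {
		job.Map(strconv.Itoa(i), line, mapped)
	}
	job.MapFinal(mapped)

	keys := make([]string, 0, len(mapped.data))
	for k := range mapped.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	reduced := &memEmitter{data: map[string][]string{}}
	for _, k := range keys {
		job.Reduce(k, mapped.data[k], reduced)
	}

	var out []string
	for _, k := range keys {
		for _, v := range reduced.data[k] {
			out = append(out, k+"\t"+v)
		}
	}
	return out
}

func main() {
	input := []string{"hello howdy world everybody hello everybody howdy howdy"}
	fmt.Println(strings.Join(runLocal(wordCount{}, input), "\n"))
}
```

Run on the input from the WordCount slides, this yields the same four lines as the reduce slide: everybody 2, hello 2, howdy 3, world 1.
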
101 | ---
102 |
103 | class: center, middle, inverse
104 |
105 | ## Questions?
106 |
107 | ---
108 |
109 | class: center, middle, inverse
110 |
111 | ## fin
112 |
113 | ???
114 |
115 | vim: ft=markdown
116 |
--------------------------------------------------------------------------------
/dotgo-2016/append/bench_test.go:
--------------------------------------------------------------------------------
1 | package main
2 |
3 | import (
4 | "strconv"
5 | "testing"
6 | )
7 |
8 | var sink int
9 |
10 | func benchmarkGrow(b *testing.B, n int) {
11 | for i := 0; i < b.N; i++ {
12 | var s []int
13 | for j := 0; j < n; j++ {
14 | s = append(s, j)
15 | }
16 | sink += len(s)
17 | }
18 | }
19 |
20 | func benchmarkAllocate(b *testing.B, n int) {
21 | for i := 0; i < b.N; i++ {
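22 | s := make([]int, 0, n) // preallocate capacity only; make([]int, n) starts at length n, so append would still grow the slice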
23 | for j := 0; j < n; j++ {
24 | s = append(s, j)
25 | }
26 | sink += len(s)
27 | }
28 | }
29 |
30 | func BenchmarkGrow(b *testing.B) {
31 | for i := 10; i < 200; i += 10 {
32 | b.Run(strconv.Itoa(i), func(b *testing.B) { benchmarkGrow(b, i) })
33 | }
34 | }
35 |
36 | func BenchmarkAllocate(b *testing.B) {
37 | for i := 10; i < 2000; i += 10 {
38 | b.Run(strconv.Itoa(i), func(b *testing.B) { benchmarkAllocate(b, i) })
39 | }
40 | }
41 |
--------------------------------------------------------------------------------
/dotgo-2016/append/main.go:
--------------------------------------------------------------------------------
1 | package main
2 |
3 | func main() {
4 |
5 | var old struct {
6 | len int
7 | cap int
8 | }
9 |
10 | var cap int
11 |
12 | // START OMIT
13 | newcap := old.cap
14 | doublecap := newcap + newcap
15 | if cap > doublecap {
16 | newcap = cap
17 | } else {
18 | if old.len < 1024 {
19 | newcap = doublecap
20 | } else {
21 | for newcap < cap {
22 | newcap += newcap / 4
23 | }
24 | }
25 | }
26 |
27 | // END OMIT
28 | }
29 |
--------------------------------------------------------------------------------
/dotgo-2016/cacheline/cache.go:
--------------------------------------------------------------------------------
1 | package main
2 |
3 | import (
4 | "fmt"
5 | "time"
6 | )
7 |
8 | func mult3() {
9 | arr := make([]int32, 128*1024*1024)
10 |
11 | for j := 0; j < 20; j++ {
12 | for i := 0; i < len(arr); i++ {
13 | arr[i] *= 3
14 | }
15 | }
16 |
17 | // START OMIT
18 | for _, j := range []int{1, 2, 3, 4, 5, 6, 7, 8,
19 | 9, 10, 11, 12, 13, 14, 15, 16,
20 | 32, 64, 128, 256, 512, 1024} {
21 | for c := 0; c < 20; c++ {
22 | t0 := time.Now()
23 | for i := 0; i < len(arr); i += j {
24 | arr[i] *= 3
25 | }
26 | fmt.Println(j, time.Since(t0))
27 | }
28 | }
29 | // END OMIT
30 | }
31 |
32 | func multRand() {
33 | arr := make([]int32, 128*1024*1024)
34 |
35 | for i := 0; i < len(arr); i++ {
36 | arr[i] *= 3
37 | }
38 |
39 | for _, j := range []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64, 128, 256, 512, 1024} {
40 | for c := 0; c < 10; c++ {
41 | t0 := time.Now()
42 | for i := 0; i < len(arr); i += j {
43 | arr[i] *= int32(xorshift32(uint32(i)))
44 | }
45 | fmt.Println(j, time.Since(t0))
46 | }
47 | }
48 | }
49 |
50 | func xorshift32(y uint32) uint32 {
51 | y ^= (y << 13)
52 | y ^= (y >> 17)
53 | y ^= (y << 5)
54 | return y
55 | }
56 |
57 | func main() {
58 | mult3()
59 | }
60 |
--------------------------------------------------------------------------------
/dotgo-2016/cacheline/cache.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/cacheline/cache.png
--------------------------------------------------------------------------------
/dotgo-2016/cacheline/plot.sh:
--------------------------------------------------------------------------------
1 | set terminal png size 1920,1440 crop enhanced font "/usr/share/fonts/truetype/times.ttf,30" dashlength 2;
2 | set termoption linewidth 3;
3 |
4 | set output "cache.png";
5 | set title "";
6 | set ylabel "milliseconds"
7 | set xlabel "step size";
8 | set logscale x 2;
9 | plot "c.out" title "";
10 |
--------------------------------------------------------------------------------
/dotgo-2016/images/BigO.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/images/BigO.png
--------------------------------------------------------------------------------
/dotgo-2016/images/Part6-CPU-and-RAM-speeds-563x300.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/images/Part6-CPU-and-RAM-speeds-563x300.jpg
--------------------------------------------------------------------------------
/dotgo-2016/images/latency.txt:
--------------------------------------------------------------------------------
1 | Latency Comparison Numbers
2 | --------------------------
3 | L1 cache reference 0.5 ns // HL
4 | Branch mispredict 5 ns
5 | L2 cache reference 7 ns 14x L1 cache // HL
6 | Mutex lock/unlock 25 ns
7 | Main memory reference 100 ns 20x L2 cache, 200x L1 cache // HL
8 | Compress 1K bytes with Zippy 3,000 ns 3 us
9 | Send 1K bytes over 1 Gbps network 10,000 ns 10 us
10 | Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
11 | Read 1 MB sequentially from memory 250,000 ns 250 us
12 | Round trip within same datacenter 500,000 ns 500 us
13 | Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
14 | Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
15 | Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
16 | Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
17 |
--------------------------------------------------------------------------------
/dotgo-2016/insert/insert-log.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/insert/insert-log.png
--------------------------------------------------------------------------------
/dotgo-2016/insert/insert.go:
--------------------------------------------------------------------------------
1 | package insert
2 |
3 | import (
4 | "math/rand"
5 | "sort"
6 | )
7 |
8 | type node struct {
9 | next *node
10 | val int
11 | }
12 |
13 | func insertList(n int) *node {
14 | rand.Seed(0)
15 |
16 | var root *node
17 | for i := 0; i < n; i++ {
18 | x := rand.Intn(n)
19 |
20 | var prev *node
21 | var curr = root
22 | for curr != nil && curr.val < x {
23 | prev, curr = curr, curr.next
24 | }
25 |
26 | nn := &node{next: curr, val: x}
27 |
28 | if prev != nil {
29 | prev.next = nn
30 | } else {
31 | root = nn
32 | }
33 | }
34 |
35 | return root
36 | }
37 |
38 | func deleteList(root *node, size int) {
39 |
40 | for root != nil {
41 | i := rand.Intn(size)
42 | var prev *node
43 | var curr = root
44 | for j := 0; j < i; j++ {
45 | prev, curr = curr, curr.next
46 | }
47 | if prev != nil {
48 | prev.next = curr.next
49 | } else {
50 | root = curr.next
51 | }
52 |
53 | size--
54 | }
55 | }
56 |
57 | func insertSlice(n int) []int {
58 | rand.Seed(0)
59 |
60 | var s []int
61 | for i := 0; i < n; i++ {
62 | x := rand.Intn(n)
63 | idx := sort.Search(len(s), func(j int) bool { return s[j] >= x })
64 | s = append(s, 0)
65 | copy(s[idx+1:], s[idx:])
66 | s[idx] = x
67 | }
68 |
69 | return s
70 | }
71 |
72 | func deleteSlice(s []int) {
73 | for len(s) > 0 {
74 | i := rand.Intn(len(s))
75 | s = s[:i+copy(s[i:], s[i+1:])]
76 | }
77 | }
78 |
--------------------------------------------------------------------------------
/dotgo-2016/insert/insert.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/insert/insert.png
--------------------------------------------------------------------------------
/dotgo-2016/insert/insert_test.go:
--------------------------------------------------------------------------------
1 | package insert
2 |
3 | import (
4 | "strconv"
5 | "testing"
6 | )
7 |
8 | func TestSlice(t *testing.T) {
9 | v := insertSlice(10)
10 | t.Log(v)
11 | deleteSlice(v)
12 | }
13 |
14 | func benchmarkSlice(b *testing.B, n int) {
15 | for bn := 0; bn < b.N; bn++ {
16 | s := insertSlice(n)
17 | deleteSlice(s)
18 | }
19 | }
20 |
21 | func TestList(t *testing.T) {
22 | const size = 10
23 | v := insertList(size)
24 | for n := v; n != nil; n = n.next {
25 | t.Log(n.val)
26 | }
27 | deleteList(v, size)
28 | }
29 |
30 | func benchmarkList(b *testing.B, n int) {
31 | for bn := 0; bn < b.N; bn++ {
32 | r := insertList(n)
33 | deleteList(r, n)
34 | }
35 | }
36 |
37 | func BenchmarkSlice(b *testing.B) {
38 | sizes := []int{10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000}
39 | for _, n := range sizes {
40 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkSlice(b, n) })
41 | }
42 | }
43 |
44 | func BenchmarkList(b *testing.B) {
45 | sizes := []int{10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000}
46 | for _, n := range sizes {
47 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkList(b, n) })
48 | }
49 | }
50 |
--------------------------------------------------------------------------------
/dotgo-2016/insert/plot.sh:
--------------------------------------------------------------------------------
1 |
2 | set terminal png size 1920,1440 crop enhanced font "/usr/share/fonts/truetype/times.ttf,30" dashlength 2;
3 | set termoption linewidth 3;
4 | set output "insert.png";
5 |
6 | set ylabel "milliseconds"
7 | set xlabel "elements"
8 |
9 | set logscale x;
10 |
11 | plot "bench.out" using 1:2 title "slice" with lines, "bench.out" using 1:3 title "list" with lines;
12 |
--------------------------------------------------------------------------------
/dotgo-2016/map/list-vs-map1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/map/list-vs-map1.png
--------------------------------------------------------------------------------
/dotgo-2016/map/list-vs-map2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/dotgo-2016/map/list-vs-map2.png
--------------------------------------------------------------------------------
/dotgo-2016/map/membership.go:
--------------------------------------------------------------------------------
1 | package membership
2 |
3 | import (
4 | "sort"
5 | )
6 |
7 | type sliceSet32 []uint32
8 |
9 | func (s sliceSet32) search(n uint32) bool {
10 | for _, v := range s {
11 | if v == n {
12 | return true
13 | }
14 | }
15 | return false
16 | }
17 |
18 | type sortedSet32 []uint32
19 |
20 | func (s sortedSet32) Len() int { return len(s) }
21 | func (s sortedSet32) Less(i, j int) bool { return s[i] < s[j] }
22 | func (s sortedSet32) Swap(i, j int) { s[i], s[j] = s[j], s[i] }
23 |
24 | func (s sortedSet32) linear(n uint32) bool {
25 | for _, v := range s {
26 | if v > n {
27 | return false
28 | }
29 | if v == n {
30 | return true
31 | }
32 | }
33 | return false
34 | }
35 |
36 | func (s sortedSet32) binary(n uint32) bool {
37 | i := sort.Search(len(s), func(i int) bool { return s[i] >= n })
38 | return i < len(s) && s[i] == n
39 | }
40 |
41 | func (s sortedSet32) inlined(n uint32) bool {
42 |
43 | low, high := 0, len(s)
44 |
45 | for low < high {
46 | mid := low + (high-low)/2
47 | if s[mid] >= n {
48 | high = mid
49 | } else {
50 | low = mid + 1
51 | }
52 | }
53 |
54 | return low < len(s) && s[low] == n
55 | }
56 |
57 | type mapSet32 map[uint32]bool
58 |
59 | func (m mapSet32) search(n uint32) bool {
60 | _, ok := m[n]
61 | return ok
62 | }
63 |
--------------------------------------------------------------------------------
/dotgo-2016/map/membership_test.go:
--------------------------------------------------------------------------------
1 | package membership
2 |
3 | import (
4 | "sort"
5 | "strconv"
6 | "testing"
7 | )
8 |
9 | func testSorted(t *testing.T, n int) {
10 | s := sliceSet32(make([]uint32, n))
11 | sorted := sortedSet32(make([]uint32, n))
12 | for i := 0; i < n; i++ {
13 | needle := xorshift32(uint32(i))
14 | s[i] = uint32(needle)
15 | sorted[i] = uint32(needle)
16 | }
17 |
18 | sort.Sort(sorted)
19 |
20 | for i := 0; i < 1000*n; i++ {
21 | needle := xorshift32(uint32(i))
22 | found := s.search(needle)
23 | s1 := sorted.linear(needle)
24 | s2 := sorted.binary(needle)
25 | s3 := sorted.inlined(needle)
26 | if found != s1 || found != s2 || found != s3 {
27 | t.Fatalf("mismatch for %v: found=%v s1=%v s2=%v s3=%v", needle, found, s1, s2, s3)
28 | }
29 | }
30 | }
31 |
32 | func TestSorted(t *testing.T) {
33 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200}
34 | for _, n := range sizes {
35 | t.Run(strconv.Itoa(n), func(t *testing.T) { testSorted(t, n) })
36 | }
37 | }
38 |
39 | var found int
40 |
41 | func benchmarkSlice32(b *testing.B, n int) {
42 | s := sliceSet32(make([]uint32, n))
43 | for i := 0; i < n; i++ {
44 | s[i] = xorshift32(uint32(i))
45 | }
46 | b.ResetTimer()
47 | for i := 0; i < b.N; i++ {
48 | needle := xorshift32(uint32(i))
49 | if s.search(needle) {
50 | found++
51 | }
52 | }
53 | }
54 |
55 | func benchmarkSortedLinear32(b *testing.B, n int) {
56 | s := sortedSet32(make([]uint32, n))
57 | for i := 0; i < n; i++ {
58 | s[i] = xorshift32(uint32(i))
59 | }
60 | sort.Sort(s)
61 | b.ResetTimer()
62 | for i := 0; i < b.N; i++ {
63 | needle := xorshift32(uint32(i))
64 | if s.linear(needle) {
65 | found++
66 | }
67 | }
68 | }
69 |
70 | func benchmarkSortedBinary32(b *testing.B, n int) {
71 | s := sortedSet32(make([]uint32, n))
72 | for i := 0; i < n; i++ {
73 | s[i] = xorshift32(uint32(i))
74 | }
75 | sort.Sort(s)
76 | b.ResetTimer()
77 | for i := 0; i < b.N; i++ {
78 | needle := xorshift32(uint32(i))
79 | if s.binary(needle) {
80 | found++
81 | }
82 | }
83 | }
84 |
85 | func benchmarkSortedInlined32(b *testing.B, n int) {
86 | s := sortedSet32(make([]uint32, n))
87 | for i := 0; i < n; i++ {
88 | s[i] = xorshift32(uint32(i))
89 | }
90 | sort.Sort(s)
91 | b.ResetTimer()
92 | for i := 0; i < b.N; i++ {
93 | needle := xorshift32(uint32(i))
94 | if s.inlined(needle) {
95 | found++
96 | }
97 | }
98 | }
99 |
100 | func benchmarkMap32(b *testing.B, n int) {
101 | m := mapSet32(make(map[uint32]bool))
102 | for i := 0; i < n; i++ {
103 | m[xorshift32(uint32(i))] = true
104 | }
105 | b.ResetTimer()
106 | for i := 0; i < b.N; i++ {
107 | needle := xorshift32(uint32(i))
108 | if m.search(needle) {
109 | found++
110 | }
111 | }
112 | }
113 |
114 | func xorshift32(y uint32) uint32 {
115 | y ^= (y << 13)
116 | y ^= (y >> 17)
117 | y ^= (y << 5)
118 | return y
119 | }
120 |
121 | func BenchmarkSlice32(b *testing.B) {
122 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
123 | for _, n := range sizes {
124 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkSlice32(b, n) })
125 | }
126 | }
127 |
128 | func BenchmarkSortedLinear32(b *testing.B) {
129 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
130 | for _, n := range sizes {
131 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkSortedLinear32(b, n) })
132 | }
133 | }
134 |
135 | func BenchmarkSortedBinary32(b *testing.B) {
136 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
137 | for _, n := range sizes {
138 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkSortedBinary32(b, n) })
139 | }
140 | }
141 |
142 | func BenchmarkSortedInlined32(b *testing.B) {
143 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000, 3200, 3400, 3600, 3800, 4000, 4500, 5000, 10000}
144 | for _, n := range sizes {
145 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkSortedInlined32(b, n) })
146 | }
147 | }
148 |
149 | func BenchmarkMap32(b *testing.B) {
150 | var sizes = []int{1, 2, 3, 4, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, 350, 400, 500, 600, 700, 800, 900, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000, 3200, 3400, 3600, 3800, 4000, 4500, 5000, 10000}
151 | for _, n := range sizes {
152 | b.Run(strconv.Itoa(n), func(b *testing.B) { benchmarkMap32(b, n) })
153 | }
154 | }
155 |
--------------------------------------------------------------------------------
/dotgo-2016/map/plot1.sh:
--------------------------------------------------------------------------------
1 | set terminal png size 1920,1440 crop enhanced font "/usr/share/fonts/truetype/times.ttf,30" dashlength 2;
2 | set termoption linewidth 3;
3 |
4 | set output "list-vs-map1.png";
5 | set ylabel "milliseconds"
6 | set xlabel "elements"
7 | plot "linear.out" title "linear" with lines, "map-small.out" title "map" with lines;
8 |
--------------------------------------------------------------------------------
/dotgo-2016/map/plot2.sh:
--------------------------------------------------------------------------------
1 | set terminal png size 1920,1440 crop enhanced font "/usr/share/fonts/truetype/times.ttf,30" dashlength 2;
2 | set termoption linewidth 3;
3 |
4 | set output "list-vs-map2.png";
5 | set ylabel "milliseconds"
6 | set xlabel "elements"
7 | plot "inlined.out" title "binary search" with lines, "map.out" title "map" with lines;
8 |
--------------------------------------------------------------------------------
/dotgo-2016/sizeof/main.go:
--------------------------------------------------------------------------------
1 | package main
2 |
3 | import (
4 | "fmt"
5 | "reflect"
6 | "unsafe"
7 | )
8 |
9 | type s struct {
10 | b bool
11 | i int
12 | }
13 |
14 | func main() {
15 |
16 | slice := make([]s, 20)
17 |
18 | // START OMIT
19 |
20 | slice14 := &slice[14]
21 |
22 | sh := (*reflect.SliceHeader)(unsafe.Pointer(&slice))
23 | off14 := sh.Data + 14*unsafe.Sizeof(slice[0])
24 |
25 | // END OMIT
26 |
27 | fmt.Printf("slice14 = %x\n", uintptr(unsafe.Pointer(slice14)))
28 | fmt.Printf("off14 = %x\n", off14)
29 | }
30 |
--------------------------------------------------------------------------------
/dotgo-2016/slices.slide:
--------------------------------------------------------------------------------
1 | Slices
2 | Performance through cache friendliness
3 |
4 | dotGo
5 | 10 Oct 2016
6 |
7 | Damian Gryski
8 | @dgryski
9 | https://github.com/dgryski
10 |
11 | * Video
12 |
13 | .link https://www.youtube.com/watch?v=jEG4Qyo_4Bc Watch this talk on YouTube
14 |
15 | * Slices
16 |
17 | reflect/value.go
18 |
19 | type SliceHeader struct {
20 | Data uintptr
21 | Len int
22 | Cap int
23 | }
24 |
25 | : Today I'm going to talk about how slices work with CPU caches and how to take advantage of this when writing performant code.
26 |
27 | : This is a slice. A pointer to a block of memory, a length field, and a total capacity.
28 |
29 | : The length field tells us how many entries are valid. The capacity is how much has been allocated. Both these numbers are element counts and need to be scaled by the size of each slice element to be turned into bytes.
30 |
31 | * Sizeof
32 |
33 | .play sizeof/main.go /START OMIT/,/END OMIT/
34 |
35 | : We can get this with unsafe.Sizeof(). Then we can calculate exactly where each element is in memory:
36 |
37 | : This is what the compiler does every time you access an element. These are bounds checked to make sure we don't access invalid memory. The SSA backend works to remove these checks where possible, and the branch predictor in your processor will quickly learn they all pass. So in the end they have very little cost.
38 |
39 | * Append
40 |
41 | runtime/slice.go
42 | .code append/main.go /START OMIT/,/END OMIT/
43 |
44 | : You can also append to a slice. If there's space left, that is, the new length fits within the capacity, it'll get slotted in. Otherwise a new chunk of memory is allocated with a larger capacity, the old slice is copied over, and the new element is added.
45 |
46 | : The new capacity is twice the old capacity, or 1.25 times the old capacity if the slice is bigger than 1024 elements. This gives amortized O(1) append. It's pretty cool, actually. The theory and practice match up. You can benchmark this. It works.
47 |
48 | : If you know how big your slice needs to be, preallocate it and save yourself the copying and the extra gc pressure.
49 |
50 | : Append is getting faster in 1.8 too. There was a small patch to `runtime.growslice`, the function that actually handles append under the hood, so that it no longer clears the bytes that will be overwritten with the old data. The benchmarks show this as ~10-15% faster.
51 |
52 | * Homework
53 |
54 | .link https://blog.golang.org/go-slices-usage-and-internals
55 | .link https://blog.golang.org/slices
56 |
57 | .link https://github.com/golang/go/blob/master/src/runtime/slice.go runtime/slice.go
58 |
59 | : So, slices are very convenient. Dynamically sized arrays. Amortized O(1) append(). Bounds checked on access. Length and capacity are part of the slice so they never get out of sync.
60 |
61 | : All the basics are covered in these two blog posts.
62 |
63 | : But for today's talk I only really care about the data field. The large chunk of contiguous bytes in memory.
64 |
65 | * Moore's Law
66 |
67 | .image images/Part6-CPU-and-RAM-speeds-563x300.jpg 500 _
68 |
69 | : First, a brief digression on computer hardware.
70 |
71 | : Moore's Law says that the number of transistors on a processor doubles roughly every two years. You can see on this graph where the single core performance plateaued. We're now scaling things by adding more cores. Which of course is why languages like Go are interesting.
72 |
73 | : The bottom line on the graph is the rate at which RAM speed is increasing. Much much slower.
74 |
75 | : So instead processors add levels of caches to try to hide this massive speed difference.
76 |
77 | * Numbers Every Programmer Should Know
78 |
79 | .code images/latency.txt
80 |
81 | 3GHz * 4 instructions per cycle = 12 instructions per nanosecond
82 |
83 | : We've all seen this slide in various forms. The original numbers were from 2003, but there have been a few updates since then.
84 |
85 | : We see the timings for the different layers of CPU caches. Order of magnitude bigger, but order of magnitude slower.
86 |
87 | : My laptop has 2 x 32kB L1 cache; 2 x 256kB L2 cache, 3MB L3 cache. And 16GB of main memory.
88 |
89 | : Some quick calculations. Let's say you have a 3GHz processor. That's three cycles every nanosecond. If your processor can do 4 instructions per cycle, then in the 100ns it takes to fetch a cacheline from main memory you've just stalled for 1200 instructions. You can do a lot of work in 1200 instructions.
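: The arithmetic above, written out as a small sketch. `stalledInstructions` is a hypothetical helper, and the 4 instructions per cycle figure is the talk's round number rather than a measurement.

```go
package main

import "fmt"

// stalledInstructions estimates how many instructions a core could have
// issued while waiting stallNs nanoseconds for memory.
func stalledInstructions(clockGHz, instrPerCycle, stallNs float64) float64 {
	// A clock rate in GHz is cycles per nanosecond, so the units cancel.
	return clockGHz * instrPerCycle * stallNs
}

func main() {
	// 3 cycles/ns * 4 instructions/cycle * 100 ns
	fmt.Println(stalledInstructions(3, 4, 100))
}
```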
90 |
91 | : So, while we can process data faster than ever before, the bottleneck is now getting the data to the processor.
92 |
93 | : Which means a program that plays nicely with the different layers of caching is going to be faster than one that can't. And as the latency numbers tell us, orders of magnitude faster.
94 |
95 | * Cache Lines
96 |
97 | .code cacheline/cache.go /START OMIT/,/END OMIT/
98 |
99 | : Here's some code. It allocates 128M of int32s, does some math with every element, then returns.
100 |
101 | : Then I do the same thing but for only every second element. I'm doing half the work, it should be twice as fast.
102 |
103 | : But it's not.
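: The talk's `cacheline/cache.go` is not reproduced here, so this is only a sketch of the kind of loop being described. `touch`, the array size, and the stride range are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// touch multiplies every stride-th element by 3 and returns how many
// elements it modified.
func touch(arr []int32, stride int) int {
	n := 0
	for i := 0; i < len(arr); i += stride {
		arr[i] *= 3
		n++
	}
	return n
}

func main() {
	arr := make([]int32, 1<<24) // 64 MiB of int32s; the size is an assumption
	touch(arr, 1)               // warm-up pass so page faults don't skew the first timing

	for stride := 1; stride <= 64; stride *= 2 {
		start := time.Now()
		n := touch(arr, stride)
		fmt.Printf("stride %2d: touched %8d elements in %v\n", stride, n, time.Since(start))
	}
}
```

: Each doubling of the stride halves the number of elements touched. Timings depend on the machine, so none are quoted here; run it and compare the strides yourself.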
104 |
105 | * Cache Lines
106 |
107 | .image cacheline/cache.png 600 _
108 |
109 | : In order to access a single element, we've had to pull in an entire cacheline. In this case, 64-bytes, or 16 4-byte integers. Accessing the other elements in the same cache line is basically free.
110 |
111 | : The expensive thing is accessing a new cacheline. It's only once we're accessing fewer cache lines, that our program starts speeding up.
112 |
113 | : This graph looks roughly the same for int16s and int64s. Different numbers of elements, but still 64-bytes.
114 |
115 | : Accessing elements on the same cacheline is free. Let's see how to use this.
116 |
117 | : But first, some theory.
118 |
119 | * Big-O Notation
120 |
121 | .image images/BigO.png 500 _
122 |
123 | : Back to school. Algorithms. Complexity analysis. We've all had these facts pounded into our brains.
124 |
125 | : accessing a map element is O(1); binary search is O(log n); iterating over a list is O(n); sorting is O(n log n)
126 |
127 | : Two things that people forget when discussing big-O notation:
128 |
129 | : One: there's a constant factor involved. Two algorithms which have the same algorithmic complexity can have different constant factors. Imagine looping over a list 100 times vs just looping over it once. Even though both are O(n), one has a constant factor that's 100 times higher.
130 |
131 | : These constant factors are why, even though merge sort, quicksort, and heapsort are all O(n log n) on average, everybody uses quicksort because it's the fastest. It has the smallest constant factor.
132 |
133 | : The second thing that people forget is that big-O only says "as n grows to infinity". It says nothing about small n. "As the numbers get big, this is the growth factor that will dominate the run time."
134 |
135 | : There's frequently a cut-off point below which a dumber algorithm is faster. A nice example comes from the Go standard library's `sort` package. Most of the time it's using quicksort, but it has a shell-sort pass then insertion sort when the partition size drops below 12 elements.
136 |
137 | * Map vs Linear Search
138 |
139 | .image map/list-vs-map1.png 600 _
140 |
141 | : Enough theory. Back to some code.
142 |
143 | : Let's talk about storing a set of ints.
144 |
145 | : If I just had to remember a single integer, I'd use a variable. A map is overkill. What if I had two integers? 10 integers? Where's the cut-off where a map becomes faster than searching through an array of integers?
146 |
147 | : I've benchmarked this and here's the graph. From a performance perspective, until you have about 30 elements, a slice is faster.
148 |
149 | : This obviously comes with a lot of caveats. An interesting one is that this only works with integers. Strings require not only an additional pointer dereference (meaning the body of the string needs to be fetched into the cache), but comparing two strings also takes longer than comparing two integers. You have to iterate over all the bytes.
150 |
151 | : Here the constant factor shifts. For strings, a map is *always* faster.
152 |
153 | : Next, what about if it's sorted?
154 |
155 | : Using sort.Search() is expensive because of the function call overhead -- you can't really get beyond 10-20 elements before the map is faster. However, you can inline the binary search code. Always dangerous. It's notoriously tricky to get right. That's why there's one in the standard library called sort.Search().
156 |
157 | * Map vs Binary Search
158 |
159 | .image map/list-vs-map2.png 600 _
160 |
161 | : However, when performance is important, you do dangerous things. With a custom binary search, on my laptop, I got up to about 1500 integers before a map is consistently faster. And in fact a binary search in a regular sorted array isn't that cache friendly. There are other layouts that can make it even faster.
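: A sketch of the three lookups being compared: a linear scan, a hand-inlined binary search, and a map. The function names are assumptions and the talk's benchmark code is not shown, so this only illustrates the shape of each approach.

```go
package main

import "fmt"

// containsLinear scans the slice front to back.
func containsLinear(xs []int, x int) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// containsSorted is a hand-inlined binary search; xs must be sorted ascending.
func containsSorted(xs []int, x int) bool {
	lo, hi := 0, len(xs)
	for lo < hi {
		mid := int(uint(lo+hi) >> 1) // midpoint without signed overflow
		if xs[mid] < x {
			lo = mid + 1
		} else {
			hi = mid
		}
	}
	return lo < len(xs) && xs[lo] == x
}

func main() {
	xs := []int{2, 3, 5, 7, 11, 13}
	set := make(map[int]struct{}, len(xs))
	for _, v := range xs {
		set[v] = struct{}{}
	}
	for _, q := range []int{5, 6} {
		_, inMap := set[q]
		fmt.Println(q, containsLinear(xs, q), containsSorted(xs, q), inMap)
	}
}
```

: All three answer the same question; where each one wins depends on the element count and type, which is why the cut-offs have to be benchmarked on your own data.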
162 |
163 | * Slice vs Linked-List
164 |
165 | 5, 1, 4, 2
166 |
167 | 5
168 | 1 5
169 | 1 4 5
170 | 1 2 4 5
171 |
172 | 1, 2, 0, 0
173 |
174 | 1 2 4 5
175 | 1 4 5
176 | 1 4
177 | 4
178 |
179 |
180 | : A slightly more complicated example here. This is from a great article by Bjarne Stroustrup called "Software Development for Infrastructure".
181 |
182 | : The problem statement:
183 |
184 | : First: generate N random integers and insert them in sorted order into a sequence
185 |
186 | : Then: remove them one at a time by selecting a random position from the sequence
187 |
188 | : Let's consider two standard implementations here: a slice and a linked-list.
189 |
190 | : For a slice, we have to find the right position, which we can do with a binary search which is easy enough. Inserting the element though requires shifting every element over one spot. Same thing for removal. Easy enough to find the spot, but then expensive to remove.
191 |
192 | : For a linked-list we have to find the spot, which takes linear time, but the insert or deletion itself is only O(1). That's much faster.
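: A sketch of the slice side of this experiment, using the numbers from the slide. `insertSorted` and `removeAt` are hypothetical helpers, not the talk's benchmark code.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted finds the position with a binary search, then shifts the
// tail one slot to the right to make room for v.
func insertSorted(s []int, v int) []int {
	i := sort.SearchInts(s, v)
	s = append(s, 0)     // grow by one element
	copy(s[i+1:], s[i:]) // shift the tail right
	s[i] = v
	return s
}

// removeAt shifts the tail one slot to the left over position i.
func removeAt(s []int, i int) []int {
	copy(s[i:], s[i+1:])
	return s[:len(s)-1]
}

func main() {
	var s []int
	for _, v := range []int{5, 1, 4, 2} {
		s = insertSorted(s, v)
		fmt.Println(s)
	}
	for _, i := range []int{1, 2, 0, 0} {
		s = removeAt(s, i)
		fmt.Println(s)
	}
}
```

: Both shifts are a single `copy` over contiguous memory, which is the operation the rest of this section argues the CPU is very good at.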
193 |
194 | : So what does it look like when we run it?
195 |
196 | * Slice vs Linked-List
197 |
198 | .image insert/insert-log.png 600 _
199 |
200 | : Well, not so good for linked-lists. I ran this up to 500k elements. The linked-list never caught up, and in fact just kept getting worse.
201 |
202 | : All those O(n) list traversals were chasing pointers. Which end up being cache misses. The constant factor on that is pretty high.
203 |
204 | : In the slice version, the CPU is *very* good at copying memory around. The constant factor for that is about as low as it can get.
205 |
206 | : And even if we don't use a binary search, just a regular linear scan through the array, the CPU will prefetch all that memory with no stalls. It's still faster.
207 |
208 | : This graph is a little misleading because it's log-scale on both axes. Here's what happens if I make the Y axis linear.
209 |
210 | * Slice vs Linked-List
211 |
212 | .image insert/insert.png 600 _
213 |
214 | : At 200k elements, the slice version took 12s and the linked list version took 4m8s.
215 | : At 500k elements, the slice version took 1m22s and I killed the linked list version because I was tired of waiting.
216 |
217 | : Now, obviously this is an artificial example. But this does happen in practice. Algorithms with faster theoretical numbers which are drowned out by the constant factors due to pointer chasing and poor use of cache.
218 |
219 | * Conclusions
220 |
221 | * Optimizing for Cache Usage
222 |
223 | - Use kcachegrind and perf to count cache misses
224 |
225 | - Store less
226 |
227 | - Access memory in a predictable manner
228 |
229 | : So, what can we learn from this. Is the conclusion "always use a slice"? Kind of.
230 |
231 | : Like all benchmarks, these ones have flaws. The real issue here is that in all these cases we're not doing any *work*. Our benchmarks are entirely dominated by data access time.
232 |
233 | : If we did some computation at each step, even something as dumb as calling a very fast random number generator, we could very easily shift our program from being bound by data access to being bound by computation.
234 |
235 | : We know how to optimize CPU bound programs. Profile. Choose a more efficient algorithm. Do less work.
236 |
237 | : How do you optimize programs bound by data access?
238 |
239 | : First, you have to recognize this is happening. Pprof will tell you which routines are taking up the most time. But it's hard to know if that time is wasted because of cache misses. kcachegrind will do this for you. Perf will do this for you. Once you have some rough numbers, then you can begin.
240 |
241 | : Store the minimum amount of data you need. You can fit more small things into the cache. I had one program I sped up 15% with a single change of a slice from int64s to int32s. With a smaller ID, I could fit more into my processor cache which made searching through them faster.
242 |
243 | : While I've been talking mostly about slices, this applies to structs too. Remove extra fields you don't need. Re-order fields to eliminate struct padding. There are tools to help you with this.
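: A small sketch of the field re-ordering point. The `padded` and `packed` types are made up for illustration; the exact sizes depend on the platform's alignment rules, so the sketch only reports them rather than promising particular numbers.

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded interleaves small and large fields, so the compiler has to
// insert padding to keep the int64 aligned.
type padded struct {
	a bool
	b int64
	c bool
}

// packed holds the same fields, largest first, leaving less padding.
type packed struct {
	b int64
	a bool
	c bool
}

// sizes returns the in-memory size of each layout in bytes.
func sizes() (uintptr, uintptr) {
	return unsafe.Sizeof(padded{}), unsafe.Sizeof(packed{})
}

func main() {
	p, q := sizes()
	fmt.Printf("padded: %d bytes, packed: %d bytes\n", p, q)
}
```

: The same three fields take less room once the widest one comes first, which means more structs per cache line.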
244 |
245 | : Next, access memory in a predictable manner.
246 |
247 | : Searching through the array linearly is predictable access. Shifting the elements over one is predictable. Traversing the linked-list is not.
248 |
249 | : Modern processors have a prefetcher that looks for patterns in memory access and fills the cache with data so you don't have to wait as long. But it can't predict pointer traversals.
250 |
251 | : So, now you know how to optimize for cache usage.
252 |
253 | : But should you actually do this?
254 |
255 | * (Real) Conclusion
256 |
257 | "Notes on Programming in C" (Rob Pike, 1989)
258 |
259 | Rule 3. Fancy algorithms are slow when n is small, and n is usually small.
260 | Fancy algorithms have big constants. Until you know that n is frequently going
261 | to be big, don't get fancy.
262 |
263 | Rule 4. Fancy algorithms are buggier than simple ones, and they're much
264 | harder to implement. Use simple algorithms as well as simple data structures.
265 |
266 | : The point I really want to leave you with is this: modern computers and compilers have gotten very good at executing simple, dumb algorithms quickly. So that's where you should start. These benchmarks, as artificial as they were, are indications of why. Simple algorithms tend not to have too much baggage. A list of things, maybe a counter or two. Simple algorithms tend to have predictable access patterns. Scanning through arrays. Copying values around.
267 |
268 | : Here are rules from Rob Pike's "Notes on Programming in C".
269 |
270 | : What `n` needs to be in order to be considered "small" is getting bigger all the time.
271 |
272 | : Use simple algorithms and simple data structures.
273 |
274 | : And it might still be fast.
275 |
276 | : Thanks.
277 |
--------------------------------------------------------------------------------
/fuzzing/DroidSerif.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/fuzzing/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/fuzzing/remark-latest.min.js:
--------------------------------------------------------------------------------
1 | ../remarkjs/remark-latest.min.js
--------------------------------------------------------------------------------
/fuzzing/slides.css:
--------------------------------------------------------------------------------
1 | /* Slideshow styles */
2 |
3 | @font-face {
4 | font-family: 'Droid Serif';
5 | font-style: normal;
6 | font-weight: 400;
7 | src: local('Droid Serif'), local('DroidSerif'), url(DroidSerif.ttf) format('truetype');
8 | }
9 |
10 |
11 | @font-face {
12 | font-family: 'Yanone Kaffeesatz';
13 | font-style: normal;
14 | font-weight: 400;
15 | src: local('Yanone Kaffeesatz Regular'), local('YanoneKaffeesatz-Regular'), url(YanoneKaffeesatz-Regular.ttf) format('truetype');
16 | }
17 |
18 | body { font-family: 'Droid Serif'; font-size: 1.5em; }
19 | h1, h2, h3 {
20 | font-family: 'Yanone Kaffeesatz';
21 | font-weight: 400;
22 | margin-bottom: 0;
23 | }
24 | h1 { font-size: 3em; }
25 | h2 { font-size: 2em; }
26 | h3 { font-size: 1.5em; }
27 | .footnote {
28 | position: absolute;
29 | bottom: 3em;
30 | }
31 | li p { line-height: 1.25em; }
32 | .red { color: #fa0000; }
33 | .large { font-size: 2em; }
34 | a, a > code {
35 | color: rgb(249, 38, 114);
36 | text-decoration: none;
37 | }
38 | .inverse {
39 | background: #272822;
40 | color: #777872;
41 | text-shadow: 0 0 20px #333;
42 | }
43 | .inverse h1, .inverse h2 {
44 | color: #f3f3f3;
45 | line-height: 0.8em;
46 | }
47 |
--------------------------------------------------------------------------------
/fuzzing/slides.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | randomized testing
5 |
6 |
8 |
9 |
10 |
11 |
14 |
15 |
16 |
--------------------------------------------------------------------------------
/fuzzing/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## randomized testing for go
4 |
5 | ---
6 |
7 | ## randomized testing
8 |
9 | - `testing/quick`
10 |
11 | - `github.com/dvyukov/go-fuzz`
12 |
13 | - plus some others
14 |
15 | ---
16 |
17 | # history
18 |
19 | - University of Wisconsin Madison in 1989 by Professor Barton Miller and his students
20 |
21 | - "An Empirical Study of the Reliability of UNIX Utilities" (1990) and "Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services" (1995)
22 |
23 | - some bugs reported in 1990 were still present in 1995
24 |
25 | - mid-2000s picked up by the security community
26 |
27 | ---
28 |
29 | # why we care
30 |
31 | - writing tests is boring
32 |
33 | - humans write biased tests
34 |
35 | - have the computer write them for you
36 |
37 | ---
38 |
39 | # types of random testing
40 |
41 | - property-based testing
42 |
43 | - generational fuzzing
44 |
45 | - mutational fuzzing
46 |
47 | - stateful testing
48 |
49 | ---
50 |
51 | # testing/quick
52 |
53 | ```go
54 | func TestQuick(t *testing.T) {
55 | q := func(i, j int) bool {
56 | quo, rem := Div(i, j)
57 | return i == quo*j+rem
58 | }
59 |
60 | if err := quick.Check(q, nil); err != nil {
61 | t.Error(err)
62 | }
63 | }
64 | ```
65 |
66 | ---
67 |
68 | # testing/quick
69 |
70 | ```go
71 | func TestQuick(t *testing.T) {
72 | q := func(i, j int) bool {
73 | quo, rem := Div(i, j)
74 | return i == quo*j+rem
75 | }
76 |
77 | if err := quick.Check(q, nil); err != nil {
78 | err := err.(*quick.CheckError)
79 | t.Errorf("iteration %d failed: args=(%v,%v)",
80 | err.Count, err.In[0], err.In[1])
81 | }
82 | }
83 | ```
84 |
85 | ---
86 |
87 | # testing/quick
88 |
89 | ```go
90 | func TestQuick(t *testing.T) {
91 | 	q := func(i, j int) bool {
92 | 		quo, rem := Div(i, j)
93 | 		if got := quo*j + rem; got != i {
94 | 			t.Errorf("Div(%d,%d)=(%d,%d), check %d", i, j, quo, rem, got)
95 | 		}
96 | 		return true // failures are already reported through t.Errorf
97 | 	}
98 | 	quick.Check(q, nil)
99 | }
100 | ```
101 |
102 | ---
103 |
104 | # testing/quick
105 |
106 | ```go
107 | type small int
108 |
109 | func (s small) Generate(rand *rand.Rand, size int) reflect.Value {
110 | return reflect.ValueOf(small(rand.Intn(100)))
111 | }
112 |
113 | func TestQuick(t *testing.T) {
114 | f := func(si, sj small) bool {
115 | i, j := int(si), int(sj)
116 | quo, rem := Div(i, j+1)
117 | return i == quo*(j+1)+rem
118 | }
119 |
120 | if err := quick.Check(f, nil); err != nil {
121 | t.Log(err)
122 | }
123 |
124 | }
125 | ```
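The slides never show `Div`, so here is a self-contained sketch of the same property check that can be run outside a test file. `Div` is a hypothetical stand-in for the function under test, and `checkDiv` is an assumed helper name; `quick.Check` and the `Generate` method are the real `testing/quick` API.

```go
package main

import (
	"fmt"
	"math/rand"
	"reflect"
	"testing/quick"
)

// Div is a hypothetical stand-in for the function under test.
func Div(a, b int) (quo, rem int) { return a / b, a % b }

// small restricts generated values to the range [0, 100).
type small int

// Generate implements quick.Generator for small.
func (small) Generate(r *rand.Rand, size int) reflect.Value {
	return reflect.ValueOf(small(r.Intn(100)))
}

// checkDiv runs the property "i == quo*d + rem" over random inputs.
func checkDiv() error {
	prop := func(si, sj small) bool {
		i, d := int(si), int(sj)+1 // divisor is in [1, 100], never zero
		quo, rem := Div(i, d)
		return i == quo*d+rem
	}
	return quick.Check(prop, nil)
}

func main() {
	if err := checkDiv(); err != nil {
		fmt.Println("property failed:", err)
		return
	}
	fmt.Println("property held for all generated inputs")
}
```

Shifting the divisor by one is what keeps the generator from ever producing a division by zero, so the property has to use the same shifted value.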
126 |
127 | ---
128 |
129 | # github.com/google/gofuzz
130 |
131 | - split off from kubernetes
132 |
133 | - random value generation
134 |
135 | - register handlers for each type
136 |
137 | - use as a component for other tests
138 |
139 | ---
140 |
141 | # generational fuzzing
142 |
143 | - lots of packages
144 |
145 | - might need to write your own if your language is "tricky"
146 |
147 | ---
148 |
149 | # github.com/zimmski/tavor
150 |
151 | - framework with lots of features
152 |
153 | - grammar based fuzzing
154 |
155 | ---
156 |
157 | # github.com/zimmski/tavor
158 |
159 | ```
160 | START = target
161 | target = metric | function
162 | function = name "(" arguments ")"
163 | name = ([A-Za-z]) *([\w])
164 | arguments = argument *( "," argument )
165 | argument = metric | function | qstring | number
166 | qstring = "\"" +([\w]) "\""
167 | number = +([0-9]) | +([0-9]) "." +([0-9])
168 | metric = name *( "." +([\w]) )
169 | ```
170 |
171 | ---
172 |
173 | # github.com/MozillaSecurity/dharma
174 |
175 | ```
176 | target :=
177 | +metric+
178 | +function+
179 |
180 | function :=
181 | +node+(%repeat%(+argument+, ","))
182 |
183 | argument :=
184 | +metric+
185 | +function+
186 | +qstring+
187 | +common:integer+
188 | +common:decimal_number+
189 |
190 | node :=
191 | %repeat%(+alpha+)%repeat%(+word+)
192 |
193 | qstring :=
194 | "+common:text+"
195 |
196 | metric :=
197 | %repeat%(+node+, ".")
198 | ```
199 |
200 | ---
201 |
202 | # sequitur
203 |
204 | - dgryski/go-sequitur
205 |
206 | ---
207 |
208 | # github.com/dvyukov/go-fuzz
209 |
210 | - dvyukov
211 |
212 | - mutation fuzzing
213 |
214 | - coverage guided
215 |
216 | - file formats, protocols, parsing *anything*
217 |
218 | - based on afl
219 |
220 | - understands simple text protocols
221 |
222 | - clustering mode
223 |
224 | ---
225 |
226 | # github.com/dvyukov/go-fuzz
227 |
228 | - cd github.com/user/package
229 | - mkdir corpus && cp inputs* corpus
230 | - go-fuzz-build && go-fuzz
231 |
232 | ---
233 |
234 | # github.com/dvyukov/go-fuzz
235 |
236 | - `workdir/corpus`
237 |
238 | - `test1.in`
239 | - ``
240 |
241 | --
242 |
243 | - `workdir/crashers`
244 |
245 | - ``
246 | - `.quoted`
247 | - `.output`
248 |
249 | --
250 |
251 | - `workdir/suppressions`
252 |
253 | ---
254 |
255 | # github.com/dvyukov/go-fuzz
256 |
257 | ```go
258 | func Fuzz(data []byte) int {
259 | if _, err := Decode(data); err != nil {
260 | return 0
261 | }
262 |
263 | return 1
264 | }
265 | ```
266 |
267 | ---
268 |
269 | # github.com/dvyukov/go-fuzz
270 |
271 | ```go
272 | func Fuzz(data []byte) int {
273 | packed := Encode(data)
274 |
275 | var unpacked []byte
276 | var err error
277 | if unpacked, err = Decode(packed); err != nil {
278 | panic("unpacking packed data failed")
279 | }
280 |
281 | if !bytes.Equal(unpacked, data) {
282 | panic("roundtrip failed")
283 | }
284 |
285 | return 1
286 | }
287 | ```
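A self-contained sketch of the round-trip check. `Encode` and `Decode` are hypothetical stand-ins built on `encoding/hex`, since the slide does not say which codec is under test; only the `Fuzz(data []byte) int` signature comes from go-fuzz.

```go
package main

import (
	"bytes"
	"encoding/hex"
	"fmt"
)

// Encode is a hypothetical codec under test: hex-encode the input.
func Encode(data []byte) []byte {
	out := make([]byte, hex.EncodedLen(len(data)))
	hex.Encode(out, data)
	return out
}

// Decode reverses Encode and reports malformed input as an error.
func Decode(packed []byte) ([]byte, error) {
	out := make([]byte, hex.DecodedLen(len(packed)))
	n, err := hex.Decode(out, packed)
	return out[:n], err
}

// Fuzz panics if decoding an encoded input fails or changes the bytes.
func Fuzz(data []byte) int {
	unpacked, err := Decode(Encode(data))
	if err != nil {
		panic("unpacking packed data failed")
	}
	if !bytes.Equal(unpacked, data) {
		panic("roundtrip failed")
	}
	return 1
}

func main() {
	fmt.Println(Fuzz([]byte("hello, fuzzer")))
}
```

Because every input is valid for the encoder, the fuzzer never wastes time on rejected inputs; any panic is a real bug in one half of the pair.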
288 |
289 | ---
290 |
291 | # github.com/dvyukov/go-fuzz
292 |
293 | ```go
294 | func Fuzz(data []byte) int {
295 | fast := EncodeFast(data)
296 | slow := EncodeSlow(data)
297 |
298 | if !bytes.Equal(fast, slow) {
299 | panic("behaviour mismatch")
300 | }
301 |
302 | return 1
303 | }
304 | ```
305 |
306 | ---
307 |
308 | # github.com/dvyukov/go-fuzz
309 |
310 | ```go
311 | func Fuzz(data []byte) int {
312 | purego := hashGo(data)
313 | cgo, asm := C.Hash(data), hashAsm(data)
314 |
315 | if !bytes.Equal(purego, asm) || !bytes.Equal(purego, cgo) {
316 | panic("behaviour mismatch")
317 | }
318 |
319 | return 1
320 | }
321 | ```
322 |
323 | ---
324 |
325 | # stateful fuzzing
326 |
327 | - ideas from functional programming
328 |
329 | - verifying invariants and post-conditions on API calls
330 |
331 | - trying to figure out what a Go version would look like
332 |
333 | ---
334 |
335 | # stateful fuzzing
336 |
337 | - list of API calls
338 |
339 | ```
340 | func (h heap) Put(key, prio int) error { ... }
341 | func (h heap) Min() (int, error) { ... }
342 | func (h heap) Len() int { ... }
343 | func (h heap) Cap() int { ... }
344 | ```
345 |
346 | ---
347 |
348 | # stateful fuzzing
349 |
350 | - list of API calls with claims
351 |
352 | ```
353 | "put": {
354 | pre: func() bool { return h.Len() < h.Cap() },
355 | call: h.Put,
356 | post: func() bool { return h.Len() != 0 },
357 | }
358 | ```
359 |
360 | ---
361 |
362 | # stateful fuzzing
363 |
364 | - list of API calls with models
365 |
366 | ```
367 | "put": {
368 | pre: func() bool { return h.Len() < h.Cap() },
369 | call: func(x, prio int) { h.Put(x, prio); dumbHeap.Put(x, prio) },
370 | post: func() bool { return h.Len() == dumbHeap.Len() },
371 | }
372 | ```
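One way a Go version of the model-based check might look, as a sketch under assumptions: `bounded` is a made-up fixed-capacity min-heap standing in for the system under test, `model` is the "dumb" reference, and the key argument from the slide's `Put(key, prio)` is dropped to keep it short.

```go
package main

import (
	"container/heap"
	"errors"
	"fmt"
	"math/rand"
	"sort"
)

// intHeap is the container/heap plumbing for a min-heap of ints.
type intHeap []int

func (h intHeap) Len() int            { return len(h) }
func (h intHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h intHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *intHeap) Push(x interface{}) { *h = append(*h, x.(int)) }
func (h *intHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// bounded is the hypothetical system under test.
type bounded struct {
	h        intHeap
	capacity int
}

func (b *bounded) Len() int { return len(b.h) }
func (b *bounded) Put(prio int) error {
	if len(b.h) >= b.capacity {
		return errors.New("heap full")
	}
	heap.Push(&b.h, prio)
	return nil
}
func (b *bounded) Min() (int, error) {
	if len(b.h) == 0 {
		return 0, errors.New("heap empty")
	}
	return heap.Pop(&b.h).(int), nil
}

// model is the obviously-correct reference: a slice kept sorted.
type model struct{ xs []int }

func (m *model) Len() int     { return len(m.xs) }
func (m *model) Put(prio int) { m.xs = append(m.xs, prio); sort.Ints(m.xs) }
func (m *model) Min() int     { x := m.xs[0]; m.xs = m.xs[1:]; return x }

// runModel applies random calls to both and compares after every step.
func runModel(seed int64, steps, capacity int) error {
	r := rand.New(rand.NewSource(seed))
	sut, ref := &bounded{capacity: capacity}, &model{}
	for i := 0; i < steps; i++ {
		if r.Intn(2) == 0 {
			if sut.Len() >= capacity { // pre-condition for put
				continue
			}
			p := r.Intn(1000)
			if err := sut.Put(p); err != nil {
				return fmt.Errorf("step %d: put(%d): %v", i, p, err)
			}
			ref.Put(p)
		} else {
			if sut.Len() == 0 { // pre-condition for min
				continue
			}
			got, err := sut.Min()
			if want := ref.Min(); err != nil || got != want {
				return fmt.Errorf("step %d: min()=%d, %v; want %d", i, got, err, want)
			}
		}
		if sut.Len() != ref.Len() { // post-condition
			return fmt.Errorf("step %d: len %d, model %d", i, sut.Len(), ref.Len())
		}
	}
	return nil
}

func main() {
	fmt.Println(runModel(1, 10000, 8))
}
```

The seed is passed in explicitly so that a failing sequence of calls can be replayed exactly.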
373 |
374 | ---
375 |
376 | # stateful fuzzing
377 |
378 | - list of API calls that leads to a failure
379 |
380 | ```
381 | put(1, 12); put(2, 19); getmin(); getmin(); put(4, 31);
382 | ```
383 |
384 | - and minimize it; extract the signal from the noise
385 |
386 | ```
387 | put(1, 12); getmin(); put(4, 31);
388 | ```
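A sketch of one simple minimization strategy: drop one call at a time and keep the shorter sequence whenever it still fails. The `failing` predicate is entirely made up so that the example reproduces the sequences on this slide; in a real harness it would replay the calls against the system under test.

```go
package main

import "fmt"

// minimize greedily removes single operations for as long as the
// remaining sequence still triggers the failure.
func minimize(ops []string, fails func([]string) bool) []string {
	cur := append([]string(nil), ops...)
	for changed := true; changed; {
		changed = false
		for i := 0; i < len(cur); i++ {
			cand := make([]string, 0, len(cur)-1)
			cand = append(cand, cur[:i]...)
			cand = append(cand, cur[i+1:]...)
			if fails(cand) {
				cur = cand
				changed = true
				i-- // re-test whatever slid into position i
			}
		}
	}
	return cur
}

// hasSubsequence reports whether pattern occurs in ops in order,
// not necessarily contiguously.
func hasSubsequence(ops, pattern []string) bool {
	j := 0
	for _, op := range ops {
		if j < len(pattern) && op == pattern[j] {
			j++
		}
	}
	return j == len(pattern)
}

// failing is a hypothetical stand-in for "replaying ops hits the bug".
func failing(ops []string) bool {
	return hasSubsequence(ops, []string{"put(1, 12)", "getmin()", "put(4, 31)"})
}

func main() {
	ops := []string{"put(1, 12)", "put(2, 19)", "getmin()", "getmin()", "put(4, 31)"}
	fmt.Println(minimize(ops, failing))
}
```

This greedy pass is not guaranteed to find the globally shortest sequence, but it is easy to reason about and each candidate costs only one replay.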
389 |
390 | ---
391 |
392 | # stateful fuzzing
393 |
394 | - dgryski/go-tinymap: real-world custom example
395 |
396 | - mschoch/smat: sadly abandoned repo
397 |
398 | ---
399 |
400 | # more reading
401 |
402 | - randomized test case generation
403 |
404 | - test case minimization
405 |
406 | - https://www.fuzzingbook.org/
407 |
408 | ---
409 | class: center, middle, inverse
410 |
411 | ## fin
412 |
413 | ???
414 |
415 | vim: ft=markdown
416 |
--------------------------------------------------------------------------------
/golanguk-2015/1percentrule.svg:
--------------------------------------------------------------------------------
1 |
2 |
35 |
--------------------------------------------------------------------------------
/golanguk-2015/41HXBdOgdNL.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/41HXBdOgdNL.jpg
--------------------------------------------------------------------------------
/golanguk-2015/ChartGo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/ChartGo.png
--------------------------------------------------------------------------------
/golanguk-2015/Community_title.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/Community_title.jpg
--------------------------------------------------------------------------------
/golanguk-2015/DroidSerif.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/golanguk-2015/VLU.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/VLU.png
--------------------------------------------------------------------------------
/golanguk-2015/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/golanguk-2015/bill-ted2.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/bill-ted2.jpg
--------------------------------------------------------------------------------
/golanguk-2015/canadian-flag-mosaic-by-tim-van-horn-2010.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/canadian-flag-mosaic-by-tim-van-horn-2010.jpg
--------------------------------------------------------------------------------
/golanguk-2015/dgryski-simpsons.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/dgryski-simpsons.png
--------------------------------------------------------------------------------
/golanguk-2015/golang-trends.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/golang-trends.jpg
--------------------------------------------------------------------------------
/golanguk-2015/gopherconf.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/gopherconf.png
--------------------------------------------------------------------------------
/golanguk-2015/martini.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/martini.jpeg
--------------------------------------------------------------------------------
/golanguk-2015/remark-latest.min.js:
--------------------------------------------------------------------------------
1 | ../remarkjs/remark-latest.min.js
--------------------------------------------------------------------------------
/golanguk-2015/slides.css:
--------------------------------------------------------------------------------
1 | /* Slideshow styles */
2 |
3 | @font-face {
4 | font-family: 'Droid Serif';
5 | font-style: normal;
6 | font-weight: 400;
7 | src: local('Droid Serif'), local('DroidSerif'), url(DroidSerif.ttf) format('truetype');
8 | }
9 |
10 |
11 | @font-face {
12 | font-family: 'Yanone Kaffeesatz';
13 | font-style: normal;
14 | font-weight: 400;
15 | src: local('Yanone Kaffeesatz Regular'), local('YanoneKaffeesatz-Regular'), url(YanoneKaffeesatz-Regular.ttf) format('truetype');
16 | }
17 |
18 | body { font-family: 'Droid Serif'; font-size: 1.5em; }
19 | h1, h2, h3 {
20 | font-family: 'Yanone Kaffeesatz';
21 | font-weight: 400;
22 | margin-bottom: 0;
23 | }
24 | h1 { font-size: 3em; }
25 | h2 { font-size: 2em; }
26 | h3 { font-size: 1.5em; }
27 | .footnote {
28 | position: absolute;
29 | bottom: 3em;
30 | }
31 | li p { line-height: 1.25em; }
32 | .red { color: #fa0000; }
33 | .large { font-size: 2em; }
34 | a, a > code {
35 | color: rgb(249, 38, 114);
36 | text-decoration: none;
37 | }
38 | .inverse {
39 | background: #272822;
40 | color: #777872;
41 | text-shadow: 0 0 20px #333;
42 | }
43 | .inverse h1, .inverse h2 {
44 | color: #f3f3f3;
45 | line-height: 0.8em;
46 | }
47 |
--------------------------------------------------------------------------------
/golanguk-2015/slides.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | the go community
5 |
6 |
8 |
9 |
10 |
11 |
14 |
15 |
16 |
--------------------------------------------------------------------------------
/golanguk-2015/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## the go community
4 |
5 | ???
6 |
7 |
8 | ---
9 |
10 | .center[]
11 |
12 | .center[
13 | @dgryski
14 |
15 | github.com/dgryski
16 |
17 | reddit.com/u/dgryski
18 | ]
19 |
20 | ???
21 |
22 | Hello, my name is Damian, I am a developer at Booking.com. I work in our
23 | Reliability Engineering Group focussing on Infrastructure Monitoring.
24 |
25 | I'd like to thank you all for coming to the conference today, and I'd like to
26 | thank the organizers for giving me the honour of delivering the closing
27 | keynote.
28 |
29 | Now, normally I give fairly technical talks. But not today. Instead, I'm
30 | going to talk about community, which seems to be a common topic for conference
31 | keynotes. I'm much more comfortable giving technical talks. On the other
32 | hand, I'm an extrovert, so I could stand up here and talk about chickens for 30
33 | minutes if I wanted to.
34 |
35 | If you've ever read the golang subreddit, you’ve probably seen my username. If
36 | you’ve tweeted a question with #golang, there’s a good chance you've interacted
37 | with me. And I suppose that's why I was asked to stand up here and talk about
38 | community. I'm a visible member of the community. It probably helps that I'm
39 | an extrovert. I *do* stuff. I certainly didn't start out saying “Hey, I’m
40 | going to get involved in the Go community”, I just sort of fell into it.
41 |
42 | ---
43 |
44 | ## What is Community?
45 |
46 | --
47 |
48 | - Wikipedia:
49 |
50 | --
51 | > Community is an American television sitcom created by Dan Harmon that premiered on NBC on September 17, 2009.
52 |
53 | .center[]
54 |
55 |
56 | ???
57 |
58 | So, what *is* community? Wikipedia says that Community is an American
59 | television sitcom that premiered on NBC in September 2009. But if you read
60 | further, you see it’s about a group of people who come together to form a study
61 | group and their shared experiences. This sounds like a better definition: a
62 | group of people with shared values.
63 |
64 | ---
65 |
66 | ## Do Gophers have community?
67 |
68 | --
69 |
70 | - Wikipedia:
71 |
72 | --
73 | > Gophers are solitary outside of breeding season.
74 |
75 | - A *concurrence* of (Go) Gophers
76 |
77 | ???
78 |
79 | Do Gophers have a community? Back to Wikipedia. “Gophers are solitary outside
80 | of breeding season.” So, basically not. This, by the way, is why there’s no
81 | collective noun for a group of gophers. Although Michael Jones has suggested the
82 | wonderful phrase “a concurrence of gophers”.
83 |
84 | ---
85 |
86 |
87 |
88 | ???
89 |
90 | Our “concurrence” is growing, because our community is growing. And our
91 | community is growing because people are interested in learning more about Go.
92 | Lots of different backgrounds: different languages both computer and spoken,
93 | different levels of knowledge, different levels of experience; some are
94 | self-taught, others have formal training; different areas of focus, different
95 | priorities. *pause*
96 |
97 | Naturally, this means the make-up of our community is changing. And this is
98 | concerning for some. Something drew us all to Go in the first place. We want
99 | to ensure that the things that we enjoy, that brought us here, don't change.
100 |
101 | ---
102 |
103 | ## Code of Conduct
104 |
105 |
106 | .center[
107 |
108 | ]
109 |
110 | ???
111 |
112 | This isn't going to be a talk about Codes of Conduct. There have been a number
113 | of those already. The code of conduct I personally try to live by is “Be
114 | Excellent to Each Other”. If the one-sentence version of yours is sufficiently
115 | different, it's probably worth asking “why”.
116 |
117 | ---
118 | class: center, middle, inverse
119 |
120 | ## "the" go community
121 |
122 | ???
123 |
124 | So, let’s talk about the Go community. Can we even talk about “The” Go
125 | Community?
126 |
127 | ---
128 |
129 | class: center, middle, inverse
130 |
131 | ## "the" go communities
132 |
133 | ???
134 |
135 | I can certainly talk about the Go *communities* I’m a part of. The
136 | gophers at my work, the Amsterdam Go community, perhaps the Slack, Twitter and
137 | Reddit Go communities, maybe IRC and mailing list. But what about “the” Go
138 | community?
139 |
140 | ---
141 |
142 | ## Community Survey Results
143 |
144 | Development Platform
145 |
146 | OS       | Share
147 | -------- | --------
148 | Linux | 46%
149 | OSX | 43%
150 | Windows | 11%
151 |
152 | ???
153 |
154 | Jason Buberel, the Product Manager for the Go Team at Google, posted the
155 | results of a community survey:
156 |
157 | --
158 |
159 | - Where are the Windows users from China?
160 |
161 | - Downloads, Google Trends, "Build Web Applications in Go"
162 |
163 | - Why is Go so popular in China?
164 |
165 | ???
166 |
167 | And yet, these results are at odds with the fact that we *know* the most
168 | popular downloads of the Go installer are Windows users from China. So
169 | clearly, the Windows-using Chinese developers are not talking loudly on the
170 | go-nuts mailing list. Google Trends, also, shows the immense popularity of
171 | 'golang' in China. Before Docker and Kubernetes, the most starred Go
172 | repository was Asta Xie’s book “Build Web Applications in Go”, available only
173 | in Chinese. There was even a blog post trying to answer the question “Why is
174 | Go so popular in China?”
175 |
176 |
177 | ---
178 |
179 | ## GopherChina
180 |
181 | .center[]
182 |
183 | - Qihoo 360 running 10B messages/day with 200M+ real-time connections.
184 | - Baidu running 100B requests/day through Go.
185 |
186 | ???
187 |
188 | One thing to come out of GopherChina conference in Shanghai was the immense
189 | scale some of these Chinese users were running at.
190 |
191 | The rest of us hadn’t heard anything about this, or even hardly anything about
192 | these companies. There was just a feeling Go was taking off in China.
193 |
194 | So clearly, there is a Go community in China. A huge one from the looks of it.
195 | But one that has little in common with the rest of us.
196 |
197 | Which brings up another interesting point.
198 |
199 | ---
200 |
201 | ## Who are "the rest of us" ?
202 |
203 | --
204 |
205 | - All the gophers outside of China
206 |
207 | ???
208 |
209 | We could say "all the gophers outside of China".
210 |
211 | --
212 |
213 | - golang-nuts@, #go-nuts, #golang, /r/golang
214 |
215 | ???
216 |
217 | We have the communities I mentioned earlier: the mailing list, IRC, Twitter, Reddit.
218 |
219 | --
220 |
221 | .center[]
222 |
223 | ???
224 |
225 | People who attend conferences like this one.
226 |
227 | ---
228 |
229 | ## Gophers by the numbers
230 |
231 | .center[]
232 |
233 | ???
234 |
235 | G+ | 20000
236 | golang-nuts | 17000
237 | /r/golang | 16000
238 | GitHub | 14000
239 |
240 | Obviously these are not all independent groups; there’s overlap. But here's an
241 | interesting question: How many people are writing Go code but are not
242 | subscribed to the mailing list? How many people are writing Go code, but don't
243 | read the subreddit? Are these people part of our community? Are they Gophers?
244 | Well, they’re Go developers. Is there a difference? Our community needs to be
245 | accessible to everyone.
246 |
247 | ---
248 |
249 | .center[]
250 |
251 | .footnote["1percentrule" by Life of Riley - Own work. Licensed under CC BY-SA 3.0 via Commons]
252 |
253 | ???
254 |
255 | We know there are more developers than that. There’s a rule of thumb about
256 | participation in online communities: 1:9:90. 1% create content, 9%
257 | contribute, and 90% lurk. 1% write blog posts. 9% comment on Reddit and
258 | Hacker News. 90% lurk. That’s a lot of people we hear nothing from. Are
259 | those 10% who *are* active sampled uniformly at random from the entire
260 | population? Unlikely. So we *know* the Gopher activity we see represented
261 | online is not the entire picture.
262 |
263 | And this doesn’t even include “dark matter developers”, who aren't involved in
264 | the community at *all*, not even to lurk.
265 |
266 | We need to make decisions that don’t just reflect what we see, which is skewed,
267 | but that also try to make sense for cases we haven’t considered. This is what
268 | Russ Cox meant by “do less, to enable more” in his GopherCon keynote.
269 |
270 | ---
271 |
272 | .center[]
273 |
274 | ???
275 |
276 | It’s obviously hard to quantify the needs of "dark matter gophers". But
277 | they’re still writing code. They still consider themselves part of the Go
278 | community. In fact, these people you don't see or hear from, probably make up
279 | the majority of users. They have an impact on adoption. They have an impact
280 | on mindshare. We might call them the subconscious mind of the Programmer
281 | community.
282 |
283 | We don’t know how many of them are out there. We don’t really even know what
284 | they’re using Go for. What do you do if most of your community is invisible?
285 | We’re not getting bug reports from them. We’re not getting patches from them.
286 | How skewed is our view of the world without them?
287 |
288 | How skewed is our view of the world without more insight into what's happening
289 | in China?
290 |
291 |
292 | ---
293 |
294 | ## What do Go programmers want from their community?
295 |
296 | --
297 |
298 | - meeting like-minded people
299 | - finding great content
300 | ???
301 |
302 | Why do people get involved in an online community anyway? Luckily, there’s
303 | research on this topic. Two common motivators are: “meeting like-minded
304 | people” and “finding great content”.
305 |
306 | These are pretty generic reasons, but they apply offline too. Why are you here
307 | today? Maybe your company sent you here as a perk and you don’t really care.
308 | But probably it was one, or perhaps both of these. You want to meet some other
309 | gophers, and you want to learn some things.
310 |
311 | “We like people who are like us.” A principle from social psychology of
312 | influence. When you meet some gophers, you already have something in common,
313 | already have a connection. Since they like Go, it’s likely they share your beliefs
314 | on what being a Gopher entails. They’re part of your tribe.
315 |
316 | --
317 |
318 | - technical community
319 | - learn
320 | - recommendations for packages, solutions
321 | - upcoming changes, bugs
322 |
323 | ???
324 |
325 | At the same time, we are a technical community. We want places where you can
326 | get help: where we can learn, without being made to feel stupid.
327 | Recommendations for packages to use, or the best approach to solve a problem.
328 | A place to discuss other Go-related topics. Upcoming changes in the new
329 | release. Bugs.
330 |
331 | We want our communities to be friendly places.
332 |
333 | ---
334 |
335 | ## What do we believe?
336 |
337 | --
338 |
339 | - Code
340 | - gofmt
341 | - if err != nil { ... }
342 | - go vet
343 | - go build -race
344 | - golint (CodeReviewComments)
345 | - godoc.org/github.com/user/package
346 | - go get
347 | - interface{}
348 |
349 | --
350 |
351 | - Social
352 | - Simple is better than complex
353 | - Performance matters
354 | - Costs are visible
355 |
356 | ???
357 |
358 | This is the culture we want to preserve. These are the values that brought
359 | many of us to Go in the first place. These are the ideas we want to propagate
360 | to newcomers.
361 |
362 | This to some extent *is* the underlying “gopherness” we want to keep in the
363 | community. This is what we’re scared of losing. We want the community to
364 | respect these beliefs, because that's what's important to us.
365 |
366 | The Go community is seeing an influx of new people, and many of them are trying
367 | to bring their own development habits, informed by their experiences in
368 | previous languages and previous communities, to ours. Go’s Eternal September,
369 | if you like.
370 |
371 | ---
372 | class: center, middle, inverse
373 |
374 | ## change
375 |
376 | ---
377 |
378 | ## Accents
379 |
380 | ???
381 |
382 | Francesc Campoy-Flores gave a great talk at dotGo last year on accents, both in spoken
383 | languages and computer languages. French with a Spanish accent. Go with a Java
384 | accent, or Go with a Python accent.
385 |
386 | --
387 |
388 | - BBC English
389 |
390 | - "Mountain View" Go
391 |
392 | ???
393 |
394 | Consider, then, BBC English. Received Pronunciation. The equivalent for
395 | Gophers would be the “accent” of Mountain View. The “proper” way to speak.
396 | The “proper” way to code.
397 |
398 | ---
399 |
400 |
401 |
402 | .footnote[Canadian flag mosaic by Tim Van Horn, Canadian Mosaic Project]
403 |
404 | ???
405 |
406 | There's an interesting Canadian term that applies here: “Cultural mosaic”.
407 | It’s the mix of ethnic groups, languages and cultures that coexist within
408 | society. We use it when discussing government policy on multiculturalism. It
409 | stands in opposition to the “melting pot” metaphor that’s often used in
410 | American discussions on cultural assimilation.
411 |
412 | For a while, the Go community ideal *was* cultural assimilation.
413 |
414 | Write code like the standard library. Idiomatic code above all else. We
415 | wanted people to conform. You must write code like this. You must drink this
416 | kool-aid.
417 |
418 | As Russ Cox pointed out in his GopherCon keynote: The Go community needs to
419 | exist outside of Google, doing things that Google isn’t doing.
420 |
421 | We need diversity. We need different ideas, different backgrounds. This is
422 | how we become a "technical cultural mosaic", where our differences contribute
423 | to the whole, instead of dividing it. We’re all exploring still. We don’t yet
424 | know the best way to do everything. Our current ideals are really “This is the
425 | best way we’ve found to do it so far”.
426 |
427 | Many of these "best ways" reflect the experience of the Go team, but as we’ve seen
428 | with vendoring, living inside the Google Bubble doesn’t always produce solutions
429 | that satisfy those outside.
430 |
431 | ---
432 |
433 | ##
434 |
435 |
436 | .center[]
437 |
438 | ???
439 |
440 | It’s always important to understand *why* we do these things. Otherwise you
441 | end up doing things out of habit, even though they don’t make sense any more.
442 | For example, buying a roast from the butcher, and cutting off the small end
443 | before putting it into the roasting pan to cook. “My mother always did it this
444 | way.” One day you call your mom and you find out it’s because her roasting pan
445 | was too small.
446 |
447 | Different people have different needs.
448 |
449 | I have a bigger roasting pan than my mother.
450 |
451 | ---
452 |
453 | .center[
454 |
455 | ]
456 |
457 | ???
458 |
459 | Let’s take Martini. By any measure, it’s a very successful project: 7400
460 | github stars. Active community. Lots of contributors, both to martini itself
461 | and the ecosystem. Used in lots of projects. Yet many people look down on it
462 | and its users because it's "not idiomatic". They point to the blog post Jeremy
463 | Saenz wrote disclaiming his own project. But clearly there’s something there.
464 | Are we really going to tell people “No true Gopher uses Martini?” These people
465 | are trying to get stuff done. They have different priorities. We have to
466 | accept that there will not be the One True Way for people. There will be
467 | different … I don’t want to say “factions”, but sub-communities. Gopher towns.
468 | Some will want to follow the shoe, some will follow the gourd.
469 |
470 | People *are* going to be building packages that are “non-idiomatic”. And the
471 | success of Martini led people to build more frameworks that achieved similar
472 | things, but with less magic, that spoke with a more Mountain View accent. And
473 | our community is stronger for it.
474 |
475 | So we can treat things like gofmt and error checking as oaths in the
476 | citizenship ceremony. Or perhaps, less formally, as shibboleths. But there’s
477 | still a lot of leeway for people to experiment with, and Go’s success depends
478 | on them doing so.
479 |
480 | ---
481 | class: center, middle, inverse
482 |
483 | ## what I do
484 |
485 | ???
486 |
487 | So, how do you actually improve Go and turn it from a group of programmers into
488 | a community?
489 |
490 | I’m going to talk about some of the small things that *I’m* responsible for,
491 | and why I chose to do them. Each time it was "here's something I can do to
492 | make the world a little bit better for other Gophers."
493 |
494 | ---
495 |
496 | ## What I do
497 |
498 | --
499 |
500 | - Reddit
501 |
502 | ???
503 |
504 | Reddit. At this point, I’m almost certainly the top submitter to /r/golang.
505 | Why did I start? Internet points. Not the most altruistic reason, I agree.
506 | But why keep going? I always read a lot of news sites, and I wanted /r/golang to be populated with articles too.
507 | And since I was seeing them anyway (apparently before lots of other people),
508 | it was easy to drop it into the submit form.
509 |
510 | An active subreddit is valuable because it makes it clear the community is
511 | *doing* things. If you visit a subreddit that has one post only every few
512 | days, the watering hole is dry. People aren’t going to come back if there’s
513 | nothing to read. By filling it with content, people come back. They realize
514 | that 1) there’s stuff happening (the content itself) and 2) there’s a community
515 | of gophers who are talking about it. A subreddit filled with content satisfies
516 | both of the reasons people join communities.
517 |
518 | --
519 |
520 | - gophervids.appspot.com
521 |
522 | ???
523 |
524 | Gophervids:
525 | This is the clearest example of me going out of my way to do something to
526 | improve the community. Except that I didn’t want to do it.
527 |
528 | Gophervids is a website (built with bootstrap + angular and well beyond my
529 | front-end capabilities) that at the moment has more than 350 Go talks tagged by
530 | topic and speaker. I wanted a site like this to exist, and the easiest way to
531 | make that happen was to create it myself. I had always hoped that somebody
532 | else who actually knew how to build websites would come along and fix it up for
533 | me. And some people have contributed some CSS cleanups or javascript tweaks.
534 | But mostly it’s me updating a json file by hand once in a while.
535 |
536 | However, to some extent, much like my reddit submissions, this was taking
537 | something I was doing anyway: scanning youtube for Go videos, and helping make
538 | that information more widely available in a more useful form.
539 |
540 | If you want to watch videos on concurrency, or testing, or optimization, or all
541 | of Andrew's 30+ talks, you can watch them here.
542 |
543 | --
544 |
545 | - \#golang
546 |
547 | ???
548 |
549 | Twitter: I answer questions on the #golang hashtag. I’m a helpful person.
550 |
551 | Lots of people tweet questions, problems, whatever, tagged with #golang.
552 | Shouting into the wind, basically. Most of the time these are beginners who
553 | need help with something, or people looking for packages but they haven’t yet
554 | found godoc’s search feature. Whatever.
555 |
556 | Surprisingly, I’ve also spent a lot of time suggesting to people that they
557 | *don’t* switch to Go. Or rather, trying to educate them and set expectations
558 | about what they’re getting into. Why? I don’t want people coming in unhappy.
559 | In some senses, this actually is selfish too. If they’re unhappy, they’ll
560 | write angry blog posts that make it to the top of Hacker News and start another
561 | 200 comment discussion on why Go is a terrible language. Nobody wins here.
562 |
563 | In 2012, Rob Pike wrote what has become one of our founding documents:
564 | "Language Design in the Service of Software Engineering." In it, he delves into
565 | the reasons behind some of the design decisions in Go. I’ve probably sent this
566 | talk out 100 times to people asking on Twitter if they should learn Go. People
567 | have many different use cases for their language, and it’s important that they
568 | understand where Go is coming from. It might be trying to solve problems they
569 | don’t have -- millions of lines of code, thousands of engineers -- in which
570 | case some of the decisions might not make sense for their kind of work.
571 |
572 | People have different beliefs about what a programming language should provide.
573 | And then you can find out what languages match your beliefs, ranging from heavily
574 | dynamic "liberal" languages (to use Steve Yegge's terminology) to extremely
575 | "conservative" languages like Haskell. A lot of flame wars on the internet come
576 | from people who disagree about the Right Way to do things. When somebody
577 | coming from node.js or rails says “I want to learn Go”, I send them Rob Pike’s
578 | talk. Then they can decide “Is this the language for me”. And that *is*
579 | valuable.
580 |
581 | ---
582 | class: center, middle, inverse
583 |
584 | ## what you can do
585 |
586 | ???
587 |
588 | The last section to a good keynote is a call to action. The A/B-tested “buy
589 | now” button next to the shopping cart.
590 |
591 | And here we are.
592 |
593 | ---
594 |
595 | ## What you can do
596 |
597 | - Grow the community
598 |
599 | ???
600 |
600 | Let’s define our goals. Go has a large community, and there is no evidence that
602 | its growth will slow anytime soon. Lots of benefits here: more packages, more
603 | edge cases hit, more resources in general. More user groups. More blog posts.
604 | More conferences. However, we want to avoid living in an echo chamber. As
605 | environments and priorities change, we want to be able to recognize that and
606 | take a critical look at our beliefs.
607 |
608 | --
609 |
610 | - Write Code
611 |
612 | ???
613 |
614 | Write Code.
615 |
616 | This for me is important for two reasons. The first is kind of selfish. I enjoy
617 | writing code. I write a lot of code. But I also enjoy having the code I need
618 | already written. Perl is still a very powerful force, based almost entirely on
618 | the range of packages available on CPAN. Python, too, has a huge number of
620 | packages. Separate out internal business logic from the useful code. Push your
621 | code to GitHub. Write docs and get it listed on godoc.org so others can find
622 | it. Not just interesting algorithms and papers, but parsers for obscure file
623 | formats. API clients for web services.
624 |
624 | Second, we learn about Go, and how to use it, by writing actual code. Much of
626 | what we know about the “best way” to write Go code comes from Google. But
627 | that’s only because they already have millions of lines of Go. Millions of
628 | lines of Go code that interacts with their systems, their infrastructure, their
629 | workflows. They’ve well explored the areas of Go that are important to them.
630 | But what about the areas that are important to you? There’s lots of ground to
631 | cover before we can say that we really “understand” Go.
632 |
633 | We’ve heard a number of times “The language is done.” Now we need to see how
634 | it works, to see how people use it, to see what people build.
635 |
636 | --
637 |
638 | - Write content
639 |
640 | ???
641 |
642 | As I mentioned, there are two reasons people join online communities. To find
643 | like-minded people, and to find great content. Right now, a lot of Go content
644 | is "I wrote Hello World in Go and this is how it compares to Haskell". We
645 | need more about Real World Go. Content written by Gophers. Not just for other
646 | Gophers, but a wider audience. Share your success stories with the world, not
647 | just your neighbours. And share your failures too, because Go is not the right
648 | solution for every problem. One way we improve Go is by knowing where it
649 | doesn't work. Maybe it's just missing documentation or maybe it's actually missing
650 | libraries.
651 |
652 | --
653 |
654 | - Bug reports
655 |
656 | ???
657 |
658 | As a developer, listen to bug reports. Be patient. Remember that not every bug
659 | report you get will be written by a native English speaker, or will even be in
660 | English, or will be written by someone who knows how to write a good bug
661 | report.
662 |
663 | File bugs yourself, even if you don’t have a fix. A known documented bug is better than
664 | an undocumented one.
665 |
666 | Leave code better than you found it. If you’re going to use a library, and "go
667 | vet" or golint flag some issues, fix them and file a PR. Even if it's just
668 | documenting the exported symbols.
669 |
670 | --
671 |
672 | - Run tip, beta, rc
673 |
674 | ???
675 |
676 | Finally, run the betas and the release candidates. We need more testers with
677 | odd use cases. Go 1.5 was released on Wednesday. Thursday somebody found a
678 | code generation bug. The way we keep upgrades painless is by having the
679 | pre-releases exercised with lots of different code bases, with lots of
680 | different server configurations, lots of different production workloads. The
681 | more bugs we can find before the final release version is tagged, the fewer bugs
682 | real users have to wait months to see fixed in a point release -- real users who
683 | might have put their career on the line that Go was the right tool for the job.
684 |
685 | ---
686 |
687 | class: center, middle, inverse
688 |
689 | ## Be Involved
690 |
691 | ???
692 |
693 | Monday while I was busy procrastinating on this talk, I saw a small library I
694 | thought would make a great example for how to use Dmitry Vyukov’s go-fuzz tool
695 | (which you should all be using, by the way.) I was hoping somebody else would
696 | do it, but in the end signed up to Medium and banged it out myself. Sometimes the
697 | best way to get something done is to do it yourself. And so far I’ve found
698 | that’s the way with building community.
699 |
700 | This all really boils down to “Be Involved”. Whenever you think “I need this
701 | thing”, somebody else does too. You can make it happen. Even if it's just
702 | starting with markdown on a GitHub gist, you can help build things. Small
703 | steps.
704 |
705 | And that’s what I want to leave you with. I just “fell into” doing community
706 | stuff. Doing small things I thought would be helpful to other people. And you
707 | can too. Small steps. Make a difference. Help people grow. Respect our
708 | differences. Stay humble.
709 |
710 | ---
711 | class: center, middle, inverse
712 |
713 | ## fin
714 |
715 | ???
716 |
717 | Thank you.
718 |
719 | vim: ft=markdown
720 |
--------------------------------------------------------------------------------
/golanguk-2015/unknown-gopher-question.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/unknown-gopher-question.png
--------------------------------------------------------------------------------
/golanguk-2015/unknown-gopher.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/golanguk-2015/unknown-gopher.png
--------------------------------------------------------------------------------
/remarkjs/DroidSerif.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/remarkjs/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/streaming/Bloom_filter.svg:
--------------------------------------------------------------------------------
1 |
2 |
3 |
111 |
--------------------------------------------------------------------------------
/streaming/DroidSerif.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/streaming/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/streaming/count-min-sketch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dgryski/talks/8ae60cd70cb355b681acd93c4df0762cec6198da/streaming/count-min-sketch.png
--------------------------------------------------------------------------------
/streaming/remark-latest.min.js:
--------------------------------------------------------------------------------
1 | ../remarkjs/remark-latest.min.js
--------------------------------------------------------------------------------
/streaming/slides.css:
--------------------------------------------------------------------------------
1 | /* Slideshow styles */
2 |
3 | @font-face {
4 | font-family: 'Droid Serif';
5 | font-style: normal;
6 | font-weight: 400;
7 | src: local('Droid Serif'), local('DroidSerif'), url(DroidSerif.ttf) format('truetype');
8 | }
9 |
10 |
11 | @font-face {
12 | font-family: 'Yanone Kaffeesatz';
13 | font-style: normal;
14 | font-weight: 400;
15 | src: local('Yanone Kaffeesatz Regular'), local('YanoneKaffeesatz-Regular'), url(YanoneKaffeesatz-Regular.ttf) format('truetype');
16 | }
17 |
18 | body { font-family: 'Droid Serif'; font-size: 1.5em; }
19 | h1, h2, h3 {
20 | font-family: 'Yanone Kaffeesatz';
21 | font-weight: 400;
22 | margin-bottom: 0;
23 | }
24 | h1 { font-size: 3em; }
25 | h2 { font-size: 2em; }
26 | h3 { font-size: 1.5em; }
27 | .footnote {
28 | position: absolute;
29 | bottom: 3em;
30 | }
31 | li p { line-height: 1.25em; }
32 | .red { color: #fa0000; }
33 | .large { font-size: 2em; }
34 | a, a > code {
35 | color: rgb(249, 38, 114);
36 | text-decoration: none;
37 | }
38 | .inverse {
39 | background: #272822;
40 | color: #777872;
41 | text-shadow: 0 0 20px #333;
42 | }
43 | .inverse h1, .inverse h2 {
44 | color: #f3f3f3;
45 | line-height: 0.8em;
46 | }
47 |
--------------------------------------------------------------------------------
/streaming/slides.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | Probabilistic Data Structures
5 |
6 |
8 |
9 |
10 |
11 |
12 |
20 |
21 |
22 |
--------------------------------------------------------------------------------
/streaming/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## Streaming Algorithms and Approximate Data Structures
4 |
5 | ---
6 | ## Overview
7 |
8 | - Introduction
9 |
10 | --
11 |
12 | - Bloom Filters
13 |
14 | --
15 |
16 | - Count-Min Sketch
17 |
18 | --
19 |
20 | - Reading
21 |
22 | ---
23 |
24 | ## Intro
25 |
26 | - Many of these problems have exact solutions if you have enough memory and/or CPU
27 |
28 | --
29 |
30 | - Reduce CPU and memory by approximating
31 |
32 | --
33 |
34 | - I will skim the math
35 |
36 | ---
37 |
38 | ## Terms you should know
39 |
40 | --
41 |
42 | - Cash Register Model
43 | - things go in
44 | - x z x x y w z x y x
45 |
46 | --
47 |
48 | - Turnstile model
49 | - things go in and out
50 | - [x,1], [z,1], [x,2], [y,1], [w,1], [z,-2], [x,-1], [y,-1], [x,-1]
51 |
52 | --
53 |
54 | - epsilon/delta
55 | - answer is within epsilon of true value, with probability of failure delta
56 |
57 | --
58 |
59 | - k independent hash functions
60 | - murmur3 hash function with random seed
61 | - other tricks work too: h(x,i) = h1(x) + i * h2(x)
62 | - universal hashing
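
---

The `h(x,i) = h1(x) + i * h2(x)` trick can be sketched roughly as follows. This is only an illustration: FNV-1a from the standard library stands in for murmur3, and the names `baseHashes` and `indexes` are made up for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// baseHashes derives two 32-bit values from a single 64-bit FNV-1a hash.
// FNV-1a is an assumption made for this sketch; the slide suggests murmur3
// with a random seed.
func baseHashes(key string) (h1, h2 uint32) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	return uint32(sum), uint32(sum >> 32)
}

// indexes returns k positions in [0, m) computed as h1(x) + i*h2(x).
func indexes(key string, k, m uint32) []uint32 {
	h1, h2 := baseHashes(key)
	h2 |= 1 // keep the stride odd so it is never zero
	out := make([]uint32, k)
	for i := uint32(0); i < k; i++ {
		out[i] = (h1 + i*h2) % m // uint32 arithmetic wraps on overflow
	}
	return out
}

func main() {
	// Five positions in a table of 1024 slots for one key.
	fmt.Println(indexes("gopher", 5, 1024))
}
```

Only one real hash computation happens per key; the remaining `k-1` values are cheap arithmetic on the two halves.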
63 |
64 | ---
65 |
66 | ## What I'm not going to talk about
67 |
68 | - HyperLogLog
69 |
70 | - TopK
71 |
72 | - Streaming Quantiles
73 |
74 | ---
75 |
76 | ## Bloom Filters
77 |
78 | - B.H. Bloom, 1970
79 |
80 | --
81 |
82 | - Approximate Set
83 |
84 | --
85 |
86 | - "No I haven't seen this" or "Yes, I've probably seen this"
87 |
88 | --
89 |
90 | - Cache heavy queries: disk or network lookup
91 | - BigTable/Cassandra
92 | - Chrome's "malicious URL" check
93 | - Bloom join
94 | - "Network Applications of Bloom Filters: A Survey" (Broder, Mitzenmacher 2005)
95 |
96 | ---
97 |
98 | ### How would we build this?
99 |
100 | - One hash table of bits
101 |
102 | - One hash function
103 |
104 | --
105 |
106 | - Do this 'k' times
107 |
108 | --
109 |
110 | - ~10 bits per element + 5 hash functions give you <1% false positive rate
111 |
112 | --
113 | - "Given n elements and false-positive rate p, how many bits do I need?"
114 | - m = (-n ln(p)) / ( ln(2)^2 )
115 |
116 | --
117 |
118 | - "Given m bits of storage and n elements, how many hash functions do I need?"
119 | - k = m/n * ln(2)
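
---

A minimal sketch of the construction above, with the two sizing formulas folded into the constructor. The type and method names (`Bloom`, `NewBloom`, `Add`, `Has`) are hypothetical, and FNV-1a with double hashing is an assumption standing in for `k` independent hash functions.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// Bloom is a minimal Bloom filter: m bits, probed at k positions per key.
type Bloom struct {
	bits []uint64
	m, k uint64
}

// NewBloom sizes the filter for n expected elements and false-positive rate p
// using m = -n ln(p) / ln(2)^2 and k = m/n * ln(2).
func NewBloom(n int, p float64) *Bloom {
	m := uint64(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
	k := uint64(math.Round(float64(m) / float64(n) * math.Ln2))
	if k < 1 {
		k = 1
	}
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes returns two base hash values for key. Hashing the key and then the
// key plus one extra byte is an illustrative choice, not a recommendation.
func (b *Bloom) hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	h1 := h.Sum64()
	h.Write([]byte{0xff})
	h2 := h.Sum64() | 1 // odd stride, never zero
	return h1, h2
}

// Add sets the k bits selected for key.
func (b *Bloom) Add(key string) {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// Has reports false if key was definitely never added, and true if it was
// probably added.
func (b *Bloom) Has(key string) bool {
	h1, h2 := b.hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	bf := NewBloom(1000, 0.01)
	bf.Add("gopher")
	fmt.Println(bf.m, bf.k)            // bits and hash count chosen by the formulas
	fmt.Println(bf.Has("gopher"))      // added keys are always reported as present
	fmt.Println(bf.Has("never-added")) // usually false; true would be a false positive
}
```

For n = 1000 and p = 0.01 the formulas give a little under 10 bits per element, in line with the rule of thumb above.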
120 |
121 | ---
122 | class: center, middle
123 |
124 | ### Wikipedia
125 |
126 | 
127 |
128 | ---
129 |
130 | ### Bloom Filters (cont)
131 |
132 | --
133 |
134 | - "What have I put in the set?"
135 |
136 | --
137 |
138 | - "How many items have I put in the set?"
139 |
140 | --
141 |
142 | - "How do I remove elements from the set?"
143 |
144 | --
145 |
146 | - "How many items can I put in the set?"
147 |
148 | --
149 |
150 | - Union, Intersection, Halving
151 |
152 | ---
153 |
154 | ### Count-Min Sketch
155 |
156 | --
157 |
158 | - Approximate Frequencies
159 | - "How many times have I seen this?"
160 |
161 | --
162 |
163 | - "How would we build this?"
164 | - one hash table of counters
165 | - one hash function
166 | - "do it k times"
167 |
168 | --
169 |
170 | - "How do we query this?"
171 | - collisions mean the buckets are biased estimators
172 | - but they're all upper bounds
173 | - take the minimum across all the buckets
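
---

The build and query steps above might look something like this in Go. It is a sketch under assumptions: `Sketch`, `NewSketch`, `Add` and `Count` are hypothetical names, and seeding FNV-1a with the row number stands in for truly independent hash functions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Sketch is a minimal Count-Min Sketch: d rows of w counters each.
type Sketch struct {
	w, d   uint64
	counts [][]uint32
}

// NewSketch allocates a sketch with d rows and w counters per row.
func NewSketch(w, d uint64) *Sketch {
	rows := make([][]uint32, d)
	for i := range rows {
		rows[i] = make([]uint32, w)
	}
	return &Sketch{w: w, d: d, counts: rows}
}

// bucket picks the counter for key in the given row. Prefixing the key with
// the row number is an illustrative way to get one hash function per row.
func (s *Sketch) bucket(key string, row uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(key))
	return h.Sum64() % s.w
}

// Add increments one counter per row ("do it k times").
func (s *Sketch) Add(key string, n uint32) {
	for r := uint64(0); r < s.d; r++ {
		s.counts[r][s.bucket(key, r)] += n
	}
}

// Count returns the minimum across rows. Collisions can only inflate a
// counter, so every row is an upper bound and the minimum is the tightest.
func (s *Sketch) Count(key string) uint32 {
	est := ^uint32(0)
	for r := uint64(0); r < s.d; r++ {
		if c := s.counts[r][s.bucket(key, r)]; c < est {
			est = c
		}
	}
	return est
}

func main() {
	cms := NewSketch(2719, 7)
	for _, key := range []string{"x", "z", "x", "x", "y", "w", "z", "x", "y", "x"} {
		cms.Add(key, 1)
	}
	// The estimate for "x" is at least its true count of 5.
	fmt.Println(cms.Count("x"))
}
```

The estimate never undercounts in the cash register model; how far it can overcount depends on the width and depth chosen.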
174 |
175 | ---
176 | class: center, middle
177 |
178 | 
179 |
180 | w = ceil(e/epsilon)
181 |
182 | d = ceil(ln(1/delta))
183 |
184 | estimate <= count + eps * N, with probability at least 1 - delta
185 |
186 | eps = 0.001, delta = 0.001 => w=2719, d=7, 32-bit counters is about 74 KiB of space
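
---

As a quick check of the sizing arithmetic, here is a small sketch that evaluates the two formulas, reading the numerator of the width as Euler's number e. The function name `dims` is made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// dims returns the Count-Min Sketch width and depth for error bound eps and
// failure probability delta: w = ceil(e/eps), d = ceil(ln(1/delta)).
func dims(eps, delta float64) (w, d int) {
	w = int(math.Ceil(math.E / eps))
	d = int(math.Ceil(math.Log(1 / delta)))
	return w, d
}

func main() {
	w, d := dims(0.001, 0.001)
	// Width, depth, and total bytes with 4-byte (32-bit) counters.
	fmt.Println(w, d, w*d*4)
}
```

With eps = delta = 0.001 this gives w = 2719 and d = 7, so 19033 counters, or 76132 bytes with 32-bit counters.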
187 |
188 |
189 | ---
190 |
191 | ### Count-Min Sketch (cont)
192 |
193 | --
194 |
195 | - "What have I put in the set?"
196 |
197 | --
198 |
199 | - "How many items have I put in the set?"
200 |
201 | --
202 |
203 | - "How do I remove elements from the set?"
204 |
205 | --
206 |
207 | - Union, Intersection, Halving
208 |
209 | ---
210 |
211 | ## Count-Min Sketch: Applications
212 |
213 | - Count tracking on large data sets
214 |
215 | - Heavy Hitters
216 | - Elephants from Mice
217 | - Count-min sketch + heap
218 | - Will see another (simpler, magic) algorithm later
219 |
220 | - Variations:
221 | - better low-frequency estimates (CountMeanMin),
222 | - (but larger under-estimation error for large items)
223 | - better cash-register estimates (Conservative Update)
224 | - (but deleting isn't allowed)
225 |
226 | - Many many others
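
---

One way the "Count-min sketch + heap" idea for heavy hitters could be put together: the sketch estimates counts, and a min-heap of size k keeps the current top candidates. Everything named here (`TopK`, `NewTopK`, the `width` and `depth` constants) is hypothetical, FNV-1a seeded with the row number is an assumption, and `k` is assumed to be at least 1.

```go
package main

import (
	"container/heap"
	"fmt"
	"hash/fnv"
)

const (
	width = 512 // counters per row (illustrative)
	depth = 4   // number of rows (illustrative)
)

// sketch is a fixed-size Count-Min Sketch.
type sketch [depth][width]uint32

// add increments key in every row and returns the new estimate (row minimum).
func (s *sketch) add(key string) uint32 {
	est := ^uint32(0)
	for r := 0; r < depth; r++ {
		h := fnv.New64a()
		h.Write([]byte{byte(r)}) // row number as seed; a stand-in for independent hashes
		h.Write([]byte(key))
		b := h.Sum64() % width
		s[r][b]++
		if s[r][b] < est {
			est = s[r][b]
		}
	}
	return est
}

// entry is one tracked candidate; index is its position in the heap.
type entry struct {
	key   string
	count uint32
	index int
}

// minHeap orders candidates by estimated count, smallest at the root.
type minHeap []*entry

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}
func (h *minHeap) Push(x interface{}) {
	e := x.(*entry)
	e.index = len(*h)
	*h = append(*h, e)
}
func (h *minHeap) Pop() interface{} {
	old := *h
	e := old[len(old)-1]
	*h = old[:len(old)-1]
	return e
}

// TopK tracks the k keys with the largest estimated counts (k >= 1).
type TopK struct {
	k     int
	sk    sketch
	heap  minHeap
	items map[string]*entry
}

func NewTopK(k int) *TopK { return &TopK{k: k, items: make(map[string]*entry)} }

// Add records one occurrence of key and updates the candidate heap.
func (t *TopK) Add(key string) {
	est := t.sk.add(key)
	if e, ok := t.items[key]; ok { // already tracked: refresh its count
		e.count = est
		heap.Fix(&t.heap, e.index)
		return
	}
	e := &entry{key: key, count: est}
	if len(t.heap) < t.k { // room left: just track it
		heap.Push(&t.heap, e)
		t.items[key] = e
		return
	}
	if est > t.heap[0].count { // beats the weakest candidate: replace it
		delete(t.items, t.heap[0].key)
		t.heap[0] = e // e.index is already 0
		t.items[key] = e
		heap.Fix(&t.heap, 0)
	}
}

func main() {
	tk := NewTopK(2)
	for _, key := range []string{"x", "z", "x", "x", "y", "w", "z", "x", "y", "x"} {
		tk.Add(key)
	}
	for _, e := range tk.heap {
		fmt.Println(e.key, e.count)
	}
}
```

Memory stays bounded by the sketch plus k heap entries, no matter how many distinct keys appear in the stream.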
227 |
228 | ---
229 |
230 | ### More Stuff I didn't talk about
231 |
232 | - Multi-pass streaming algorithms
233 |
234 | - Sliding windows
235 |
236 | ---
237 |
238 | ### Links
239 |
240 | - https://github.com/dgryski
241 | - libcmsketch (p5-cmsketch)
242 | - hokusai
243 |
244 | - http://research.neustar.biz/
245 |
246 | - https://gist.github.com/debasishg/8172796
247 |
248 | - https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
249 |
250 | ---
251 |
252 | class: center, middle, inverse
253 |
254 | ## Questions?
255 |
256 | ---
257 |
258 | class: center, middle, inverse
259 |
260 | ## fin
261 |
262 | ???
263 |
264 | vim: ft=markdown
265 |
--------------------------------------------------------------------------------
/streaming2/DroidSerif.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/DroidSerif.ttf
--------------------------------------------------------------------------------
/streaming2/YanoneKaffeesatz-Regular.ttf:
--------------------------------------------------------------------------------
1 | ../remarkjs/YanoneKaffeesatz-Regular.ttf
--------------------------------------------------------------------------------
/streaming2/remark-latest.min.js:
--------------------------------------------------------------------------------
1 | ../remarkjs/remark-latest.min.js
--------------------------------------------------------------------------------
/streaming2/slides.css:
--------------------------------------------------------------------------------
1 | /* Slideshow styles */
2 |
3 | @font-face {
4 | font-family: 'Droid Serif';
5 | font-style: normal;
6 | font-weight: 400;
7 | src: local('Droid Serif'), local('DroidSerif'), url(DroidSerif.ttf) format('truetype');
8 | }
9 |
10 |
11 | @font-face {
12 | font-family: 'Yanone Kaffeesatz';
13 | font-style: normal;
14 | font-weight: 400;
15 | src: local('Yanone Kaffeesatz Regular'), local('YanoneKaffeesatz-Regular'), url(YanoneKaffeesatz-Regular.ttf) format('truetype');
16 | }
17 |
18 | body { font-family: 'Droid Serif'; font-size: 1.5em; }
19 | h1, h2, h3 {
20 | font-family: 'Yanone Kaffeesatz';
21 | font-weight: 400;
22 | margin-bottom: 0;
23 | }
24 | h1 { font-size: 3em; }
25 | h2 { font-size: 2em; }
26 | h3 { font-size: 1.5em; }
27 | .footnote {
28 | position: absolute;
29 | bottom: 3em;
30 | }
31 | li p { line-height: 1.25em; }
32 | .red { color: #fa0000; }
33 | .large { font-size: 2em; }
34 | a, a > code {
35 | color: rgb(249, 38, 114);
36 | text-decoration: none;
37 | }
38 | .inverse {
39 | background: #272822;
40 | color: #777872;
41 | text-shadow: 0 0 20px #333;
42 | }
43 | .inverse h1, .inverse h2 {
44 | color: #f3f3f3;
45 | line-height: 0.8em;
46 | }
47 |
--------------------------------------------------------------------------------
/streaming2/slides.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | Probabilistic Data Structures
5 |
6 |
8 |
9 |
10 |
11 |
12 |
20 |
21 |
22 |
--------------------------------------------------------------------------------
/streaming2/slides.md:
--------------------------------------------------------------------------------
1 | class: center, middle, inverse
2 |
3 | ## Streaming Algorithms and Approximate Data Structures
4 |
5 | ## Part II: Heavy Hitters and Cardinality Estimation
6 |
7 | ---
8 | ## Overview
9 |
10 | - TopK
11 |
12 | - HyperLogLog
13 |
14 | ---
15 |
16 | ## TopK
17 |
18 | - What are the top k elements in the stream (for some constant k)?
19 |
20 | --
21 |
22 | - Heavy Hitters
23 | - which elements in the stream have frequency > phi * N, for some constant phi?
24 |
25 | - no elements with frequency < (phi - eps) * N
26 |
27 | - difference between estimated and true frequency is <= eps * N
28 |
29 | ---
30 |
31 | ## TopK Algorithms
32 |
33 | - Count-min sketch + heap
34 |
35 | - Sampling: O(1/eps^2)
36 |
37 | - "Space Saving" (2005)
38 |
39 | - "Filtered Space Saving" (2010)
40 |
41 | ---
42 |
43 | ## Space Saving
44 |
45 | --
46 |
47 | - Set k=ceil(1/eps)
48 |
49 | --
50 |
51 | - Keep exact (key, count) pairs for first k elements
52 |
53 | --
54 |
55 | - When a new element arrives, if it's not being tracked, remove the least frequent item and replace it ...
56 |
57 | --
58 |
59 | - ... with (newkey, *oldcount+1*)
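
The steps above can be sketched in Go roughly as follows. This is an illustrative version, not the code from the paper: it finds the minimum with a linear scan (real implementations use a heap or the "stream summary" structure), breaks ties by key so the result is deterministic, and also records the inherited count as a per-key error bound. The type and method names are made up.

```go
package main

import "fmt"

// SpaceSaving (illustrative) tracks at most k (key, count) pairs.
type SpaceSaving struct {
	k      int
	counts map[string]int // estimated count, never below the true count
	errs   map[string]int // count inherited from the evicted key
}

func NewSpaceSaving(k int) *SpaceSaving {
	return &SpaceSaving{k: k, counts: map[string]int{}, errs: map[string]int{}}
}

func (s *SpaceSaving) Add(key string) {
	if _, ok := s.counts[key]; ok {
		s.counts[key]++ // already tracked: exact increment
		return
	}
	if len(s.counts) < s.k {
		s.counts[key] = 1 // still room: start an exact counter
		return
	}
	// Evict the least frequent key (ties broken by smallest key).
	minKey, minCount := "", -1
	for k, c := range s.counts {
		if minCount < 0 || c < minCount || (c == minCount && k < minKey) {
			minKey, minCount = k, c
		}
	}
	delete(s.counts, minKey)
	delete(s.errs, minKey)
	// The new key takes over the old count plus one.
	s.counts[key] = minCount + 1
	s.errs[key] = minCount
}

func main() {
	s := NewSpaceSaving(3)
	for _, k := range []string{"a", "a", "a", "a", "a", "b", "b", "b", "c", "d", "e"} {
		s.Add(k)
	}
	for k, c := range s.counts {
		fmt.Printf("%s: count=%d err=%d\n", k, c, s.errs[k])
	}
}
```

Each reported count overestimates the true count by at most the stored error, which is why inheriting `oldcount+1` is safe for finding heavy hitters.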
60 |
61 | ---
62 |
63 | ## Filtered Space Saving
64 |
65 | - Better estimates the error associated with each value
66 |
67 | --
68 |
69 | - SS: (newkey, oldcount+1, error=oldcount)
70 |
71 | --
72 |
73 | - maintain CM sketch with d=1, w=6k for estimates
74 |
75 | - FSS: (newkey, hash[newkey]+1, error=hash[newkey])
76 |
77 | ---
78 |
79 | ## HyperLogLog
80 |
81 | - Cardinality: how many distinct items have I seen?
82 |
83 | - "how many leading 0s did I see?"
84 |
85 | - longest run seen is n leading zero bits => cardinality ~2^n
86 |
87 | --
88 |
89 | - "Do this k times" (or, k times, estimate over 1/kth of the stream)
90 |
91 | --
92 |
93 | - We don't need k hash functions
94 | - value only needs to count leading zeros == 5 bits
95 | - split a single 32-bit hash value into register + value
96 |
97 | - estimate = harmonic mean of registers * correction factor
98 |
99 | --
100 |
101 | - error is 1.04/sqrt(registers)
102 |
103 | - Redis uses 12k to store 16k registers => 0.81%
104 |
105 | ---
106 |
107 | ## Google: HyperLogLog++
108 |
109 | --
110 |
111 | - "Counting billions of items isn't cool. You know what's cool? Counting *trillions* of items."
112 |
113 | --
114 |
115 | - Brute force improved correction factors
116 |
117 | - space optimizations
118 |
119 | ---
120 |
121 | ## Reading
122 |
123 | SS:
124 | - https://icmi.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
125 | - http://www.l2f.inesc-id.pt/~fmmb/wiki/uploads/Work/misnis.ref0a.pdf
126 |
127 | HLL:
128 | - http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
129 | - http://research.google.com/pubs/pub40671.html
130 | - http://antirez.com/news/75
131 | - http://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/
132 | - http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
133 |
134 | ---
135 |
136 | class: center, middle, inverse
137 |
138 | ## Questions?
139 |
140 | ---
141 |
142 | class: center, middle, inverse
143 |
144 | ## fin
145 |
146 | ???
147 |
148 | vim: ft=markdown
149 |
--------------------------------------------------------------------------------