Understanding the Go execution tracer
Novosibirsk
24 Sep 2017
Dave Cheney
dave@cheney.net
http://dave.cheney.net/
@davecheney
* License and Materials
This presentation is a collaboration between [[https://twitter.com/davecheney][David Cheney]] and [[https://twitter.com/francesc][Francesc Campoy]].
This presentation is licensed under the [[https://creativecommons.org/licenses/by-sa/4.0/][Creative Commons Attribution-ShareAlike 4.0 International]] licence.
The materials for this presentation are available on GitHub:
.link https://github.com/davecheney/understanding-the-execution-tracer
The code samples are licensed by, and copyright of, their respective authors.
You are encouraged to remix, transform, or build upon the material, providing you give appropriate credit and distribute your contributions under the same license.
If you have suggestions or corrections to this presentation, please raise [[https://github.com/davecheney/understanding-the-execution-tracer/issues][an issue on the GitHub project]].
* Introduction
This is a presentation about the Go execution tracer.
The execution tracer was developed by [[https://github.com/dvyukov][Dmitry Vyukov]] for Go 1.5 and remained under-documented, and under-utilised, until last year.
Unlike sample based profiling, the execution tracer is integrated into the Go runtime, so it doesn't just know what a Go program is doing at a particular point in time, but also _why_.
* Agenda
This workshop is aimed at developers who are looking to improve the performance of, or diagnose performance issues in, their Go applications.
- What is execution tracing, why do we need it?
- Execution tracer basics.
- Diagnosis with the execution tracer.
- Conclusion.
After each section we'll have time for questions.
* One more thing ...
This is a workshop, not a lecture. It's 💯 to ask questions.
If you don't understand something, or think what you're hearing is incorrect, please speak up.
* What is the execution tracer, why do we need it?
* What is the execution tracer, why do we need it?
I think it's easiest to explain what the execution tracer does, and why it's important, by looking at a piece of code where the profiler, `go`tool`pprof`, performs poorly.
* mandelbrot.go (example)
The `examples/mandelbrot` directory contains a simple mandelbrot generator. This code is derived from [[https://github.com/campoy/mandelbrot][Francesc Campoy's mandelbrot package]].
cd examples/mandelbrot
go build && ./mandelbrot
If we build it, then run it, it generates something like this
.image examples/mandelbrot/mandelbrot.png _ 320
* How long does it take?
So, how long does this program take to generate a 1024 x 1024 pixel image?
The simplest way I know how to do this is to use something like `time(1)`.
% time ./mandelbrot
real 0m1.654s
user 0m1.630s
sys 0m0.015s
_Note:_ Don't use `time`go`run`mandelbrot.go` or you'll time how long it takes to _compile_ the program as well as run it.
* What is the program doing?
So, in this example the program took 1.6 seconds to generate the mandelbrot and write it to a png.
Is that good? Could we make it faster?
One way to answer that question would be to use Go's built in pprof support to profile the program.
Let's try that.
* Generating the profile
To generate a profile we need to either
1. Use the `runtime/pprof` package directly.
2. Use a wrapper like `github.com/pkg/profile` to automate this.
* Generating a profile with runtime/pprof
To show you that there's no magic, let's modify the program to write a CPU profile to `os.Stdout`.
.code examples/mandelbrot-runtime-pprof/mandelbrot.go /START OMIT/,/END OMIT/
By adding this code to the top of the `main` function, this program will write a profile to `os.Stdout`.
cd examples/mandelbrot-runtime-pprof
go run mandelbrot.go > cpu.pprof
_Note_: We can use `go`run` in this case because the cpu profile will only include the execution of `mandelbrot.go`, not its compilation.
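For reference, the snippet the `.code` directive above pulls in amounts to something like this (a sketch, not the verbatim example code):

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    // There's no magic: start a CPU profile writing to stdout, and stop
    // it when main returns.
    pprof.StartCPUProfile(os.Stdout)
    defer pprof.StopCPUProfile()

    // ... generate the mandelbrot and write the png ...
}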
* Generating a profile with github.com/pkg/profile
The previous slide showed a super cheap way to generate a profile, but it has a few problems.
- If you forget to redirect the output to a file then you'll blow up that terminal session. 😞 (hint: `reset(1)` is your friend)
- If you write anything else to `os.Stdout`, for example with `fmt.Println`, you'll corrupt the profile.
The recommended way to use `runtime/pprof` is to [[https://godoc.org/runtime/pprof#hdr-Profiling_a_Go_program][write the profile to a file]]. But then you have to make sure the profile is stopped, and the file closed, before your program exits, including if someone `^C`'s it.
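The documented file-based pattern looks roughly like this (a sketch; note it still doesn't handle `^C` for you, which is what makes the wrapper on the next slide handy):

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("cpu.pprof")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // ... the work you want to profile ...
}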
* Generating a profile with github.com/pkg/profile (cont.)
So, a few years ago I wrote a [[https://godoc.org/github.com/pkg/profile][package]] to take care of it.
.code examples/mandelbrot-pkg-profile/mandelbrot.go /START OMIT/,/END OMIT/
If we run this version, we get a profile written to the current working directory
go run mandelbrot.go
2017/09/17 12:22:06 profile: cpu profiling enabled, cpu.pprof
2017/09/17 12:22:08 profile: cpu profiling disabled, cpu.pprof
_Note_: Using `pkg/profile` is not mandatory, but it takes care of a lot of the boilerplate around collecting and recording traces, so we'll use it for the rest of this workshop.
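The pattern it enables is roughly a one-liner at the top of `main` (a sketch based on the package's documented API; the import path is `github.com/pkg/profile`):

package main

import "github.com/pkg/profile"

func main() {
    // Write cpu.pprof to the current directory and stop profiling
    // when main returns; by default the package also installs an
    // interrupt handler.
    defer profile.Start(profile.CPUProfile, profile.ProfilePath(".")).Stop()

    // ... generate the mandelbrot and write the png ...
}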
* Analysing the profile
Now we have a profile, we can use `go`tool`pprof` to analyse it.
% go tool pprof cpu.pprof
Type: cpu
Time: Sep 17, 2017 at 12:22pm (AEST)
Duration: 1.81s, Total samples = 1.53s (84.33%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
In this run we see that the program ran for 1.81 seconds (profiling adds a small overhead). We can also see that pprof only captured data for 1.53 seconds, as pprof is sample based, relying on the operating system's `SIGPROF` timer.
_Note_: Since Go 1.9 the `pprof` profile contains all the information you need to analyse it. You no longer need to also have the matching binary which produced the profile. 🎉
* Analysing the profile (cont.)
We can use pprof's `top` function to sort the functions recorded in the profile
(pprof) top
Showing nodes accounting for 1.53s, 100% of 1.53s total
Showing top 10 nodes out of 57
flat flat% sum% cum cum%
1s 65.36% 65.36% 1s 65.36% main.paint /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go (inline)
0.27s 17.65% 83.01% 1.27s 83.01% main.fillPixel /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go
0.13s 8.50% 91.50% 0.16s 10.46% runtime.mallocgc /Users/dfc/go/src/runtime/malloc.go
0.05s 3.27% 94.77% 0.05s 3.27% syscall.Syscall /Users/dfc/go/src/syscall/asm_darwin_amd64.s
0.02s 1.31% 96.08% 0.02s 1.31% runtime.memmove /Users/dfc/go/src/runtime/memmove_amd64.s
0.02s 1.31% 97.39% 0.02s 1.31% runtime.mmap /Users/dfc/go/src/runtime/sys_darwin_amd64.s
0.01s 0.65% 98.04% 0.07s 4.58% compress/flate.(*compressor).deflate /Users/dfc/go/src/compress/flate/deflate.go
0.01s 0.65% 98.69% 0.01s 0.65% compress/flate.matchLen /Users/dfc/go/src/compress/flate/deflate.go
0.01s 0.65% 99.35% 0.01s 0.65% image/png.filter /Users/dfc/go/src/image/png/writer.go
0.01s 0.65% 100% 0.01s 0.65% runtime.memclrNoHeapPointers /Users/dfc/go/src/runtime/memclr_amd64.s
We see that the [[examples/mandelbrot-pkg-profile/mandelbrot.go][`main.paint`]] function was on the CPU the most when pprof captured the stack.
* Analysing the profile (cont.)
Finding `main.paint` on the stack isn't a surprise, this is what the program does; it paints pixels. But what is causing `paint` to spend so much time? We can check that with the _cumulative_ flag to `top`.
(pprof) top --cum
Showing nodes accounting for 1270ms, 83.01% of 1530ms total
Showing top 10 nodes out of 57
flat flat% sum% cum cum%
0 0% 0% 1530ms 100% main.main /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go
0 0% 0% 1530ms 100% runtime.main /Users/dfc/go/src/runtime/proc.go
270ms 17.65% 17.65% 1270ms 83.01% main.fillPixel /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go
0 0% 17.65% 1270ms 83.01% main.seqFillImg /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go
1000ms 65.36% 83.01% 1000ms 65.36% main.paint /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go (inline)
0 0% 83.01% 260ms 16.99% image/png.(*Encoder).Encode /Users/dfc/go/src/image/png/writer.go
0 0% 83.01% 260ms 16.99% image/png.Encode /Users/dfc/go/src/image/png/writer.go
0 0% 83.01% 190ms 12.42% image/png.(*encoder).writeIDATs /Users/dfc/go/src/image/png/writer.go
0 0% 83.01% 190ms 12.42% image/png.(*encoder).writeImage /Users/dfc/go/src/image/png/writer.go
0 0% 83.01% 180ms 11.76% main.(*img).At /Users/dfc/devel/understanding-the-execution-tracer/examples/mandelbrot-pkg-profile/mandelbrot.go
This suggests that [[examples/mandelbrot-pkg-profile/mandelbrot.go][`main.fillPixel`]] is actually doing most of the work.
_Note_: You can also visualise the profile with the `web` command, which looks [[examples/mandelbrot-pkg-profile/cpu.svg][like this]].
* Tracing vs Profiling
Hopefully this example shows the limitations of profiling. Profiling told us what it saw: `fillPixel` was doing all the work, and it didn't look like there was much that could be done about that.
So now it's a good time to introduce the execution tracer which gives a different view of the same program.
* Using the execution tracer
Using the tracer is as simple as asking for a `profile.TraceProfile`, nothing else changes.
.code examples/mandelbrot-trace/mandelbrot.go /START OMIT/,/END OMIT/
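In other words, the only change from the profiling version is the option passed to `profile.Start`; roughly (a sketch, same assumptions as before):

package main

import "github.com/pkg/profile"

func main() {
    // Collect an execution trace instead of a CPU profile; trace.out is
    // written to the current directory when main returns.
    defer profile.Start(profile.TraceProfile, profile.ProfilePath(".")).Stop()

    // ... generate the mandelbrot ...
}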
When we run the program, we get a `trace.out` file in the current working directory.
% go build mandelbrot.go
% time ./mandelbrot
2017/09/17 13:19:10 profile: trace enabled, trace.out
2017/09/17 13:19:12 profile: trace disabled, trace.out
real 0m1.740s
user 0m1.707s
sys 0m0.020s
* Using the execution tracer (cont.)
Just like pprof, there is a tool in the `go` command to analyse the trace.
% go tool trace trace.out
2017/09/17 12:41:39 Parsing trace...
2017/09/17 12:41:40 Serializing trace...
2017/09/17 12:41:40 Splitting trace...
2017/09/17 12:41:40 Opening browser. Trace viewer is listening on http://127.0.0.1:57842
* Analysing the trace
We can see from the trace that the program is only using one CPU.
.code examples/mandelbrot-trace/mandelbrot.go /SEQSTART OMIT/,/SEQEND OMIT/
This isn't a surprise, by default `mandelbrot.go` calls `fillPixel` for each pixel in each row in sequence.
Once the image is painted, we see the execution switch to writing the `.png` file. This generates garbage on the heap, and so the trace changes at that point.
The trace profile offers timing resolution down to the _microsecond_ level. This is something you just can't get with external profiling.
* go tool trace
Before we go on, there are some things we should mention about how to use the trace tool.
- The tool uses the JavaScript debugging support built into Chrome. Trace profiles can only be viewed in Chrome, they won't work in Firefox, Safari, IE/Edge. Sorry.
- Because this is a Google product, it supports keyboard shortcuts; use `WASD` to navigate, use `?` to get a list.
- Viewing traces can take a *lot* of memory. Seriously, 4GB won't cut it, 8GB is probably the minimum, more is definitely better.
- If you've installed Go from an OS distribution like Fedora, the support files for the trace viewer may not be part of the main `golang` deb/rpm, they might be in some `-extra` package.
* Using more than one CPU
We saw from the previous trace that the program is running sequentially and not taking advantage of the other CPUs on this machine.
Mandelbrot generation is known as _embarrassingly_parallel_. Each pixel is independent of every other, so they could all be computed in parallel. Let's try that.
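One way to express that is to start a goroutine per pixel and wait for them all to finish; roughly this shape (a sketch against the example's `img` type and `fillPixel` helper; the function name is illustrative, not the verbatim code behind `-mode`px`):

func oneToOneFillImg(m *img) {
    var wg sync.WaitGroup
    for i, row := range m.m {
        for j := range row {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                fillPixel(m, i, j) // one goroutine per pixel
            }(i, j)
        }
    }
    wg.Wait()
}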
% go build mandelbrot.go
% time ./mandelbrot -mode px
2017/09/17 13:19:48 profile: trace enabled, trace.out
2017/09/17 13:19:50 profile: trace disabled, trace.out
real 0m1.764s
user 0m4.031s
sys 0m0.865s
So the runtime was basically the same. There was more user time, which makes sense, we were using all the CPUs, but the real (wall clock) time was about the same.
Let's look at the trace.
* Using more than one CPU (cont.)
As you can see this trace generated _much_ more data.
- It looks like lots of work is being done, but if you zoom right in, there are gaps. This is believed to be the scheduler.
- While we're using all four cores, because each `fillPixel` is a relatively small amount of work, we're spending a lot of time in scheduling overhead.
* Batching up work
Using one goroutine per pixel was too fine grained. There wasn't enough work to justify the cost of the goroutine.
Instead, let's try processing one row per goroutine.
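The shape of that change is roughly the following (a sketch; the example's real `onePerRowFillImg` may differ in detail):

func onePerRowFillImg(m *img) {
    var wg sync.WaitGroup
    for i := range m.m {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := range m.m[i] {
                fillPixel(m, i, j) // one goroutine fills a whole row
            }
        }(i)
    }
    wg.Wait()
}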
% go build mandelbrot.go
% time ./mandelbrot -mode row
2017/09/17 13:41:55 profile: trace enabled, trace.out
2017/09/17 13:41:55 profile: trace disabled, trace.out
real 0m0.764s
user 0m1.907s
sys 0m0.025s
This looks like a good improvement; we almost halved the runtime of the program. Let's look at the trace.
* Batching up work (cont.)
As you can see, the trace is now smaller and easier to work with. We get to see the whole trace in one view, which is a nice bonus.
- At the start of the program we see the number of goroutines ramp up to around 1,000. This is an improvement over the 1 << 20 that we saw in the previous trace.
- Zooming in we see `onePerRowFillImg` runs for longer, and as the goroutine _producing_ work is done early, the scheduler efficiently works through the remaining runnable goroutines.
* Using workers
`mandelbrot.go` supports one other mode, let's try it.
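This mode hands work to a fixed pool of worker goroutines over a channel; roughly this shape (a sketch; the example's real `nWorkersFillImg` may differ in detail):

func nWorkersFillImg(m *img, workers int) {
    type px struct{ i, j int }
    work := make(chan px) // note: unbuffered
    var wg sync.WaitGroup
    for n := 0; n < workers; n++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for p := range work {
                fillPixel(m, p.i, p.j)
            }
        }()
    }
    for i, row := range m.m {
        for j := range row {
            work <- px{i, j}
        }
    }
    close(work)
    wg.Wait()
}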
% go build mandelbrot.go
% time ./mandelbrot -mode workers
2017/09/17 13:49:46 profile: trace enabled, trace.out
2017/09/17 13:49:50 profile: trace disabled, trace.out
real 0m4.207s
user 0m4.459s
sys 0m1.284s
So, the runtime was much worse than any previous run. Let's look at the trace and see if we can figure out what happened.
* Using workers (cont.)
Looking at the trace you can see that with only one worker the producer and consumer tend to alternate, because there is only one of each. Let's increase the number of workers.
% go build mandelbrot.go
% time ./mandelbrot -mode workers -workers 4
2017/09/17 13:52:51 profile: trace enabled, trace.out
2017/09/17 13:52:57 profile: trace disabled, trace.out
real 0m5.528s
user 0m7.307s
sys 0m4.311s
So that made it worse! More real time, more CPU time. Let's look at the trace to see what happened.
* Using workers (cont.)
That trace was a mess. There were more workers available, but they seemed to spend all their time fighting over the work to do.
This is because the channel is unbuffered. An unbuffered channel cannot send until there is someone ready to receive.
- The producer cannot send work until there is a worker ready to receive it.
- Workers cannot receive work until there is someone ready to send, so they compete with each other when they are waiting.
- The sender is not privileged, it cannot take priority over a worker that is already running.
What we see here is a lot of latency introduced by the unbuffered channel. There are lots of stops and starts inside the scheduler, and potentially locks and mutexes while waiting for work, which is why we see the `sys` time higher.
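The difference between the two behaviours is just the channel's capacity. A tiny illustration (the element type and buffer size are arbitrary):

package main

func main() {
    // Buffered: the first 1024 sends succeed immediately, even if no
    // worker is ready to receive yet.
    buffered := make(chan int, 1024)
    buffered <- 1 // does not block

    // Unbuffered: a send can only complete when a receiver is ready,
    // so we start one first.
    unbuffered := make(chan int)
    go func() { <-unbuffered }()
    unbuffered <- 1 // blocks until the goroutine above receives
}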
* Using buffered channels
.code examples/mandelbrot-buffered/mandelbrot.go /BUFSTART OMIT/,/BUFEND OMIT/
% go build mandelbrot.go
% time ./mandelbrot -mode workers -workers 4
2017/09/17 14:23:56 profile: trace enabled, trace.out
2017/09/17 14:23:57 profile: trace disabled, trace.out
real 0m0.905s
user 0m2.150s
sys 0m0.121s
That's pretty close to the per row mode above.
* Using buffered channels (cont.)
Using a buffered channel the trace showed us that:
- The producer doesn't have to wait for a worker to arrive, it can fill up the channel quickly.
Using this method we got nearly the same speed handing off work per pixel over a channel as we did previously scheduling one goroutine per row.
_Exercise_: modify `nWorkersFillImg` to work per row. Time the result and analyse the trace.
* Break
* Mandelbrot microservice
It's 2017, generating Mandelbrots is pointless unless you can offer them on the internet as a web service.
Thus, I present to you, _Mandelweb_
% go run examples/mandelweb/mandelweb.go
2017/09/17 15:29:21 listening on http://127.0.0.1:8080/
.link http://127.0.0.1:8080/mandelbrot
* Tracing running applications
In the previous example we ran the trace over the whole program.
As you saw, traces can be very large, even for small amounts of time, so collecting trace data continually would generate far too much data. Also, tracing can have an impact on the speed of your program, especially if there is a lot of activity.
What we want is a way to collect a short trace from a running program.
Fortunately, the `net/http/pprof` package has just such a facility.
* Collecting traces via http
Hopefully everyone knows about the `net/http/pprof` package.
When imported
import _ "net/http/pprof"
It will register tracing and profiling routes with `http.DefaultServeMux`. Since Go 1.5 this includes the trace profiler.
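A minimal server that exposes these routes looks something like this (a sketch; mandelweb itself also registers its own handlers):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/... routes, including /debug/pprof/trace
)

func main() {
    log.Println("listening on http://127.0.0.1:8080/")
    log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}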
We can grab a five second trace from mandelweb with `curl` (or `wget`)
curl -o trace.out http://127.0.0.1:8080/debug/pprof/trace?seconds=5
* Generating some load
The previous example was interesting, but an idle webserver has, by definition, no performance issues. We need to generate some load. For this I'm using [[https://github.com/rakyll/hey][`hey` by JBD]].
go get -u github.com/rakyll/hey
Let's start with one request per second.
hey -c 1 -n 1000 -q 1 http://127.0.0.1:8080/mandelbrot
And with that running, in another window collect the trace
% curl -o trace.out http://127.0.0.1:8080/debug/pprof/trace?seconds=5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 66169 0 66169 0 0 13233 0 --:--:-- 0:00:05 --:--:-- 17390
% go tool trace trace.out
2017/09/17 16:09:30 Parsing trace...
2017/09/17 16:09:30 Serializing trace...
2017/09/17 16:09:30 Splitting trace...
2017/09/17 16:09:30 Opening browser. Trace viewer is listening on http://127.0.0.1:60301
* Simulating overload
Let's increase the rate to 5 requests per second.
hey -c 5 -n 1000 -q 5 http://127.0.0.1:8080/mandelbrot
And with that running, in another window collect the trace
% curl -o trace.out http://127.0.0.1:8080/debug/pprof/trace?seconds=5
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 66169 0 66169 0 0 13233 0 --:--:-- 0:00:05 --:--:-- 17390
% go tool trace trace.out
2017/09/17 16:09:30 Parsing trace...
2017/09/17 16:09:30 Serializing trace...
2017/09/17 16:09:30 Splitting trace...
2017/09/17 16:09:30 Opening browser. Trace viewer is listening on http://127.0.0.1:60301
* Extra credit
* Sieve of Eratosthenes
The concurrent [[examples/sieve/main.go][prime sieve]] is one of the first Go programs written.
Ivan Daniluk [[http://divan.github.io/posts/go_concurrency_visualize/][wrote a great post on visualising]] it.
Let's take a look at its operation using the execution tracer.
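One way to collect that trace yourself (a sketch using the standard `runtime/trace` package directly, rather than `pkg/profile`) is to wrap the sieve's `main` like this:

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    // ... run the sieve ...
}

Then open the result with `go`tool`trace`trace.out` as before.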
* Conclusion
That's it! Now you know everything about the execution tracer.
* More resources
- Rhys Hiltner, [[https://www.youtube.com/watch?v=mmqDlbWk_XA][Go's execution tracer]] (dotGo 2016)
- Rhys Hiltner, [[https://www.youtube.com/watch?v=V74JnrGTwKA][An Introduction to "go tool trace"]] (GopherCon 2017)
- Dave Cheney, [[https://www.youtube.com/watch?v=2h_NFBFrciI][Seven ways to profile Go programs]] (GolangUK 2016)
- Dave Cheney, [[https://dave.cheney.net/training#high-performance-go][High performance Go workshop]]
- Ivan Daniluk, [[https://www.youtube.com/watch?v=KyuFeiG3Y60][Visualizing Concurrency in Go]] (GopherCon 2016)
- Kavya Joshi, [[https://www.youtube.com/watch?v=KBZlN0izeiY][Understanding Channels]] (GopherCon 2017)