This repository has been archived by the owner on Mar 30, 2018. It is now read-only.

How to Reduce CPU Run Time Spent in `runtime.morestack`

Bishop Brock edited this page Jul 22, 2016 · 2 revisions


Go implements dynamic stack allocation for goroutines, and stack-capacity checking on most function calls. Every goroutine begins life with a relatively small stack of a fixed size, currently 2 KiB. If a goroutine ever needs more stack space than its current allocation, runtime.morestack is called to arrange this. Besides the memory allocation overhead, this also requires copying the contents of the old stack into the new stack space.

When a goroutine terminates, the goroutine structure is "recycled" rather than discarded. However, if a dead goroutine's stack has grown beyond the default size, the stack memory is discarded. Therefore, if an application creates a large number of short-lived goroutines, it can be important for performance to ensure that most goroutines will not need to grow their stacks.

If you see a large amount of cumulative run-time attributed to runtime.morestack in a CPU profile, you may want to consider modifying the default. Changing the default goroutine stack allocation is easy as long as you are willing to rebuild Go, or at least rebuild the Go runtime package from a source installation. Go does not currently provide a command, switch, or environment variable to alter this setting.

To change the default, open the file src/runtime/stack.go in your Go source installation and look, around line 71, for a line that reads

_StackMin = 2048

Change 2048 to a larger power-of-2, then rebuild/reinstall the runtime package by executing

go install -a

from that directory.

As with most performance tuning, your mileage will vary. For Hyperledger Fabric, the higher the throughput of the test case, the more likely you are to see a performance improvement from changing the stack size.

The chart below shows the benefit of moving to an 8 KiB minimum stack size for both an IBM POWER8 (P8) and an Intel Xeon (X86) standalone server running a busywork benchmark. Note that this is not a competitive comparison; it simply shows that for this benchmark, a throughput improvement of more than 10% is possible on both systems simply by increasing the default goroutine stack allocation.

[Chart: Benefit of increasing Go runtime._StackMin for the Hyperledger Fabric]

In the chart above, the shape of the curve for the POWER8 (P8) system is more-or-less expected. The shape of the X86 curve may or may not be real, as there is a moderate amount of run-to-run variability in the throughput measured by this benchmark.

This is the busywork make target used to generate the above plots, executed from fabric/tools/busywork/counters:

.PHONY: sweepNoops
sweepNoops:
	for clients in 1 2 4 8 16 32 64; do \
	    userModeNetwork -noops 4; \
	    ./driver \
	        -clients $$clients \
	        -transactions $$((1024 / $$clients)) \
	        -arrays 64 \
	        -peerBurst 16 \
	        ; \
	done