How to Reduce CPU Run Time Spent in `runtime.morestack`
Go implements dynamic stack allocation for goroutines, and stack-capacity checking on most function calls. Every goroutine begins life with a relatively small stack of a fixed size, currently 2 KiB. If a goroutine ever needs more stack space than its current allocation, `runtime.morestack` is called to arrange this. Besides the memory allocation overhead, this also requires copying the contents of the old stack into the new stack space.
When a goroutine terminates, the goroutine structure is "recycled" rather than being discarded. However, if a dead goroutine has a stack that has grown beyond the default size, the stack memory is discarded. Therefore, if an application creates a large number of short-lived goroutines, it can be important for performance to ensure that most of them will not need to grow their stacks.
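For illustration, here is a minimal Go sketch of the kind of workload that pays this cost repeatedly: every short-lived goroutine starts at the minimum stack size, and the deep call chain forces a stack grow through `runtime.morestack` before the goroutine exits. The function names, frame sizes, and goroutine count below are arbitrary choices for the sketch, not values from the wiki.

```go
package main

import "sync"

// grow recurses until its frames add up to well over the default 2 KiB
// stack, so the runtime must grow (and copy) the stack via runtime.morestack.
// The local buffer exists only to inflate each frame.
func grow(depth int) byte {
	var pad [128]byte
	if depth == 0 {
		return pad[0]
	}
	pad[0] = grow(depth - 1)
	return pad[0]
}

func main() {
	const workers = 100000

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		// Each short-lived goroutine begins with a fresh minimum-size stack,
		// so every one of them pays the stack-growth cost again.
		go func() {
			defer wg.Done()
			grow(64) // roughly 64 frames of more than 128 bytes each: several KiB
		}()
	}
	wg.Wait()
}
```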
If you see large amounts of cumulative run time spent in `runtime.morestack` in a CPU profile, you may want to consider increasing the default goroutine stack size.
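One simple way to check is to wrap the workload with `runtime/pprof` and then inspect cumulative time with `go tool pprof` (older Go releases also want the program binary on the pprof command line). The sketch below is illustrative only; the profile file name and the stand-in workload are placeholders, not part of the original wiki.

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// work stands in for the application's real workload (for example, the
// goroutine-heavy sketch above). Here it is just a busy loop.
func work() {
	var sink float64
	for i := 0; i < 100000000; i++ {
		sink += float64(i)
	}
}

func main() {
	// Write a CPU profile to cpu.prof; inspect it afterwards with
	// "go tool pprof cpu.prof" and look at the cumulative time charged
	// to runtime.morestack.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	work()
}
```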
It is easy to change the default goroutine stack allocation as long as you are willing to rebuild Go, or at least rebuild the Go `runtime` package from a source installation. Go does not currently provide a command, switch, or environment variable to alter this setting.
To change the default, look in your Go source installation directory in the file `src/runtime/stack.go`, around line 71, for a line that looks like

```go
_StackMin = 2048
```

Change 2048 to a larger power of two, then rebuild/reinstall the `runtime` package by executing `go install -a` from that directory.
As with most performance tuning, your mileage will vary. For Hyperledger Fabric, the higher the throughput of the test case, the more likely you are to see a performance improvement from changing the stack size.
The chart below shows the benefit of going to an 8 KiB minimum stack size for both an IBM POWER8 (P8) and an Intel Xeon (X86) standalone server running a busywork benchmark. Note that this is not a competitive comparison; it simply shows that, for this benchmark, a throughput improvement of more than 10% is possible on both systems simply by increasing the default goroutine stack allocation.
In the chart above, the shape of the curve for the POWER8 (P8) system is more or less expected. The shape of the X86 curve may or may not be real, as there is a moderate amount of run-to-run variability in the throughput measured by this benchmark.
This is the busywork make target used to generate the above plots, executed from `fabric/tools/busywork/counters`:
```make
# Run the noop benchmark for 1 through 64 clients, scaling the -transactions
# count inversely with the number of clients.
.PHONY: sweepNoops
sweepNoops:
	for clients in 1 2 4 8 16 32 64; do \
		userModeNetwork -noops 4; \
		./driver \
			-clients $$clients \
			-transactions $$((1024 / $$clients)) \
			-arrays 64 \
			-peerBurst 16 \
		; \
	done
```