Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync/atomic: TestNilDeref flaky failure on windows-386 with runtime fatal error #70288

Open
cherrymui opened this issue Nov 11, 2024 · 15 comments
Assignees
Labels
arch-386 Issues solely affecting the 32-bit x86 architecture compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Milestone

Comments

@cherrymui
Copy link
Member

cherrymui commented Nov 11, 2024

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref"

sync/atomic.TestNilDeref has been flaky on windows-386 with runtime fatal error "traceback did not unwind completely"

=== RUN   TestNilDeref
runtime: g388: frame.sp=0x1922f8c top=0x1922ff0
	stack=[0x1922000-0x1923000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0xb0b5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21aff5d4 sp=0x21aff5c0 pc=0x9fd5f5
runtime.(*unwinder).finishInternal(0x21aff65c)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x21aff5f4 sp=0x21aff5d4 pc=0x9ec57d
...
runtime.gopark(0xb109e4, 0x18122d0, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x1896f8c sp=0x1896f78 pc=0x9fd72d
runtime.gcBgMarkWorker(0x188e1c0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x1896fe8 sp=0x1896f8c pc=0x9a6aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x1896ff0 sp=0x1896fe8 pc=0x9a69a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1896ff4 sp=0x1896ff0 pc=0xa02c61
created by runtime.gcBgMarkStartWorkers in goroutine 333
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a

It started since today (Nov 11).

https://ci.chromium.org/ui/test/golang/sync%2Fatomic.TestNilDeref?q=V%3Abuilder%3Dgotip-windows-386-test_only+V%3Ago_branch%3Dmaster+V%3Agoarch%3D386+V%3Agoos%3Dwindows+V%3Ahost_goarch%3D386+V%3Ahost_goos%3Dwindows

All other platforms look fine. Only windows-386.

I've seen it on trybot more than once, e.g. https://ci.chromium.org/ui/p/golang/builders/try/gotip-windows-386/b8731553101305354337/test-results?sortby=&groupby=

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 11, 2024
@gabyhelp
Copy link

Related Issues

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 11, 2024
@cherrymui cherrymui added this to the Go1.24 milestone Nov 11, 2024
@cherrymui
Copy link
Member Author

cc @golang/runtime

@cherrymui cherrymui changed the title sync/atomic: TestNilDeref flaky failure on windows-386 sync/atomic: TestNilDeref flaky failure on windows-386 with runtime fatal error Nov 11, 2024
@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref" && `traceback did not unwind completely`
2024-11-11 23:49 gotip-windows-386 go@f9159b11 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
runtime: g388: frame.sp=0x1922f8c top=0x1922ff0
	stack=[0x1922000-0x1923000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0xb0b5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21aff5d4 sp=0x21aff5c0 pc=0x9fd5f5
runtime.(*unwinder).finishInternal(0x21aff65c)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x21aff5f4 sp=0x21aff5d4 pc=0x9ec57d
...
runtime.gopark(0xb109e4, 0x18122d0, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x1896f8c sp=0x1896f78 pc=0x9fd72d
runtime.gcBgMarkWorker(0x188e1c0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x1896fe8 sp=0x1896f8c pc=0x9a6aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x1896ff0 sp=0x1896fe8 pc=0x9a69a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x1896ff4 sp=0x1896ff0 pc=0xa02c61
created by runtime.gcBgMarkStartWorkers in goroutine 333
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a
2024-11-11 23:50 gotip-windows-386 go@4c8ab993 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
runtime: g226: frame.sp=0x1049f8c top=0x1049ff0
	stack=[0x1049000-0x104a000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x6bb5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21c2f9d0 sp=0x21c2f9bc pc=0x5ad5f5
runtime.(*unwinder).finishInternal(0x21c2fa58)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x21c2f9f0 sp=0x21c2f9d0 pc=0x59c57d
...
runtime.gopark(0x6c09e4, 0x10123a8, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x111bf8c sp=0x111bf78 pc=0x5ad72d
runtime.gcBgMarkWorker(0x1088480)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x111bfe8 sp=0x111bf8c pc=0x556aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x111bff0 sp=0x111bfe8 pc=0x5569a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x111bff4 sp=0x111bff0 pc=0x5b2c61
created by runtime.gcBgMarkStartWorkers in goroutine 254
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a

watchflakes

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref" && `traceback did not unwind completely`
2024-11-12 17:16 gotip-windows-386 go@c969491e sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
runtime: g225: frame.sp=0x1449f8c top=0x1449ff0
	stack=[0x1449000-0x144a000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x21b5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x214ff9dc sp=0x214ff9c8 pc=0x10d5f5
runtime.(*unwinder).finishInternal(0x214ffa64)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x214ff9fc sp=0x214ff9dc pc=0xfc57d
...
runtime.gopark(0x2209e4, 0x1412288, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x14a0f8c sp=0x14a0f78 pc=0x10d72d
runtime.gcBgMarkWorker(0x14663c0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x14a0fe8 sp=0x14a0f8c pc=0xb6aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x14a0ff0 sp=0x14a0fe8 pc=0xb69a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x14a0ff4 sp=0x14a0ff0 pc=0x112c61
created by runtime.gcBgMarkStartWorkers in goroutine 330
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a

watchflakes

@cherrymui
Copy link
Member Author

cc @golang/windows

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref" && `traceback did not unwind completely`
2024-11-12 18:35 gotip-windows-386 go@70f6c139 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
runtime: g319: frame.sp=0x1135f8c top=0x1135ff0
	stack=[0x1135000-0x1136000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x2cb5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x2146fac4 sp=0x2146fab0 pc=0x1bd5f5
runtime.(*unwinder).finishInternal(0x2146fb4c)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x2146fae4 sp=0x2146fac4 pc=0x1ac57d
...
runtime.gopark(0x2d09e4, 0x1012330, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x10aef8c sp=0x10aef78 pc=0x1bd72d
runtime.gcBgMarkWorker(0x1110140)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x10aefe8 sp=0x10aef8c pc=0x166aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x10aeff0 sp=0x10aefe8 pc=0x1669a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x10aeff4 sp=0x10aeff0 pc=0x1c2c61
created by runtime.gcBgMarkStartWorkers in goroutine 316
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a

watchflakes

@mknyszek
Copy link
Contributor

The test binary seems to die abruptly. The package tests are marked as failed, but there are JSON lines missing from the full log (https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8731450114802979457/+/u/step/11/log/3). That's also why the test is marked as "aborted": the JSON test parser loses track of what exactly happened.

This also appears to have started yesterday based on the test history: https://ci.chromium.org/ui/test/golang/sync%2Fatomic.TestNilDeref?q=V%3Abuilder%3Dgotip-windows-386-test_only+V%3Ago_branch%3Dmaster+V%3Agoarch%3D386+V%3Agoos%3Dwindows+V%3Ahost_goarch%3D386+V%3Ahost_goos%3Dwindows. Though, there's no obvious culprit.

On Linux, we've seen immediate deaths like this end up being a bad dereference during a signal handler. Does something similar happen for crashes in the exception handler? (At least we have a stack trace here.)

CC @qmuntal who might have additional insights.

@rsc
Copy link
Contributor

rsc commented Nov 12, 2024

I've reproduced this as early as Go 1.20, with -count=10000. It is a stack unwinding / stack corruption problem.

Here is one error I saw at a later commit:

runtime: g1: frame.sp=0x94a6fa4 top=0x94a6ff0
    stack=[0x94a6000-0x94a7000
fatal error: traceback did not unwind completely

...

goroutine 1 [chan receive (scan)]:
runtime.gopark(0x3b8d20, 0x9414570, 0xe, 0x17, 0x2)
    /Users/rsc/go/src/runtime/proc.go:398 +0xfd fp=0x94a6c70 sp=0x94a6c5c pc=0x29e29d
runtime.chanrecv(0x9414540, 0x944fce3, 0x1)
    /Users/rsc/go/src/runtime/chan.go:583 +0x3ad fp=0x94a6cac sp=0x94a6c70 pc=0x265f0d
runtime.chanrecv1(0x9414540, 0x944fce3)
    /Users/rsc/go/src/runtime/chan.go:442 +0x1c fp=0x94a6cc0 sp=0x94a6cac pc=0x265b5c
testing.(*T).Run(0x94e21e0, {0x3abfbd, 0xc}, 0x3b89e0)
    /Users/rsc/go/src/testing/testing.go:1649 +0x3fa fp=0x94a6d24 sp=0x94a6cc0 pc=0x32f8ca
testing.runTests.func1(0x94e21e0)
    /Users/rsc/go/src/testing/testing.go:2054 +0x49 fp=0x94a6d48 sp=0x94a6d24 pc=0x3318e9
testing.tRunner(0x94e21e0, 0x944fdf0)
    /Users/rsc/go/src/testing/testing.go:1595 +0x11e fp=0x94a6d94 sp=0x94a6d48 pc=0x32ea7e
testing.runTests(0x94980c0, {0x4a6e20, 0x4c, 0x4c}, {0xc1c50eb92c798d2c, 0x8bb2fda08d, 0x4a8400})
    /Users/rsc/go/src/testing/testing.go:2052 +0x3a1 fp=0x94a6e04 sp=0x94a6d94 pc=0x331801
testing.(*M).Run(0xf8659c22)
    /Users/rsc/go/src/testing/testing.go:1925 +0x6b6 fp=0x94a6f68 sp=0x94a6e04 pc=0x3301e6
main.main()
    _testmain.go:201 +0x16f fp=0x94a6fa4 sp=0x94a6f68 pc=0x37cedf
sync/atomic_test.TestNilDeref.func59(0x29b680)
    /Users/rsc/go/src/sync/atomic/atomic_test.go:2526 +0x48 fp=0x94a6fb0 sp=0x94a6fa4 pc=0x378b18

...
goroutine 211 [wait for GC cycle]:
runtime.gopark(0x3b8ea4, 0x4a8764, 0x19, 0x14, 0x1)
    /Users/rsc/go/src/runtime/proc.go:398 +0xfd fp=0x94a7cb4 sp=0x94a7ca0 pc=0x29e29d
runtime.goparkunlock(...)
    /Users/rsc/go/src/runtime/proc.go:404
runtime.gcWaitOnMark(0x1dd2)
    /Users/rsc/go/src/runtime/mgc.go:529 +0x5c fp=0x94a7cc8 sp=0x94a7cb4 pc=0x2783ac
runtime.GC()
    /Users/rsc/go/src/runtime/mgc.go:472 +0x6d fp=0x94a7ce8 sp=0x94a7cc8 pc=0x27822d
sync/atomic_test.TestNilDeref.func59.1()
    /Users/rsc/go/src/sync/atomic/atomic_test.go:2522 +0x18 fp=0x94a7cf8 sp=0x94a7ce8 pc=0x37bc38
runtime.deferCallSave(0x94a7d5c, 0x3b8bec)
    /Users/rsc/go/src/runtime/panic.go:798 +0x76 fp=0x94a7d00 sp=0x94a7cf8 pc=0x29ad56
runtime.runOpenDeferFrame(0x9450090)
    /Users/rsc/go/src/runtime/panic.go:771 +0x1b0 fp=0x94a7d28 sp=0x94a7d00 pc=0x29abb0
panic({0x393320, 0x4a2028})
    /Users/rsc/go/src/runtime/panic.go:914 +0x1df fp=0x94a7d7c sp=0x94a7d28 pc=0x29af9f
runtime.panicmem(...)
    /Users/rsc/go/src/runtime/panic.go:261
runtime.sigpanic()
    /Users/rsc/go/src/runtime/signal_windows.go:364 +0x1f3 fp=0x94a7da0 sp=0x94a7d7c pc=0x2b06a3
sync/atomic.(*Int32).Load(...)
    /Users/rsc/go/src/sync/atomic/type.go:74
sync/atomic_test.TestNilDeref.func36()
    /Users/rsc/go/src/sync/atomic/atomic_test.go:2495 +0x15 fp=0x94a7dac sp=0x94a7da0 pc=0x37b725
sync/atomic_test.TestNilDeref.func59(0x94a7dc0)
    /Users/rsc/go/src/sync/atomic/atomic_test.go:2525 +0x3a fp=0x94a7db8 sp=0x94a7dac pc=0x378b0a
sync/atomic_test.TestNilDeref(0x94e22d0)
    /Users/rsc/go/src/sync/atomic/atomic_test.go:2526 +0x2f6 fp=0x94a7f98 sp=0x94a7db8 pc=0x378aa6
testing.tRunner(0x94e22d0, 0x3b89e0)
    /Users/rsc/go/src/testing/testing.go:1595 +0x11e fp=0x94a7fe4 sp=0x94a7f98 pc=0x32ea7e
testing.(*T).Run.func1()
    /Users/rsc/go/src/testing/testing.go:1648 +0x2c fp=0x94a7ff0 sp=0x94a7fe4 pc=0x32fa0c
runtime.goexit()
    /Users/rsc/go/src/runtime/asm_386.s:1363 +0x1 fp=0x94a7ff4 sp=0x94a7ff0 pc=0x2cd341
created by testing.(*T).Run in goroutine 1
    /Users/rsc/go/src/testing/testing.go:1648 +0x3db
FAIL    sync/atomic    15.498s

The panic is about not fully unwinding goroutine 1. You can see the traceback printer get it wrong too. main.main should be called from goexit, but it has been "called" from TestNilDeref.func59 instead. This suggests stack corruption near the top of goroutine 1's stack. The top of goroutine 1's stack is the bottom of goroutine 211's stack, which coincidentally (?) is the goroutine that did the nil dereference. func59 is also on that stack, but with a different caller pc (0x378b0a vs 0x378b18).

At Go 1.20, I got this error:

%  GOOS=windows GOARCH=386 go test sync/atomic -run=NilDeref -timeout=1h -count=10000
runtime: g 38: unexpected return pc for runtime.gcBgMarkWorker called from 0x0
stack: frame={sp:0x9892f94, fp:0x9892ff0} stack=[0x9892000,0x9893000)
0x09892f14:  0x00000000  0x00000000  0x0000000f  0x00000010 
0x09892f24:  0x0098a37c <runtime.lock2+0x0000005c>  0xf8659c22  0x00007fff  0x3fd00000 
0x09892f34:  0x00bb5518  0x00000001  0x00000000  0x00000001 
0x09892f44:  0x099823c0  0x09880000  0x00000000  0x009a0da4 <runtime.(*gcControllerState).removeIdleMarkWorker+0x00000054> 
0x09892f54:  0x00bdb3d8  0x00000001  0x0000000c  0x00000000 
0x09892f64:  0x00000000  0x098840f0  0x00000000  0x009b8ca0 <runtime.recovery+0x00000000> 
0x09892f74:  0x00000000  0xf9130e7c  0x00007fff  0x00aca390 
0x09892f84:  0x00000002  0x09a00240  0x099823c0  0x0099845e <runtime.gcBgMarkWorker+0x000000ee> 
0x09892f94: <0x00aca2e8  0x09812318  0xffff141a  0x00a8d97a <sync/atomic_test.TestNilDeref.func59+0x0000004a> 
0x09892fa4:  0x00000000  0x09884b40  0x00000000  0x0010005f 
0x09892fb4:  0x00001fa0  0x002b0023  0x0053002b  0x002b002b 
0x09892fc4:  0x00010246  0x00000000  0x00000000  0x00000000 
0x09892fd4:  0x00000000  0x00000000  0x00000000  0x00000000 
0x09892fe4:  0x00000000  0x00000000 !0x00000000 >0x00000000 
0x09892ff4:  0x00000000  0x00000000  0x00000000 
fatal error: unknown caller pc

Again, instead of finding goexit near the top of the stack, we find corruption and a PC in sync/atomic_test.TestNilDeref.func59.

I wonder whether the Windows fault handler is pushing a larger-than-expected fault descriptor onto the goroutine stack (should it be using the system stack), although func59 is a few frames above the actual fault, so I don't know why the fault descriptor would mention it.

Definitely looks like stack corruption somehow.

@rsc
Copy link
Contributor

rsc commented Nov 12, 2024

Here is a failure at Go 1.19.

% GOOS=windows GOARCH=386 go test sync/atomic -run=NilDeref -timeout=1h -count=10000
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x1 addr=0x10 pc=0x76837a]

goroutine 38 [running]:
runtime.throw({0x8987fc, 0x2a})
	/Users/rsc/go/src/runtime/panic.go:1047 +0x4d fp=0x11904f70 sp=0x11904f5c pc=0x7898bd
runtime.sigpanic()
	/Users/rsc/go/src/runtime/signal_windows.go:249 +0x23f fp=0x11904f94 sp=0x11904f70 pc=0x79d45f
runtime: g 38: unexpected return pc for runtime.gcBgMarkWorker called from 0x0
stack: frame={sp:0x11904f94, fp:0x11904ff0} stack=[0x11904000,0x11905000)
0x11904f14:  0x00000000  0x00000000  0x00000000  0x00000000 
0x11904f24:  0x00000000  0x00000000  0x00000000  0x00000000 
0x11904f34:  0x11846d80  0x119023c0  0x007b5bb9 <runtime.systemstack+0x00000029>  0x00789bab <runtime.fatalthrow+0x0000006b> 
0x11904f44:  0x11904f48  0x00789bc0 <runtime.fatalthrow.func1+0x00000000>  0x119023c0  0x007898bd <runtime.throw+0x0000004d> 
0x11904f54:  0x11904f5c  0x007898bd <runtime.throw+0x0000004d>  0x00000002  0x007898d0 <runtime.throw.func1+0x00000000> 
0x11904f64:  0x008987fc  0x0000002a  0x0079d45f <runtime.sigpanic+0x0000023f>  0x008987fc 
0x11904f74:  0x0000002a  0x00000000  0x00000000  0x119023c0 
0x11904f84:  0x00000002  0x118466c0  0x119023c0  0x0076837a <runtime.gcBgMarkWorker+0x0000010a> 
0x11904f94: <0x0089b6b4  0x11812288  0xffff1418  0x00000000 
0x11904fa4:  0x00000000  0x00981b40  0x1a05a648  0x0010005f 
0x11904fb4:  0x00001fa0  0x002b0023  0x0053002b  0x002b002b 
0x11904fc4:  0x00010246  0x00000000  0x00000000  0x00000000 
0x11904fd4:  0x00000000  0x00000000  0x00000000  0x00000000 
0x11904fe4:  0x00000000  0x00000000 !0x00000000 >0x00000000 
0x11904ff4:  0x00000000  0x00000000  0x00000000 
runtime.(*muintptr).set(...)
	/Users/rsc/go/src/runtime/runtime2.go:305
runtime.gcBgMarkWorker()
	/Users/rsc/go/src/runtime/mgc.go:1270 +0x10a fp=0x11904ff0 sp=0x11904f94 pc=0x76837a
created by runtime.gcBgMarkStartWorkers
	/Users/rsc/go/src/runtime/mgc.go:1159 +0x25

...

goroutine 19 [wait for GC cycle]:
runtime.gopark(0x89b75c, 0x981d2c, 0x17, 0x14, 0x1)
	/Users/rsc/go/src/runtime/proc.go:363 +0xff fp=0x11861c84 sp=0x11861c70 pc=0x78c3df
runtime.goparkunlock(...)
	/Users/rsc/go/src/runtime/proc.go:369
runtime.gcWaitOnMark(0xb)
	/Users/rsc/go/src/runtime/mgc.go:505 +0x5f fp=0x11861c98 sp=0x11861c84 pc=0x765f4f
runtime.GC()
	/Users/rsc/go/src/runtime/mgc.go:448 +0x6e fp=0x11861cb8 sp=0x11861c98 pc=0x765dbe
sync/atomic_test.TestNilDeref.func59.1()
	/Users/rsc/go/src/sync/atomic/atomic_test.go:2601 +0x1a fp=0x11861cc8 sp=0x11861cb8 pc=0x85f4fa
runtime.deferCallSave(0x11861d38, 0x89b9c0)
	/Users/rsc/go/src/runtime/panic.go:796 +0x72 fp=0x11861cd0 sp=0x11861cc8 pc=0x789102
runtime.runOpenDeferFrame(0x118843c0, 0x118ba090)
	/Users/rsc/go/src/runtime/panic.go:769 +0x1a8 fp=0x11861cfc sp=0x11861cd0 pc=0x788f68
panic({0x878980, 0x97cc48})
	/Users/rsc/go/src/runtime/panic.go:884 +0x1ba fp=0x11861d58 sp=0x11861cfc pc=0x7892fa
runtime.panicmem(...)
	/Users/rsc/go/src/runtime/panic.go:260
runtime.sigpanic()
	/Users/rsc/go/src/runtime/signal_windows.go:255 +0x177 fp=0x11861d7c sp=0x11861d58 pc=0x79d397
runtime/internal/atomic.Cas(0x0, 0x0, 0x0)
	/Users/rsc/go/src/runtime/internal/atomic/atomic_386.s:19 +0xc fp=0x11861d80 sp=0x11861d7c pc=0x7522dc
sync/atomic.CompareAndSwapPointer(0x0, 0x0, 0x0)
	/Users/rsc/go/src/runtime/atomic_pointer.go:76 +0x4a fp=0x11861d9c sp=0x11861d80 pc=0x7b29ca
sync/atomic_test.TestNilDeref.func11()
	/Users/rsc/go/src/sync/atomic/atomic_test.go:2549 +0x31 fp=0x11861db0 sp=0x11861d9c pc=0x85e9b1
sync/atomic_test.TestNilDeref.func59(0x11905dc4)
	/Users/rsc/go/src/sync/atomic/atomic_test.go:2604 +0x3c fp=0x11861dbc sp=0x11861db0 pc=0x85f87c
sync/atomic_test.TestNilDeref(0x118842d0)
	/Users/rsc/go/src/sync/atomic/atomic_test.go:2605 +0x2f8 fp=0x11861f9c sp=0x11861dbc pc=0x85f818
testing.tRunner(0x118842d0, 0x89b9d8)
	/Users/rsc/go/src/testing/testing.go:1446 +0x113 fp=0x11861fe4 sp=0x11861f9c pc=0x814073
testing.(*T).Run.func1()
	/Users/rsc/go/src/testing/testing.go:1493 +0x2e fp=0x11861ff0 sp=0x11861fe4 pc=0x814e6e
runtime.goexit()
	/Users/rsc/go/src/runtime/asm_386.s:1326 +0x1 fp=0x11861ff4 sp=0x11861ff0 pc=0x7b7051
created by testing.(*T).Run
	/Users/rsc/go/src/testing/testing.go:1493 +0x374

In this case the test goroutine stack (0x11861c70-0x11861ff0) is not directly above the corrupt stack (0x11904000-0x11905000). However, note that the "argument" to func59 is 0x11905dc4. That appears to be an on-stack closure pointer argument, looking at the generated code for both func59 and TestNilDeref, but the right value for the stack we are on would be 0x11861dc4. That implies the test goroutine stack used to be directly above the corrupt stack but was moved. It looks like the closure pointer argument was not adjusted when the stack moved, because it is dead. But that's very important to know: the stack moved.

I don't see why it would have moved. As I read runtime/stack.go _FixedStack (the initial stack size) is 4096 bytes, and none of the stack we can see look like they are using anywhere near that. However:

// StackSystem is a number of additional bytes to add
// to each stack below the usual guard area for OS-specific
// purposes like signal handling. Used on Windows, Plan 9,
// and iOS because they do not use a separate stack.
_StackSystem = goos.IsWindows*512*goarch.PtrSize + goos.IsPlan9*512 + goos.IsIos*goarch.IsArm64*1024

Perhaps 2kB is not enough signal handling space on windows-386. I think our windows-386 builders are actually 64-bit Windows now. Maybe that matters?

I tried adding some debug prints about the stack during the exceptions. The low bits of the stack pointer we see Go's exception handler running on are around 0x128. It should not go lower after that in Go, and that's still almost 300 bytes of room. That said, this is being called from some DLL, so maybe it went lower before that in the DLL, or even after that as we return back out. It seems too close for comfort either way.

I tried doubling the reserved system stack on 386, simplifying the expression to IsWindows*4096. I'm waiting on proper statistics but preliminary runs make it look like the problem goes away with the larger system stack.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/627375 mentions this issue: runtime: reserve 4kB for system stack on windows-386

@dmitshur dmitshur added the arch-386 Issues solely affecting the 32-bit x86 architecture label Nov 12, 2024
@rsc
Copy link
Contributor

rsc commented Nov 12, 2024

At current master the test fails 100/100 times running -count=10.
After I double stackSystem, the test passes 100/100 times with -count=1000.

That is, before, it consistently (100/100) fails at running 10 iterations.
Now it consistently (100/100) succeeds at running 1000 iterations.
I feel confident that it's fixed.

CL 627375 has the stack doubling.

@dmitshur dmitshur added FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 12, 2024
@dmitshur dmitshur moved this to Active in Test Flakes Nov 12, 2024
@dmitshur dmitshur moved this to In Progress in Go Compiler / Runtime Nov 12, 2024
@github-project-automation github-project-automation bot moved this from Active to Done in Test Flakes Nov 13, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Nov 13, 2024
@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref" && `traceback did not unwind completely`
2024-11-13 00:57 gotip-windows-386 go@ab554650 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
runtime: g242: frame.sp=0x1492f8c top=0x1492ff0
	stack=[0x1492000-0x1493000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0xa5b5fb, 0x23})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21b2f560 sp=0x21b2f54c pc=0x94d5f5
runtime.(*unwinder).finishInternal(0x21b2f5e8)
	C:/b/s/w/ir/x/w/goroot/src/runtime/traceback.go:566 +0x16d fp=0x21b2f580 sp=0x21b2f560 pc=0x93c57d
...
runtime.gopark(0xa609e4, 0x1412300, 0x1a, 0xa, 0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/proc.go:435 +0xfd fp=0x156bf8c sp=0x156bf78 pc=0x94d72d
runtime.gcBgMarkWorker(0x1502340)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1363 +0xe2 fp=0x156bfe8 sp=0x156bf8c pc=0x8f6aa2
runtime.gcBgMarkStartWorkers.gowrap1()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x25 fp=0x156bff0 sp=0x156bfe8 pc=0x8f69a5
runtime.goexit({})
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:1393 +0x1 fp=0x156bff4 sp=0x156bff0 pc=0x952c61
created by runtime.gcBgMarkStartWorkers in goroutine 286
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1279 +0x11a

watchflakes

@gopherbot
Copy link
Contributor

Found new dashboard test flakes for:

#!watchflakes
default <- builder ~ `windows-386` && pkg == "sync/atomic" && test == "TestNilDeref"
2024-11-11 17:11 gotip-windows-386 go@5a9aeef9 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
2024-11-11 21:33 gotip-windows-386 go@73ac82f9 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xc6c1d1]

runtime stack:
runtime.throw({0xd8d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x215ff7bc sp=0x215ff7a8 pc=0xc7d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x215ff7e0 sp=0x215ff7bc pc=0xc5fa61
runtime.(*unwinder).next(0x215ff848)
...
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xc6c1d1]

runtime stack:
runtime.throw({0xd8d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x215ff3d8 sp=0x215ff3c4 pc=0xc7d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x215ff3fc sp=0x215ff3d8 pc=0xc5fa61
runtime.(*unwinder).next(0x215ff5f0)
...
runtime.markroot(0x142c974, 0x5, 0x1)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:212 +0x1ad fp=0x215ff948 sp=0x215ff8ec pc=0xc2916d
runtime.gcDrain(0x142c974, 0x3)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1188 +0x50f fp=0x215ff998 sp=0x215ff948 pc=0xc2b79f
runtime.gcDrainMarkWorkerDedicated(...)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1440 +0xa4 fp=0x215ff9c4 sp=0x215ff998 pc=0xc26f34
runtime.systemstack(0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:374 +0x3d fp=0x215ff9c8 sp=0x215ff9c4 pc=0xc8180d
2024-11-12 01:08 gotip-windows-386 go@c96939fb sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
2024-11-12 19:51 gotip-windows-386 go@3efbc30f sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8c pc=0x23c1d1]

runtime stack:
runtime.throw({0x35d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0xcff838 sp=0xcff824 pc=0x24d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0xcff85c sp=0xcff838 pc=0x22fa61
runtime.(*unwinder).next(0xcff8c4)
...
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x8c pc=0x23c1d1]

runtime stack:
runtime.throw({0x35d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0xcff454 sp=0xcff440 pc=0x24d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0xcff478 sp=0xcff454 pc=0x22fa61
runtime.(*unwinder).next(0xcff66c)
...
runtime.markroot(0x142dc74, 0x5, 0x1)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:212 +0x1ad fp=0xcff9c4 sp=0xcff968 pc=0x1f916d
runtime.gcDrain(0x142dc74, 0x3)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1188 +0x50f fp=0xcffa14 sp=0xcff9c4 pc=0x1fb79f
runtime.gcDrainMarkWorkerDedicated(...)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1440 +0xa4 fp=0xcffa40 sp=0xcffa14 pc=0x1f6f34
runtime.systemstack(0x251729)
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:374 +0x3d fp=0xcffa44 sp=0xcffa40 pc=0x25180d
2024-11-12 21:08 gotip-windows-386 go@1f8fa494 sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xb2c1d1]

runtime stack:
runtime.throw({0xc4d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21bafca4 sp=0x21bafc90 pc=0xb3d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21bafcc8 sp=0x21bafca4 pc=0xb1fa61
runtime.(*unwinder).next(0x21bafd30)
...
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xb2c1d1]

runtime stack:
runtime.throw({0xc4d005, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1084 +0x35 fp=0x21baf8c0 sp=0x21baf8ac pc=0xb3d5f5
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21baf8e4 sp=0x21baf8c0 pc=0xb1fa61
runtime.(*unwinder).next(0x21bafad8)
...
runtime.markroot(0x142c974, 0x5, 0x1)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:212 +0x1ad fp=0x21bafe30 sp=0x21bafdd4 pc=0xae916d
runtime.gcDrain(0x142c974, 0x3)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1188 +0x50f fp=0x21bafe80 sp=0x21bafe30 pc=0xaeb79f
runtime.gcDrainMarkWorkerDedicated(...)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1440 +0xa4 fp=0x21bafeac sp=0x21bafe80 pc=0xae6f34
runtime.systemstack(0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:374 +0x3d fp=0x21bafeb0 sp=0x21bafeac pc=0xb4180d
2024-11-19 17:52 go1.23-windows-386 release-branch.go1.23@3726f07c sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8c pc=0x105b231]

runtime stack:
runtime.throw({0x1177a37, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1067 +0x35 fp=0x21f0f8c4 sp=0x21f0f8b0 pc=0x106ee75
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21f0f8e8 sp=0x21f0f8c4 pc=0x104f611
runtime.(*unwinder).next(0x21f0f950)
...
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x8c pc=0x105b231]

runtime stack:
runtime.throw({0x1177a37, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1067 +0x35 fp=0x21f0f4e0 sp=0x21f0f4cc pc=0x106ee75
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21f0f504 sp=0x21f0f4e0 pc=0x104f611
runtime.(*unwinder).next(0x21f0f6f8)
...
runtime.markroot(0x182e974, 0x5, 0x1)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:212 +0x1ad fp=0x21f0fa50 sp=0x21f0f9f4 pc=0x101a29d
runtime.gcDrain(0x182e974, 0x3)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1188 +0x50f fp=0x21f0faa0 sp=0x21f0fa50 pc=0x101c8ef
runtime.gcDrainMarkWorkerDedicated(...)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1440 +0xa5 fp=0x21f0facc sp=0x21f0faa0 pc=0x1018045
runtime.systemstack(0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:374 +0x3d fp=0x21f0fad0 sp=0x21f0facc pc=0x107405d
2024-11-19 18:04 go1.23-windows-386 release-branch.go1.23@777f43ab sync/atomic.TestNilDeref [ABORT] (log)
=== RUN   TestNilDeref
fatal error: unexpected signal during runtime execution
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xc6b231]

runtime stack:
runtime.throw({0xd87a37, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1067 +0x35 fp=0x21d7f62c sp=0x21d7f618 pc=0xc7ee75
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21d7f650 sp=0x21d7f62c pc=0xc5f611
runtime.(*unwinder).next(0x21d7f6b8)
...
fatal error: unexpected signal during runtime execution
panic during panic
[signal 0xc0000005 code=0x0 addr=0x8c pc=0xc6b231]

runtime stack:
runtime.throw({0xd87a37, 0x2a})
	C:/b/s/w/ir/x/w/goroot/src/runtime/panic.go:1067 +0x35 fp=0x21d7f248 sp=0x21d7f234 pc=0xc7ee75
runtime.sigpanic()
	C:/b/s/w/ir/x/w/goroot/src/runtime/signal_windows.go:395 +0x2a1 fp=0x21d7f26c sp=0x21d7f248 pc=0xc5f611
runtime.(*unwinder).next(0x21d7f460)
...
runtime.markroot(0x142fc74, 0x5, 0x1)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:212 +0x1ad fp=0x21d7f7b8 sp=0x21d7f75c pc=0xc2a29d
runtime.gcDrain(0x142fc74, 0x3)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1188 +0x50f fp=0x21d7f808 sp=0x21d7f7b8 pc=0xc2c8ef
runtime.gcDrainMarkWorkerDedicated(...)
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgcmark.go:1112
runtime.gcBgMarkWorker.func2()
	C:/b/s/w/ir/x/w/goroot/src/runtime/mgc.go:1440 +0xa5 fp=0x21d7f834 sp=0x21d7f808 pc=0xc28045
runtime.systemstack(0x0)
	C:/b/s/w/ir/x/w/goroot/src/runtime/asm_386.s:374 +0x3d fp=0x21d7f838 sp=0x21d7f834 pc=0xc8405d

watchflakes

@github-project-automation github-project-automation bot moved this from Done to In Progress in Go Compiler / Runtime Nov 20, 2024
@dmitshur
Copy link
Contributor

dmitshur commented Nov 20, 2024

@gopherbot Please consider this issue for backport. It appears to be a case of stack corruption during fault handling (affecting windows/386 only), and we're now seeing this failure on both 1.23 and 1.22 release branches (e.g., here and here). We need to either fix the root problem, or otherwise at minimum skip the flaky test.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #70474 (for 1.22), #70475 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-386 Issues solely affecting the 32-bit x86 architecture compiler/runtime Issues related to the Go compiler and/or runtime. FixPending Issues that have a fix which has not yet been reviewed or submitted. NeedsFix The path to resolution is known, but the work has not been done. OS-Windows
Projects
Status: In Progress
Status: Active
Development

No branches or pull requests

6 participants