sync/atomic: TestNilDeref flaky failure on windows-386 with runtime fatal error #70288
Comments
Related Issues (Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.) |
cc @golang/runtime |
Found new dashboard test flakes for:
2024-11-11 23:49 gotip-windows-386 go@f9159b11 sync/atomic.TestNilDeref [ABORT] (log)
2024-11-11 23:50 gotip-windows-386 go@4c8ab993 sync/atomic.TestNilDeref [ABORT] (log)
|
Found new dashboard test flakes for:
2024-11-12 17:16 gotip-windows-386 go@c969491e sync/atomic.TestNilDeref [ABORT] (log)
|
cc @golang/windows |
Found new dashboard test flakes for:
2024-11-12 18:35 gotip-windows-386 go@70f6c139 sync/atomic.TestNilDeref [ABORT] (log)
|
The test binary seems to die abruptly. The package tests are marked as failed, but there are JSON lines missing from the full log (https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8731450114802979457/+/u/step/11/log/3). That's also why the test is marked as "aborted": the JSON test parser loses track of what exactly happened. This also appears to have started yesterday based on the test history: https://ci.chromium.org/ui/test/golang/sync%2Fatomic.TestNilDeref?q=V%3Abuilder%3Dgotip-windows-386-test_only+V%3Ago_branch%3Dmaster+V%3Agoarch%3D386+V%3Agoos%3Dwindows+V%3Ahost_goarch%3D386+V%3Ahost_goos%3Dwindows. Though, there's no obvious culprit. On Linux, we've seen immediate deaths like this end up being a bad dereference during a signal handler. Does something similar happen for crashes in the exception handler? (At least we have a stack trace here.) CC @qmuntal who might have additional insights. |
I've reproduced this as early as Go 1.20, with -count=10000. It is a stack unwinding / stack corruption problem. Here is one error I saw at a later commit:
The panic is about not fully unwinding goroutine 1. You can see the traceback printer get it wrong too. main.main should be called from goexit, but it has been "called" from TestNilDeref.func59 instead. This suggests stack corruption near the top of goroutine 1's stack. The top of goroutine 1's stack is the bottom of goroutine 211's stack, which coincidentally (?) is the goroutine that did the nil dereference. func59 is also on that stack, but with a different caller pc (0x378b0a vs 0x378b18). At Go 1.20, I got this error:
Again, instead of finding goexit near the top of the stack, we find corruption and a PC in sync/atomic_test.TestNilDeref.func59. I wonder whether the Windows fault handler is pushing a larger-than-expected fault descriptor onto the goroutine stack (should it be using the system stack?), although func59 is a few frames above the actual fault, so I don't know why the fault descriptor would mention it. It definitely looks like stack corruption somehow. |
Here is a failure at Go 1.19.
In this case the test goroutine stack (0x11861c70-0x11861ff0) is not directly above the corrupt stack (0x11904000-0x11905000). However, note that the "argument" to func59 is 0x11905dc4. That appears to be an on-stack closure pointer argument, looking at the generated code for both func59 and TestNilDeref, but the right value for the stack we are on would be 0x11861dc4. That implies the test goroutine stack used to be directly above the corrupt stack but was moved. It looks like the closure pointer argument was not adjusted when the stack moved, because it is dead. But that's very important to know: the stack moved. I don't see why it would have moved. As I read runtime/stack.go, _FixedStack (the initial stack size) is 4096 bytes, and none of the stacks we can see looks like it is using anywhere near that. However:
Perhaps 2kB is not enough signal handling space on windows-386. I think our windows-386 builders are actually 64-bit Windows now. Maybe that matters? I tried adding some debug prints about the stack during the exceptions. The low bits of the stack pointer we see Go's exception handler running on are around 0x128. It should not go lower after that in Go, and that's still almost 300 bytes of room. That said, this is being called from some DLL, so maybe it went lower before that in the DLL, or even after that as we return back out. It seems too close for comfort either way. I tried doubling the reserved system stack on 386, simplifying the expression to IsWindows*4096. I'm waiting on proper statistics but preliminary runs make it look like the problem goes away with the larger system stack. |
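The arithmetic in the comment above can be checked with a small sketch. It is not runtime code: the constant and function names are made up for illustration, the old reserve is assumed to be 512 bytes per pointer-sized word (which gives the 2kB mentioned for 386), and the initial stack is assumed to be the minimum stack plus the reserve, rounded up to a power of two. The addresses are the ones quoted from the Go 1.19 failure.

```go
package main

import "fmt"

// All names and values here are illustrative assumptions for windows/386,
// not the runtime's own identifiers.
const (
	ptrSize    = 4             // pointer size on 386
	stackMin   = 2048          // assumed minimum usable Go stack, in bytes
	oldReserve = 512 * ptrSize // assumed old system reserve: 2 kB on 386
	newReserve = 4096          // reserve tried in the comment above
)

// fixedStack returns the initial stack size for a given system reserve,
// assuming the total is rounded up to a power of two.
func fixedStack(reserve uint32) uint32 {
	size := uint32(1)
	for size < stackMin+reserve {
		size <<= 1
	}
	return size
}

// inStack reports whether addr lies in the half-open range [lo, hi).
func inStack(addr, lo, hi uint32) bool {
	return lo <= addr && addr < hi
}

func main() {
	fmt.Printf("old reserve %d -> initial stack %d\n", oldReserve, fixedStack(oldReserve))
	fmt.Printf("new reserve %d -> initial stack %d\n", newReserve, fixedStack(newReserve))

	// Low bits of the stack pointer observed in Go's exception handler.
	const spLowBits = 0x128
	fmt.Printf("headroom at handler entry: %d bytes\n", spLowBits)

	// Addresses quoted from the Go 1.19 failure.
	const (
		stale     = 0x11905dc4 // closure pointer "argument" to func59
		expected  = 0x11861dc4 // value matching the current test stack
		corruptHi = 0x11905000 // upper bound of the corrupt stack
	)
	fmt.Printf("stack moved by %#x bytes\n", stale-expected)
	fmt.Println("stale pointer in block above corrupt stack:",
		inStack(stale, corruptHi, corruptHi+fixedStack(oldReserve)))
}
```

Under these assumptions the stale pointer and the expected pointer share the same low 12 bits, which is consistent with the stack having been copied to a new location rather than the pointer being random garbage.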
Change https://go.dev/cl/627375 mentions this issue: |
At current master, the test fails 100/100 times when run with -count=10. That is, before the change it consistently (100/100) fails within 10 iterations. CL 627375 has the stack doubling. |
Found new dashboard test flakes for:
2024-11-13 00:57 gotip-windows-386 go@ab554650 sync/atomic.TestNilDeref [ABORT] (log)
|
Found new dashboard test flakes for:
2024-11-11 17:11 gotip-windows-386 go@5a9aeef9 sync/atomic.TestNilDeref [ABORT] (log)
2024-11-11 21:33 gotip-windows-386 go@73ac82f9 sync/atomic.TestNilDeref [ABORT] (log)
2024-11-12 01:08 gotip-windows-386 go@c96939fb sync/atomic.TestNilDeref [ABORT] (log)
2024-11-12 19:51 gotip-windows-386 go@3efbc30f sync/atomic.TestNilDeref [ABORT] (log)
2024-11-12 21:08 gotip-windows-386 go@1f8fa494 sync/atomic.TestNilDeref [ABORT] (log)
2024-11-19 17:52 go1.23-windows-386 release-branch.go1.23@3726f07c sync/atomic.TestNilDeref [ABORT] (log)
2024-11-19 18:04 go1.23-windows-386 release-branch.go1.23@777f43ab sync/atomic.TestNilDeref [ABORT] (log)
|
@gopherbot Please consider this issue for backport. It appears to be a case of stack corruption during fault handling (affecting windows/386 only), and we're now seeing this failure on both 1.23 and 1.22 release branches (e.g., here and here). We need to either fix the root problem, or otherwise at minimum skip the flaky test. |
Backport issue(s) opened: #70474 (for 1.22), #70475 (for 1.23). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
sync/atomic.TestNilDeref has been flaky on windows-386 with runtime fatal error "traceback did not unwind completely"
It started today (Nov 11).
https://ci.chromium.org/ui/test/golang/sync%2Fatomic.TestNilDeref?q=V%3Abuilder%3Dgotip-windows-386-test_only+V%3Ago_branch%3Dmaster+V%3Agoarch%3D386+V%3Agoos%3Dwindows+V%3Ahost_goarch%3D386+V%3Ahost_goos%3Dwindows
All other platforms look fine. Only windows-386.
I've seen it on trybot more than once, e.g. https://ci.chromium.org/ui/p/golang/builders/try/gotip-windows-386/b8731553101305354337/test-results?sortby=&groupby=