Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash on cluster GET during installation #6550

Open
almusil opened this issue Jul 9, 2024 · 12 comments
Open

Crash on cluster GET during installation #6550

almusil opened this issue Jul 9, 2024 · 12 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@almusil
Copy link
Member

almusil commented Jul 9, 2024

I've got the following crash in the logs when I try to access the cluster page during installation:

10.45.224.87 - - [09/Jul/2024:14:01:35 +0000] "GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 HTTP/1.1" 200 43515 "http://wsfd-advnetlab157.anl.eng.rdu2.dc.redhat.com:8080/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69" "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0" "-"
 2024/07/09 14:01:35 http: panic serving 127.0.0.1:43844: write tcp 127.0.0.1:8090->127.0.0.1:43844: write: broken pipe

 net/http.(*conn).serve.func1()
         /usr/local/go/src/net/http/server.go:1854 +0xbf
 panic({0x4630ee0, 0xc00143e000})
         /usr/local/go/src/runtime/panic.go:890 +0x263
 github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetClusterOK).WriteResponse(0xc001bae008, {0x52bdd90, 0xc00250cd80}, {0x52a5e00, 0x4d22ab8})
         /assisted-service/restapi/operations/installer/v2_get_cluster_responses.go:56 +0xa6
 github.com/go-openapi/runtime/middleware.(*Context).Respond(0xc002343110, {0x52bdd90?, 0xc00250cd80}, 0xc00087af00, {0xc000d9ffe0?, 0x1, 0x1}, 0xc00087ad00, {0x45bb0e0, 0xc001bae008})
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/context.go:510 +0x6b5
 github.com/openshift/assisted-service/restapi/operations/installer.(*V2GetCluster).ServeHTTP(0xc0016a3848, {0x52bdd90, 0xc00250cd80}, 0xc00087af00)
         /assisted-service/restapi/operations/installer/v2_get_cluster.go:67 +0x35e
 github.com/go-openapi/runtime/middleware.NewOperationExecutor.func1({0x52bdd90, 0xc00250cd80}, 0xc00087af00)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/operation.go:28 +0x59
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52bdd90?, 0xc00250cd80?}, 0x17ab757?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/internal/metrics.Handler.func1.1()
         /assisted-service/internal/metrics/reporter.go:37 +0x31
 github.com/slok/go-http-metrics/middleware.Middleware.Measure({{{0x52bdd60, 0xc00070e000}, {0x4a64bc2, 0x12}, 0x0, 0x0, 0x0}}, {0x0, 0x0}, {0x52d0880, ...}, ...)
         /assisted-service/vendor/github.com/slok/go-http-metrics/middleware/middleware.go:117 +0x30e
 github.com/openshift/assisted-service/internal/metrics.Handler.func1({0x52cbef0?, 0xc000a9a540}, 0xc00087af00)
         /assisted-service/internal/metrics/reporter.go:36 +0x35f
 net/http.HandlerFunc.ServeHTTP(0x52cd7b0?, {0x52cbef0?, 0xc000a9a540?}, 0xc001b92550?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/context.ContextHandler.func1.1({0x52cbef0, 0xc000a9a540}, 0xc00087ae00)
         /assisted-service/pkg/context/param.go:95 +0xc8
 net/http.HandlerFunc.ServeHTTP(0xc00213c448?, {0x52cbef0?, 0xc000a9a540?}, 0x31475a3235305661?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.NewRouter.func1({0x52cbef0, 0xc000a9a540}, 0xc00087ac00)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/router.go:77 +0x257
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cbef0?, 0xc000a9a540?}, 0xc00213c688?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.Redoc.func1({0x52cbef0, 0xc000a9a540}, 0x7?)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/redoc.go:72 +0x242
 net/http.HandlerFunc.ServeHTTP(0x43317c0?, {0x52cbef0?, 0xc000a9a540?}, 0x171ec6a?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/go-openapi/runtime/middleware.Spec.func1({0x52cbef0, 0xc000a9a540}, 0xc001dc61c0?)
         /assisted-service/vendor/github.com/go-openapi/runtime/middleware/spec.go:46 +0x18c
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cbef0?, 0xc000a9a540?}, 0x4?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1({0x52cb9b0, 0xc001dc61c0}, 0xc00213c8f0?)
         /assisted-service/vendor/github.com/NYTimes/gziphandler/gzip.go:336 +0x24e
 net/http.HandlerFunc.ServeHTTP(0x10?, {0x52cb9b0?, 0xc001dc61c0?}, 0xc00213c8d0?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/app.WithMetricsResponderMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0xc00012c0a0?)
         /assisted-service/pkg/app/middleware.go:34 +0xb0
 net/http.HandlerFunc.ServeHTTP(0x7f2bf3656108?, {0x52cb9b0?, 0xc001dc61c0?}, 0xc000b8cb70?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/app.WithHealthMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0x5291801?)
         /assisted-service/pkg/app/middleware.go:57 +0x162
 net/http.HandlerFunc.ServeHTTP(0x52cd708?, {0x52cb9b0?, 0xc001dc61c0?}, 0x52918c0?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 github.com/openshift/assisted-service/pkg/requestid.handler.ServeHTTP({{0x52a8280?, 0xc00139b180?}}, {0x52cb9b0, 0xc001dc61c0}, 0xc00087ab00)
         /assisted-service/pkg/requestid/requestid.go:69 +0x1ad
 github.com/openshift/assisted-service/internal/spec.WithSpecMiddleware.func1({0x52cb9b0?, 0xc001dc61c0?}, 0xc00087ab00?)
         /assisted-service/internal/spec/spec.go:38 +0x9b
 net/http.HandlerFunc.ServeHTTP(0x0?, {0x52cb9b0?, 0xc001dc61c0?}, 0x17820ce?)
         /usr/local/go/src/net/http/server.go:2122 +0x2f
 net/http.serverHandler.ServeHTTP({0xc000b8cab0?}, {0x52cb9b0, 0xc001dc61c0}, 0xc00087ab00)
         /usr/local/go/src/net/http/server.go:2936 +0x316
 net/http.(*conn).serve(0xc001caabd0, {0x52cd7b0, 0xc000ea7a10})
         /usr/local/go/src/net/http/server.go:1995 +0x612
 created by net/http.(*Server).Serve
         /usr/local/go/src/net/http/server.go:3089 +0x5ed

The container version of AI:

quay.io/edge-infrastructure/assisted-image-service  latest            ef3637bb83cf  13 hours ago   179 MB
quay.io/edge-infrastructure/assisted-service        latest            bf3148e62279  16 hours ago   956 MB
quay.io/edge-infrastructure/assisted-installer-ui   latest            9d95da3d91dd  30 hours ago   462 MB
@rccrdpccl
Copy link
Contributor

rccrdpccl commented Jul 9, 2024

Thanks for reporting this @almusil

Can you please provide the version of the service that crashed? Looks like it's the assisted-service, you can extract the version by podman inspect and paste here ImageDigest field
How was the service deployed? Is this a podman deployment? Where is the output you have pasted above coming from? (what's the access log line, is it the UI component logs interleaved with assisted-service's logs?)

@almusil
Copy link
Member Author

almusil commented Jul 10, 2024

Thanks for reporting this @almusil

Can you please provide the version of the service that crashed? Looks like it's the assisted-service, you can extract the version by podman inspect and paste here ImageDigest field How was the service deployed?

ImageDigest for all the AI images, even though the log suggests that it is the assisted-service.

# podman inspect quay.io/edge-infrastructure/assisted-image-service -f "{{ .Digest }}"
sha256:3b6baaeef183b158872468e4027f7851acb8c6dd6f45d80479a86360ab8c2682

# podman inspect quay.io/edge-infrastructure/assisted-service -f "{{ .Digest }}"
sha256:b5e0616e829acef9592fd87936b2f1da709426b24b4273879baad13c1d894db1

# podman inspect quay.io/edge-infrastructure/assisted-installer-ui -f "{{ .Digest }}"
sha256:38f20bd8214ed7a414e813daf1531b5e98db55433fc423024ddea87adb6d4798

Is this a podman deployment?

Yes it is podman deployment.

Where is the output you have pasted above coming from?

This was observed in the journal. The AI is running as part of bigger project to automate openshift deployments.

(what's the access log line, is it the UI component logs interleaved with assisted-service's logs?)

Not sure if I understand this one, can you please elaborate?

@rccrdpccl
Copy link
Contributor

rccrdpccl commented Jul 10, 2024

Not sure if I understand this one, can you please elaborate?

I'm interested in those lines:

10.45.224.87 - - [09/Jul/2024:14:01:35 +0000] "GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 HTTP/1.1" 200 43515 "http://wsfd-advnetlab157.anl.eng.rdu2.dc.redhat.com:8080/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69" "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0" "-"
 2024/07/09 14:01:35 http: panic serving 127.0.0.1:43844: write tcp 127.0.0.1:8090->127.0.0.1:43844: write: broken pipe

The first line seems to be an access log, which should not have been emitted by assisted-service.
The second line is a panic for unhandled broken connection, which might be assisted service. Can you confirm it's the assisted service emitting the panic log, and not other applications?

If that's the case, could you describe what is "accessing the cluster page"? Is it accessing assisted-service-ui at the cluster page? If so, could you also try with curl and the API if we get the same crash?

It seems that the service could be trying to write to a closed connection.

@almusil
Copy link
Member Author

almusil commented Jul 10, 2024

From the journal it is the assisted-installer-service:
assisted-installer-service[2211655]: 2024/07/10 13:09:40 http: panic serving 127.0.0.1:42856: write tcp 127.0.0.1:8090->127.0.0.1:42856: write: broken pipe

I cannot pinpoint exact URL that causes this with CURL.

Screenshot from 2024-07-10 15-15-06

Screenshot from 2024-07-10 15-15-20

Clicking the ovnscale causes the failure, not sure how to better describe it.

@rccrdpccl
Copy link
Contributor

the path is GET /api/assisted-install/v2/clusters/20e31b11-3aba-4457-9e44-7c5f9efa3a69 (as in the access logs), but you probably need authentication, would you be able to run that and see if it causes the same issue?

@rccrdpccl
Copy link
Contributor

You can also inspect what the UI is doing with developer tools and replay the same request, but from curl instead

@almusil
Copy link
Member Author

almusil commented Jul 11, 2024

Tried both without any success unfortunately

@rccrdpccl
Copy link
Contributor

WDYM "without any success"? Still crashing? Can you try to paste the curl's output?

@almusil
Copy link
Member Author

almusil commented Jul 11, 2024

The curl doesn't crash, I don't have the cluster anymore unfortunately

@rccrdpccl
Copy link
Contributor

Do you have a minimal example/setup where we can reproduce this issue?

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
@openshift-bot
Copy link
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

3 participants