
Liveness probe failing with 400 #12462

Closed · Shashankft9 opened this issue Dec 29, 2021 · 14 comments · Fixed by #12479
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/accepted (Issues which should be fixed (post-triage))

@Shashankft9 (Member) commented Dec 29, 2021

What version of Knative?

v1.1.0

Expected Behavior

I have defined liveness and readiness probes on the ksvc; they look like this:

          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
          name: user-container
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            successThreshold: 1

I am handling these paths in the Go code. I was expecting this to work like it does for any other pod.

Actual Behavior

It all works in Knative v0.25, but when I test this on the newer v1.1.0, the ksvc pod looks fine for the first 5-10 seconds; after that the liveness probe fails, and then the readiness probe as well, with the events below:

  Warning  Unhealthy  11s (x3 over 31s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 400
  Normal   Killing    11s                kubelet            Container user-container failed liveness probe, will be restarted
  Warning  Unhealthy  1s (x10 over 10s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

Maybe the timing of the occurrence has something to do with periodSeconds?
I tried to compare the pod YAML of the ksvc deployed with v0.25 and v1.1.0, and I can see this bit is missing in the pod with v1.1.0:

    startupProbe:
      exec:
        command:
        - /ko-app/queue
        - -probe-timeout
        - 10m0s
      failureThreshold: 1
      periodSeconds: 1
      timeoutSeconds: 600

Is this somehow related?

Steps to Reproduce the Problem

I first noticed this error while deploying functions using the knfunc CLI, but I was then able to reproduce the issue with a sample ksvc that has a probe configuration added, deployed on the latest release of Knative.

Shashankft9 added the kind/bug label Dec 29, 2021
@julz (Member) commented Jan 4, 2022

Hi @Shashankft9 - could you share details of how /health/liveness and /health/readiness are implemented? In particular: do these endpoints work if accessed directly, is there any delay before they start reporting success, and do they listen on the same port as the other endpoints in the user container? Also, do you have Istio mesh enabled in this environment? If you could share an (ideally minimal) image that reproduces the problem, that would be super helpful too. Thanks!

@Shashankft9 (Member, Author) commented Jan 4, 2022

Hey,
Yeah, both paths are configured on the same port as the user-container port (8080), and yes, both paths are accessible through the pod IP and port until the pod goes into CrashLoopBackOff.
Regarding the implementation, I have:

        r := mux.NewRouter()
        r.HandleFunc("/", handler)
        r.HandleFunc("/health/liveness", healthz)
        r.HandleFunc("/health/readiness", readyz))

and these two handlers just do w.WriteHeader(http.StatusOK) for simplicity.
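
For anyone trying to reproduce this, here is a minimal, self-contained sketch of such a service (the router and handlers mirror the snippets in this thread; the main function, listen port, and response bodies are assumptions on my part):

package main

import (
        "fmt"
        "log"
        "net/http"

        "github.com/gorilla/mux"
)

// handler serves regular traffic; the body here is just a placeholder.
func handler(w http.ResponseWriter, _ *http.Request) {
        fmt.Fprintln(w, "hello")
}

// healthz and readyz always report success, as described above.
func healthz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "liveness check done")
}

func readyz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "readiness check done")
}

func main() {
        r := mux.NewRouter()
        r.HandleFunc("/", handler)
        r.HandleFunc("/health/liveness", healthz)
        r.HandleFunc("/health/readiness", readyz)

        // The user container listens on 8080, matching the probe ports in the ksvc spec.
        log.Fatal(http.ListenAndServe(":8080", r))
}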

You can try reproducing this issue through this yaml:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: datafetch-go
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "180s"
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: shashankft/datafetch-go:probes
          env:
            - name: TARGET
              value: "Go Sample v1"
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
          name: user-container
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            successThreshold: 1

A couple of possibly dumb questions - they might not be related, but:

  • What's the implication of giving the port here as 0? I noticed that when using the func CLI, the ports have 0 as their value.
  • I just noticed that I can't give the port in v0.25 of Serving (I was using 0 instead of 8080 before); it gives me this error:
Error from server (BadRequest): error when creating "datafetch.yaml": admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].livenessProbe.httpGet.port, spec.template.spec.containers[0].readinessProbe.httpGet.port
  • Related to the above: giving the port is allowed in v1.1.0, as in my YAML file above.
  • Also, I notice that if I check the spec of the underlying pods that come up from the ksvc, the liveness and readiness probes have been modified - why is that done? Is there a related GitHub issue that gives more context on this?

Thanks for helping, and apologies for any noise.

@julz (Member) commented Jan 4, 2022

Assuming that having separate healthz and readyz handlers implies they both do something more interesting than return 200 in all cases, are you able to share the code behind them (or an outline of what they do)? If you replace them with an endpoint that always returns 200, does anything change?

Also could you confirm whether Istio mesh is enabled or not? Thanks!

The port should be set automatically by Knative, so you shouldn't need to specify one (setting 0 will result in the correct port being added). The readiness and liveness probes are rewritten so that the readiness probe goes via the Queue Proxy (so that a single endpoint can report readiness for both, which is needed by some optimisations in the Activator).
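
For illustration, based on the kubelet log later in this thread, the rewritten liveness probe on the resulting pod ends up looking roughly like the following (the port and header values are inferred from that log output, not taken from the Knative source):

    livenessProbe:
      httpGet:
        path: /health/liveness
        port: 8012            # queue-proxy port, rewritten from the user-specified 8080
        httpHeaders:
          - name: K-Kubelet-Probe
            value: queue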

@Shashankft9 (Member, Author) commented Jan 4, 2022

Oh sorry, I thought I mentioned the mesh in the previous message - no, I don't have Istio mesh enabled.
In the image I shared above, it actually always returns 200:

func healthz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "liveness check done")
}

func readyz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "readiness check done")
}

@skonto (Contributor) commented Jan 4, 2022

@julz I am able to reproduce it on minikube with a trivial service. Here are the probe log entries on the queue-proxy side.
At some point I see:

{"httpRequest": {"requestMethod": "GET", "requestUrl": "/healthz", "requestSize": "0", "status": 400, "responseSize": "32", "userAgent": "kube-probe/1.21", "remoteIp": "172.17.0.1:51746", "serverIp": "172.17.0.14", "referer": "", "latency": "2.3956e-05s", "protocol": "HTTP/1.1"}, "traceId": "[]"}

Code and service def here.

@skonto (Contributor) commented Jan 4, 2022

The error I see is the following (I used the original service and set --v=4 on the kubelet side):

Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884475   20845 http.go:137] Probe failed for http://172.17.0.14:8012/health/liveness with request headers map[Accept:[*/*] K-Kubelet-Probe:[queue] User-Agent:[kube-probe/1.21]], response body: unexpected probe header value:
Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884519   20845 prober.go:116] "Probe failed" probeType="Liveness" pod="default/datafetch-go-00001-deployment-56f67d7f64-qg629" podUID=4855bf7b-a691-47d5-a10a-970d68bf2a6f containerName="user-container" probeResult=failure output="HTTP probe failed with statuscode: 400"
Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884595   20845 kubelet.go:2033] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="default/datafetch-go-00001-deployment-56f67d7f64-qg629"

It seems that the header value is not OK. The error comes from here.
I suspect that removing this part in this PR

	if ph == "" {
			next.ServeHTTP(w, r)
			return
		}

causes this. WDYT @julz? Just returning when ph == "" fixed this.
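
For context, the removed guard made the probe check behave roughly like the sketch below: requests without a probe header fall through to the main handler instead of being rejected. This is only an outline of the behaviour being discussed, not the actual Knative drainer code, and the header name is just a stand-in:

// probeHandler answers recognised probe requests and forwards everything else.
// "K-Kubelet-Probe" is used here purely as an illustrative header name.
func probeHandler(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                ph := r.Header.Get("K-Kubelet-Probe")
                if ph == "" {
                        // Not a probe: pass the request through untouched.
                        next.ServeHTTP(w, r)
                        return
                }
                // A probe request: answer it directly.
                w.WriteHeader(http.StatusOK)
        })
}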

@julz (Member) commented Jan 5, 2022

Interesting, thanks for digging into this @skonto! What I don't quite follow is that, if I'm reading it right, previously when ph was "" it would call next.ServeHTTP, but next (the removed final parameter) was nil before the PR 🤔. Do you understand how/why that worked? (Or what am I missing? 😅)

[screenshots of the relevant code]

@julz (Member) commented Jan 5, 2022

Right, I understand why it fails; I just don't quite follow why the old code worked in this case. Previously ph was equal to "", so we called next.ServeHTTP - but wasn't next nil (since it was set to var healthcheckHandler http.Handler)? Does a nil http.Handler happen to do something sensible somehow?
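
(For reference, a nil http.Handler would not do anything sensible on its own: calling ServeHTTP on a nil interface value simply panics with a nil pointer dereference, e.g.:)

package main

import "net/http"

func main() {
        var next http.Handler   // nil interface value, like the old healthcheckHandler
        next.ServeHTTP(nil, nil) // panics: invalid memory address or nil pointer dereference
}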

@skonto (Contributor) commented Jan 5, 2022

Yeah, sorry, I misread - and I deleted my comment, but not fast enough I guess. My understanding is that we don't need the next handler, since we can stop at the healthcheck side of the drainer: it is a probe and can be handled there by just returning; the drainer will either return here or serve the next handler here. In the past, though, we didn't have the drainer logic and we passed the composedHandler directly as that probeHandler here.
I don't think nil does anything or is passed there. BTW, do we have any tests for this case? (It seems not.)

@julz (Member) commented Jan 5, 2022

Ahhh OK, I think I see now: I suspect this was broken by #12033, not #12318. As of #12033 we could end up in the drainer with ph == "", and it would then have hit an NPE. #12318 changed that so instead of an NPE we get a 400 error, but the correct behaviour is the one from before the introduction of the drainer in the first place - and we never realised any of this because it's all missing tests :(.

Do you want to PR the fix and I'll review? (Or we can do it the other way around, just let me know!)

@skonto (Contributor) commented Jan 5, 2022

I am off for the rest of the week, so feel free - I have limited access :)

@dprotaso (Member) commented Jan 5, 2022

cc @nader-ziada

@dprotaso (Member) commented Jan 5, 2022

/triage accepted
/milestone Icebox

knative-prow-robot added the triage/accepted label Jan 5, 2022
dprotaso added this to the Icebox milestone Jan 5, 2022
@julz (Member) commented Jan 6, 2022

/assign
