
Liveness probe failing with 400 #12462

Closed · Shashankft9 opened this issue Dec 29, 2021 · 14 comments · Fixed by #12479
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/accepted (Issues which should be fixed (post-triage))

@Shashankft9 (Member) commented Dec 29, 2021

What version of Knative?

v1.1.0

Expected Behavior

I have defined liveness and readiness probes on the ksvc; they look like this:

          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
          name: user-container
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            successThreshold: 1

I am handling these paths in the Go code. I was expecting this to work like it does for any other pod.

Actual Behavior

It all works in Knative v0.25, but when I test this on the newer v1.1.0, the ksvc pod looks fine for the first 5-10 seconds; after that the liveness probe fails, and then the readiness probe as well, with the events below:

  Warning  Unhealthy  11s (x3 over 31s)  kubelet            Liveness probe failed: HTTP probe failed with statuscode: 400
  Normal   Killing    11s                kubelet            Container user-container failed liveness probe, will be restarted
  Warning  Unhealthy  1s (x10 over 10s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503

Maybe the timing of the occurrence has something to do with periodSeconds?
I tried to compare the pod YAML of the ksvc deployed with v0.25 and v1.1.0, and I can see this bit is missing in the pod with v1.1.0:

    startupProbe:
      exec:
        command:
        - /ko-app/queue
        - -probe-timeout
        - 10m0s
      failureThreshold: 1
      periodSeconds: 1
      timeoutSeconds: 600

Is this somehow related?

Steps to Reproduce the Problem

I first noticed this error while deploying functions using the knfunc CLI, but I was then able to reproduce the issue with a sample ksvc that has a probe configuration added, deployed on the latest release of Knative.

Shashankft9 added the kind/bug label Dec 29, 2021
@julz (Member) commented Jan 4, 2022

Hi @Shashankft9 - could you share details of how /health/liveness and /health/readiness are implemented? In particular: do these endpoints work if accessed directly, is there any delay before they start reporting success, and do they listen on the same port as the other endpoints in the user container? Also, do you have Istio mesh enabled in this environment? If you could share an (ideally minimal) image that reproduces the problem, that would be super helpful too. Thanks!

@Shashankft9 (Member, Author) commented Jan 4, 2022

Hey,
Yeah, both paths are configured on the same port as the user-container port (8080), and yes, both paths are accessible through the pod IP and port until the pod goes into CrashLoopBackOff.
Regarding the implementation, I have:

        r := mux.NewRouter()
        r.HandleFunc("/", handler)
        r.HandleFunc("/health/liveness", healthz)
        r.HandleFunc("/health/readiness", readyz))

and these two handlers just do w.WriteHeader(http.StatusOK) for simplicity.
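
For anyone trying to reproduce this, here is a minimal, self-contained sketch of such a service (the router and handlers mirror the snippets in this thread; the main function, listen port, and response bodies are assumptions on my part):

package main

import (
        "fmt"
        "log"
        "net/http"

        "github.com/gorilla/mux"
)

// handler serves regular traffic; the body here is just a placeholder.
func handler(w http.ResponseWriter, _ *http.Request) {
        fmt.Fprintln(w, "hello")
}

// healthz and readyz always report success, as described above.
func healthz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "liveness check done")
}

func readyz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "readiness check done")
}

func main() {
        r := mux.NewRouter()
        r.HandleFunc("/", handler)
        r.HandleFunc("/health/liveness", healthz)
        r.HandleFunc("/health/readiness", readyz)

        // The user container listens on 8080, matching the probe ports in the ksvc spec.
        log.Fatal(http.ListenAndServe(":8080", r))
}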

You can try reproducing this issue through this yaml:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: datafetch-go
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "180s"
        autoscaling.knative.dev/metric: "rps"
        autoscaling.knative.dev/target: "100"
    spec:
      containers:
        - image: shashankft/datafetch-go:probes
          env:
            - name: TARGET
              value: "Go Sample v1"
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
          name: user-container
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            successThreshold: 1

A couple of possibly dumb questions - they might not be related, but:

  • What's the implication of giving the port here as 0? I noticed that when using the func CLI, the ports have 0 as their value.
  • I just noticed that I can't give the port in v0.25 of Serving (I was using 0 instead of 8080 before); it gives me this error:
Error from server (BadRequest): error when creating "datafetch.yaml": admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].livenessProbe.httpGet.port, spec.template.spec.containers[0].readinessProbe.httpGet.port
  • Related to the above: giving the port is allowed in v1.1.0, as in my YAML file above.
  • Also, I notice that if I check the spec of the underlying pods that come up from the ksvc, the liveness and readiness probes have been modified - why is that done? Is there a related GitHub issue that gives more context on this?

Thanks for helping, and apologies for any noise.

@julz (Member) commented Jan 4, 2022

Assuming that having separate healthz and readyz handlers implies they both do something more interesting than return 200 in all cases, are you able to share the code behind them (or an outline of what they do)? If you replace them with an endpoint that always returns 200, does anything change?

Also could you confirm whether Istio mesh is enabled or not? Thanks!

The port should be set automatically by Knative, so you shouldn't need to specify one (setting 0 will result in the correct port being added). The readiness and liveness probes are rewritten so that the readiness probe goes via the Queue Proxy (so that a single endpoint can report readiness for both, which is needed by some optimisations in the Activator).
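
For illustration, based on the kubelet log later in this thread, the rewritten liveness probe on the resulting pod ends up looking roughly like the following (the port and header values are inferred from that log output, not taken from the Knative source):

    livenessProbe:
      httpGet:
        path: /health/liveness
        port: 8012            # queue-proxy port, rewritten from the user-specified 8080
        httpHeaders:
          - name: K-Kubelet-Probe
            value: queue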

@Shashankft9 (Member, Author) commented Jan 4, 2022

Oh sorry, I thought I mentioned the mesh in the previous message - no, I don't have Istio mesh enabled.
In the image I shared above, it actually always returns 200:

func healthz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "liveness check done")
}

func readyz(w http.ResponseWriter, _ *http.Request) {
        w.WriteHeader(http.StatusOK)
        fmt.Fprintf(w, "readiness check done")
}

@skonto (Contributor) commented Jan 4, 2022

@julz I am able to reproduce it on minikube with a trivial service. Here are the probe log entries on the queue-proxy side.
At some point I see:

{"httpRequest": {"requestMethod": "GET", "requestUrl": "/healthz", "requestSize": "0", "status": 400, "responseSize": "32", "userAgent": "kube-probe/1.21", "remoteIp": "172.17.0.1:51746", "serverIp": "172.17.0.14", "referer": "", "latency": "2.3956e-05s", "protocol": "HTTP/1.1"}, "traceId": "[]"}

Code and service def here.

@skonto (Contributor) commented Jan 4, 2022

The error I see is the following (I used the original service and set --v=4 on the kubelet side):

Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884475   20845 http.go:137] Probe failed for http://172.17.0.14:8012/health/liveness with request headers map[Accept:[*/*] K-Kubelet-Probe:[queue] User-Agent:[kube-probe/1.21]], response body: unexpected probe header value:
Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884519   20845 prober.go:116] "Probe failed" probeType="Liveness" pod="default/datafetch-go-00001-deployment-56f67d7f64-qg629" podUID=4855bf7b-a691-47d5-a10a-970d68bf2a6f containerName="user-container" probeResult=failure output="HTTP probe failed with statuscode: 400"
Jan 04 21:59:12 minikube kubelet[20845]: I0104 21:59:12.884595   20845 kubelet.go:2033] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="default/datafetch-go-00001-deployment-56f67d7f64-qg629"

It seems that the header value is not OK. The error comes from here.
I suspect that removing this part in this PR

	if ph == "" {
			next.ServeHTTP(w, r)
			return
		}

causes this. WDYT @julz? Just returning when ph == "" fixed this.
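
For context, the removed guard made the probe check behave roughly like the sketch below: requests without a probe header fall through to the main handler instead of being rejected. This is only an outline of the behaviour being discussed, not the actual Knative drainer code, and the header name is just a stand-in:

// probeHandler answers recognised probe requests and forwards everything else.
// "K-Kubelet-Probe" is used here purely as an illustrative header name.
func probeHandler(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                ph := r.Header.Get("K-Kubelet-Probe")
                if ph == "" {
                        // Not a probe: pass the request through untouched.
                        next.ServeHTTP(w, r)
                        return
                }
                // A probe request: answer it directly.
                w.WriteHeader(http.StatusOK)
        })
}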

@julz (Member) commented Jan 5, 2022

Interesting, thanks for digging into this @skonto! What I don't quite follow is that, if I'm reading it right, previously when ph was "" it would call next.ServeHTTP, but next (the removed final parameter) was nil before the PR 🤔. Do you understand how/why that worked? (Or what am I missing? 😅)

[screenshots of the relevant code]

@julz (Member) commented Jan 5, 2022

Right, I understand why it fails; I just don't quite follow why the old code worked in this case. Previously ph was equal to "", so we called next.ServeHTTP - but wasn't next nil (since it was set to var healthcheckHandler http.Handler)? Does a nil http.Handler happen to do something sensible somehow?
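
(For reference, a nil http.Handler would not do anything sensible on its own: calling ServeHTTP on a nil interface value simply panics with a nil pointer dereference, e.g.:)

package main

import "net/http"

func main() {
        var next http.Handler   // nil interface value, like the old healthcheckHandler
        next.ServeHTTP(nil, nil) // panics: invalid memory address or nil pointer dereference
}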

@skonto (Contributor) commented Jan 5, 2022

Yeah, sorry, I misread - and I deleted my comment, but not fast enough I guess. My understanding is that we don't need the next handler, since we can stop at the healthcheck side of the drainer: it is a probe and can be handled there by just returning; the drainer will either return here or serve the next handler here. In the past, though, we didn't have the drainer logic and we passed the composedHandler directly as that probeHandler here.
I don't think nil does anything or is passed there. BTW, do we have any tests for this case? (It seems not.)

@julz (Member) commented Jan 5, 2022

Ahhh OK, I think I see now: I suspect this was broken by #12033, not #12318. As of #12033 we could end up in the drainer with ph == "", and it would then have hit an NPE. #12318 changed that so instead of an NPE we get a 400 error, but the correct behaviour is the one from before the introduction of the drainer in the first place - and we never realised any of this because it's all missing tests :(.

Do you want to PR the fix and I'll review? (Or we can do it the other way around, just let me know!)

@skonto (Contributor) commented Jan 5, 2022

I am off for the rest of the week, so feel free - I have limited access :)

@dprotaso (Member) commented Jan 5, 2022

cc @nader-ziada

@dprotaso (Member) commented Jan 5, 2022

/triage accepted
/milestone Icebox

knative-prow-robot added the triage/accepted label Jan 5, 2022
dprotaso added this to the Icebox milestone Jan 5, 2022
@julz (Member) commented Jan 6, 2022

/assign
