Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

user pkg drain for draining in queue proxy #12033

Merged
merged 2 commits into from
Nov 5, 2021

Conversation

nader-ziada
Copy link
Member

Fixes #10783

Proposed Changes

  • use pkg/drain for draining in queue proxy instead of writing drain again

Release Note

Use pkg/drain in queue proxy

@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 22, 2021
@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/autoscale area/networking labels Sep 22, 2021
@nader-ziada nader-ziada changed the title user pkg drain for draining in queue proxy [WIP] user pkg drain for draining in queue proxy Sep 22, 2021
@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2021
@codecov
Copy link

codecov bot commented Sep 23, 2021

Codecov Report

Merging #12033 (69a0b5c) into main (cad72a3) will decrease coverage by 0.06%.
The diff coverage is 28.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #12033      +/-   ##
==========================================
- Coverage   87.36%   87.30%   -0.07%     
==========================================
  Files         196      195       -1     
  Lines        9611     9569      -42     
==========================================
- Hits         8397     8354      -43     
  Misses        935      935              
- Partials      279      280       +1     
Impacted Files Coverage Δ
cmd/queue/main.go 0.49% <0.00%> (+<0.01%) ⬆️
pkg/queue/health/handler.go 100.00% <100.00%> (ø)
pkg/reconciler/revision/background.go 89.32% <0.00%> (-1.95%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cad72a3...69a0b5c. Read the comment docs.

@nader-ziada
Copy link
Member Author

/retest

@nader-ziada nader-ziada changed the title [WIP] user pkg drain for draining in queue proxy user pkg drain for draining in queue proxy Sep 27, 2021
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2021
@nader-ziada
Copy link
Member Author

seeing this flaker here as well #12064

pkg/queue/health/health_state.go Outdated Show resolved Hide resolved
cmd/queue/main.go Show resolved Hide resolved
pkg/queue/health/health_state.go Outdated Show resolved Hide resolved
@dprotaso
Copy link
Member

/assign @julz @markusthoemmes

@nader-ziada
Copy link
Member Author

will work on comments and update PR

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2021
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2021
@nader-ziada
Copy link
Member Author

I think the PR makes a little more sense now if anyone has the time to take a look, thanks

cmd/queue/main.go Outdated Show resolved Hide resolved
Comment on lines 47 to 51
adminMux := http.NewServeMux()
handler := state.DrainHandlerFunc()
adminMux.HandleFunc(queue.RequestQueueDrainPath, func(w http.ResponseWriter, r *http.Request) {
handler(w, r)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pulling the admin server in here sorta conflates the responsibilities of the health.State

I think you can leave the admin server as it was in queue proxy's main and simply just call Drain

pkg/queue/health/health_state.go Outdated Show resolved Hide resolved
cmd/queue/main.go Outdated Show resolved Hide resolved
cmd/queue/main.go Show resolved Hide resolved
cmd/queue/main.go Outdated Show resolved Hide resolved
@nader-ziada nader-ziada force-pushed the pkgDrain branch 2 times, most recently from 681b95f to ec3d057 Compare October 22, 2021 13:52
@nader-ziada
Copy link
Member Author

I think this PR is ready now

composedHandler = network.NewProbeHandler(composedHandler)
// We might sometimes want to capture the probes/healthchecks in the request
// logs. Hence we need to have RequestLogHandler to be the first one.
if env.ServingEnableRequestLog {
composedHandler = requestLogHandler(logger, composedHandler, env)
}

return pkgnet.NewServer(":"+env.QueueServingPort, composedHandler)
drainer.Inner = composedHandler
drainer.HealthCheck = composedHandler.ServeHTTP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to convince myself this is correct. It seems like this makes the health check run the whole handler chain for a health check - is that what we want? I guess is works because the chain presumably still ends up in the health check path due to the header being there, but it seems a bit duplicative/unexpected that the health check be the whole handler chain? Wonder if we can have the health check be just the health check here

I think there may also be a mismatch between the queue behaviour and how pkg/drain acts. Specifically in the existing code when a knative prober header is passed (as opposed to the Kubernetes probe header) the queue only responds with queue.Name when the health check passes, whereas in drain serveKProbe always returns the queue.Name, even if the health check isn't passing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically in the existing code when a knative prober header is passed (as opposed to the Kubernetes probe header) the queue only responds with queue.Name when the health check passes

The link here is actually the queue proxy health check code and not the kprobe logic.

KProbe logic is wired in prior earlier in the handler chain:

serving/cmd/queue/main.go

Lines 314 to 315 in 70c7fc0

composedHandler = health.ProbeHandler(healthState, probeContainer, tracingEnabled, composedHandler)
composedHandler = network.NewProbeHandler(composedHandler)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, good point - I think you're right, but I also tbh kind of think this highlights that the flow gets a bit confusing with the whole chain being passed as both Inner and HealthCheck (at minimum I'm finding it quite difficult to follow now!). Example: IIUC right now we have network.NewProbeHandler in the composedHandler chain (as linked above) and also we handle kprobes in pkg/drain - I think they both do the same thing so it's probably OK, but isn't one of them redundant? (And same question wrt the non-kprobe health checks, is health.ProbeHandler ever being called other than via pkg/drain calling its HealthCheck field?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I brought that up #12033 (review) - because it was unclear to me why the drainer has kprobe logic when it exists in knative.dev/networking

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(And same question wrt the non-kprobe health checks, is health.ProbeHandler ever being called other than via pkg/drain calling its HealthCheck field?)

I would lump this into the figure out kprobe handling and then re-organize the handlers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the health.ProbeHandler is being called by the drainer when it calls healthcheck here which now has this line io.WriteString(w, queue.Name)

the TestRequestLogs would fail otherwsie

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue in the pkg repo knative/pkg#2324 to tackle the duplication, but will need to do that a separate PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the above issue would help with the duplication of kprobe handling. Do we also have/need a plan to deal with the other health check being duplicated via both pkgdrain's HealthCheck field and also accessible by the regular chain? TBCH even now with the context in my head about what drainer does I don't completely follow which paths the health check might be called by (both? only via drainer.healthcheck?), so I'd really like us to have a plan at least to clear that up

Copy link
Member Author

@nader-ziada nader-ziada Oct 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julz I've made a change to use a different handler for the health check that only has the necessary handlers, does that make more sense?

Copy link
Member

@julz julz Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is starting to look pretty good to me - thanks! Left a couple extra comments where I think (if my understanding is right) we may be able to "complete the refactor" a bit by dropping the health check from the existing path too

@@ -311,15 +305,24 @@ func buildServer(ctx context.Context, env config, healthState *health.State, pro
composedHandler = tracing.HTTPSpanMiddleware(composedHandler)
}

composedHandler = health.ProbeHandler(healthState, probeContainer, tracingEnabled, composedHandler)
composedHandler = health.ProbeHandler(probeContainer, tracingEnabled, composedHandler)
Copy link
Member

@julz julz Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed? (Is there a case where we need to do the health check that drainer doesn't intercept?). If not it'd be great if we could drop this so there's only one path the healthcheck can get called from

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I try to remove this, I get failures in the tests, for example

stream.go:254: W 13:56:38.802 activator-669f74ccf4-lg59r [activator] [serving-tests/request-logs-fxrcfgjb-00001] Failed probing pods err=unexpected body: want "queue", got "Hello World! How about some tasty noodles?\n"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah interesting, looks like that's because the activator sometimes does the probing directly as an optimisation to try to detect readiness faster than kubernetes can, and when it does it passes Activator as the user agent but drainer only calls healthcheck if the UA is kube-probe. So that path will be directly hitting the health probe (not via drainer) and since this PR drops healthState from the main path I think that means it'll continue to see 200s while we're shutting down, which may not be what we want 🤔. Wonder if there's any nice way to get pkgdrain to answer activator's probes as well as kubelets.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just add a check in the pkg/drain.go to also check if activator probe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable to me, yeah, it's a bit knative specific, but it is in knative/pkg so maybe that's fine :) If not I guess we could add a list of user agents we'll accept health checks from, but I'd go for the simple thing if no-one hates that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm, will make a PR in pkg

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think activator specific logic should be in pkg/drain - that package is used in eventing as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really activator specific logic, just checking if header has a specific value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

var healthcheckHandler http.Handler = httpProxy
healthcheckHandler = health.ProbeHandler(probeContainer, tracingEnabled, healthcheckHandler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be no case where we'd want to delegate down to the proxy from the health check now (drain is doing the switching between the healthcheck/inner paths), so maybe this could become:

Suggested change
var healthcheckHandler http.Handler = httpProxy
healthcheckHandler = health.ProbeHandler(probeContainer, tracingEnabled, healthcheckHandler)
var healthcheckHandler healthcheckHandler = health.ProbeHandler(probeContainer, tracingEnabled)

(just trying to make it so there's two independent paths, to the extent possible)

@nader-ziada
Copy link
Member Author

/hold
for new release of pkg that includes knative/pkg#2331

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 29, 2021
@nader-ziada nader-ziada force-pushed the pkgDrain branch 2 times, most recently from b159476 to 0a4481b Compare November 4, 2021 16:35
@nader-ziada
Copy link
Member Author

/hold cancel

@dprotaso @julz picked up the latest pkg change and update the queue proxy to use the new drainer extra headers field

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 4, 2021
@nader-ziada
Copy link
Member Author

/test pull-knative-serving-istio-stable-no-mesh

Copy link
Member

@julz julz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me bar the typo! thanks for putting up with all the back and forth on this one :)

Comment on lines 315 to 316
// Add Activator probe header to the drainer so it can handle probes directly dfrom activator
drainer.HealthCheckUAPrefixes = []string{network.ActivatorUserAgent}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not very important, but I'm wondering why we do this here rather than when we create the drainer struct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason, moved it to be set in struct creation similar to the QuietPeriod

}

return pkgnet.NewServer(":"+env.QueueServingPort, composedHandler)
drainer.Inner = composedHandler
// Add Activator probe header to the drainer so it can handle probes directly dfrom activator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Add Activator probe header to the drainer so it can handle probes directly dfrom activator
// Add Activator probe header to the drainer so it can handle probes directly from activator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Member

@julz julz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2021
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: julz, nader-ziada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2021
@knative-prow-robot knative-prow-robot merged commit 525a4cb into knative:main Nov 5, 2021
if env.ServingEnableRequestLog {
composedHandler = requestLogHandler(logger, composedHandler, env)
healthcheckHandler = requestLogHandler(logger, healthcheckHandler, env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

composedHandler needs to be wrapped with the requestLogHandler here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*in addition to the healthcheckHandler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its required in both? the RequestLogTest is passing without it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should - otherwise normal requests to the user container aren't logged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/autoscale area/networking cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use pkg/drain for draining in Queue Proxy
5 participants