
chore: using domain-qualified finalizers #6023

Open
wants to merge 3 commits into master

Conversation

trutx (Contributor) commented Nov 18, 2024

Tracking issue

Closes #6019

Why are the changes needed?

This PR switches to domain-qualified finalizers. Kubernetes introduced a warning for non-qualified finalizer names in kubernetes/kubernetes#119508, so using the old finalizers is harmless today, but updating them keeps the Flyte admin logs clean and gets ahead of any future enforcement of qualified finalizer names.

What changes were proposed in this pull request?

  • Switch to a global domain-qualified finalizer: from flyte-finalizer to flyte.lyft.com/finalizer
  • Switch the k8s plugin finalizer from flyte/flytek8s to flyte.lyft.com/finalizer-k8s
  • Switch the array plugin finalizer from flyte/array to flyte.lyft.com/finalizer-array
  • Remove the finalizers.go and finalizers_test.go files and rely on the finalizer helpers in the upstream controllerutil package instead (see the sketch after this list)
  • Keep removing the old finalizer for backwards compatibility; it is no longer added. This removal logic should eventually be dropped
  • Stop removing all finalizers and remove only Flyte's own, which allows users to add their own finalizers
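
A minimal sketch of what the controllerutil-based handling looks like. The finalizer names are the ones listed above; the Pod object and the printed values are illustrative scaffolding only, and the boolean return values are as in recent sigs.k8s.io/controller-runtime releases:

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    // The new domain-qualified finalizer names from the list above.
    const (
        finalizer      = "flyte.lyft.com/finalizer"
        finalizerK8s   = "flyte.lyft.com/finalizer-k8s"
        finalizerArray = "flyte.lyft.com/finalizer-array"
    )

    func main() {
        pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "demo"}}

        // AddFinalizer returns true only when it actually mutated the object,
        // so callers know whether an Update call is needed at all.
        if controllerutil.AddFinalizer(pod, finalizerK8s) {
            fmt.Println("finalizers:", pod.GetFinalizers()) // [flyte.lyft.com/finalizer-k8s]
        }

        fmt.Println("present:", controllerutil.ContainsFinalizer(pod, finalizerK8s)) // true
    }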

How was this patch tested?

Unit tests were modified to check for the presence/absence of the new finalizers. Some new tests were added, and some existing tests were fixed so that the clearFinalizer() func actually runs.
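
As a rough illustration of the assertion shape (a hypothetical test, not the PR's actual code): only Flyte's own finalizer is removed, while user-added finalizers stay in place.

    package k8s_test

    import (
        "testing"

        "github.com/stretchr/testify/assert"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    func TestRemoveOnlyFlyteFinalizer(t *testing.T) {
        const flyteFinalizer = "flyte.lyft.com/finalizer-k8s"

        o := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
            Name:       "demo",
            Finalizers: []string{flyteFinalizer, "example.com/user-finalizer"},
        }}

        // RemoveFinalizer reports whether it mutated the object.
        assert.True(t, controllerutil.RemoveFinalizer(o, flyteFinalizer))

        // Flyte's finalizer is gone, the user's finalizer is untouched.
        assert.NotContains(t, o.GetFinalizers(), flyteFinalizer)
        assert.Contains(t, o.GetFinalizers(), "example.com/user-finalizer")
    }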

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 53.65854% with 19 lines in your changes missing coverage. Please review.

Project coverage is 36.99%. Comparing base (d1a723e) to head (202cce9).
Report is 54 commits behind head on master.

Files with missing lines | Patch % | Lines
flyteplugins/go/tasks/plugins/array/k8s/subtask.go | 0.00% | 12 Missing and 1 partial ⚠️
...er/pkg/controller/nodes/task/k8s/plugin_manager.go | 62.50% | 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6023      +/-   ##
==========================================
- Coverage   37.03%   36.99%   -0.04%     
==========================================
  Files        1313     1317       +4     
  Lines      131622   132471     +849     
==========================================
+ Hits        48742    49006     +264     
- Misses      78652    79210     +558     
- Partials     4228     4255      +27     
Flag | Coverage Δ
unittests-datacatalog | 51.58% <ø> (ø)
unittests-flyteadmin | 54.05% <100.00%> (-0.03%) ⬇️
unittests-flytecopilot | 30.99% <ø> (+8.76%) ⬆️
unittests-flytectl | 62.29% <ø> (-0.18%) ⬇️
unittests-flyteidl | 7.24% <ø> (-0.02%) ⬇️
unittests-flyteplugins | 53.84% <0.00%> (+0.16%) ⬆️
unittests-flytepropeller | 42.64% <77.77%> (-0.47%) ⬇️
unittests-flytestdlib | 55.18% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>
},
}

assert.NoError(t, fakeKubeClient.GetClient().Create(ctx, o))

p.OnBuildIdentityResource(ctx, tctx.TaskExecutionMetadata()).Return(o, nil)
- pluginManager := PluginManager{plugin: &p, kubeClient: fakeKubeClient}
+ pluginManager := PluginManager{plugin: &p, kubeClient: fakeKubeClient, updateBackoffRetries: 5}
Contributor Author

This updateBackoffRetries parameter ends up being e.updateBackoffRetries in

	retryBackoff := wait.Backoff{
		Duration: time.Duration(e.updateBaseBackoffDuration) * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
		Steps:    e.updateBackoffRetries,
	}

If this is unset (as it was in the tests), then Steps is 0, which means the _ = wait.ExponentialBackoff(retryBackoff, func() (bool, error) { call inside Finalize() times out immediately without ever running the condition function: e.clearFinalizer() is never called and the finalizer is never removed. The default setting is 5, but I think it's too dangerous a) to allow users to tweak these backoff settings, or b) to use an exponential backoff here at all.
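
A tiny standalone sketch (using only k8s.io/apimachinery, not propeller's code) illustrating the Steps: 0 behaviour described above:

    package main

    import (
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func main() {
        // Same shape as the retryBackoff above, but Steps is left at its zero
        // value, as it effectively was in the tests before this change.
        retryBackoff := wait.Backoff{
            Duration: time.Duration(1000) * time.Millisecond,
            Factor:   2.0,
            Jitter:   0.1,
            Steps:    0,
        }

        calls := 0
        err := wait.ExponentialBackoff(retryBackoff, func() (bool, error) {
            calls++ // stands in for e.clearFinalizer()
            return true, nil
        })

        // Prints "calls=0 err=timed out waiting for the condition":
        // the condition (and therefore clearFinalizer) is never invoked.
        fmt.Printf("calls=%d err=%v\n", calls, err)
    }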

Thoughts?

Contributor

The original context of this change had to do with informers getting stale info in the case of array nodes.

cc: @pvditt

> to allow users to tweak these backoff settings

Do you mean we should validate that it's a strictly positive value, right?

> to use an exponential backoff at all.

Can you expand on that?

flytepropeller/pkg/controller/controller.go (outdated review thread, resolved)

trutx (Contributor, Author) commented Dec 12, 2024

> Do you mean we should validate that it's a strictly positive value, right?

That, or completely remove the ability to set an arbitrary value there. I can picture situations where too short a backoff expires while the k8s client inside the backoff func is trying to talk to a busy, lagging, or high-latency k8s apiserver.

> Can you expand on that?

What I meant is that we first have to decide whether it's OK to leave to-be-deleted objects in the cluster with a finalizer nobody is going to remove. If that's OK, we don't need a backoff at all: we just need to try once, and we have to make sure no ultra-fast backoff gives up before that single try has even been performed.

But OTOH, if we don't want to leave any garbage behind, maybe we should retry indefinitely instead. I am up for exponentially spacing the retries so we don't hammer the apiserver, but I think such retries should keep happening until the removal finally succeeds.
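
For the sake of discussion, a minimal sketch of what "retry until the removal succeeds, with capped exponential spacing" could look like. clearFinalizer here is a placeholder passed in as a function, not propeller's actual code:

    package main

    import (
        "context"
        "time"
    )

    // Hypothetical helper, not the PR's code: retry finalizer removal until it
    // succeeds, doubling the pause between attempts up to a cap so the apiserver
    // isn't hammered. Gives up only if the surrounding context is cancelled.
    func removeFinalizerForever(ctx context.Context, clearFinalizer func(context.Context) error) error {
        delay := 100 * time.Millisecond
        const maxDelay = 30 * time.Second
        for {
            if err := clearFinalizer(ctx); err == nil {
                return nil // finalizer removed; the object can now be deleted
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(delay):
            }
            if delay *= 2; delay > maxDelay {
                delay = maxDelay
            }
        }
    }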

However I think all of this is outside the scope of this PR :D

Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>
Labels: None yet
Projects: Status: Cold PRs
Development: Successfully merging this pull request may close: [Housekeeping] Use domain-qualified finalizers
2 participants