Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle stale BeforeHookCreation resources as progressing #461

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nazarewk
Copy link

@nazarewk nazarewk commented Sep 13, 2022

TLDR; DELETE event during handling of BeforeHookCreation deletion policy can be delayed while etcd is running garbage collection:
1. at the same time "finished" hook is present in the cache,
2. gitops-engine does not verify creation timestamp of BeforeHookCreation objects against start time of sync operation,
3. gitops-engine is unaware (old) object should be purged from cache and happily proceeds to next Sync Wave

The PR fixes this specific failure mode with minimal changes to the code.

The change is impossible to reproduce as etcd's GC usually does not take long enough to trigger the issue. At the same time it is very difficult to debug.

the EKS cluster I observed issues at had tens of thousands objects with >1k objects churned every few minutes and still GC triggered the issue only few times a week

fixes #446
closes argoproj/argo-cd#10579
original issue argoproj/argo-cd#10077

@codecov
Copy link

codecov bot commented Sep 13, 2022

Codecov Report

Base: 55.38% // Head: 55.59% // Increases project coverage by +0.21% 🎉

Coverage data is based on head (e4d98aa) compared to base (517c1ff).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #461      +/-   ##
==========================================
+ Coverage   55.38%   55.59%   +0.21%     
==========================================
  Files          41       41              
  Lines        4478     4493      +15     
==========================================
+ Hits         2480     2498      +18     
+ Misses       1807     1804       -3     
  Partials      191      191              
Impacted Files Coverage Δ
pkg/sync/sync_context.go 73.41% <100.00%> (+0.79%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

- fixes argoproj#446
- closes argoproj/argo-cd#10579
- original issue argoproj/argo-cd#10077

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
@nazarewk nazarewk marked this pull request as draft September 14, 2022 11:55
@nazarewk

This comment was marked as outdated.

Signed-off-by: Krzysztof Nazarewski <3494992+nazarewk@users.noreply.github.com>
@nazarewk nazarewk marked this pull request as ready for review September 14, 2022 13:30
@sonarcloud
Copy link

sonarcloud bot commented Sep 14, 2022

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@nazarewk
Copy link
Author

managed to fully test expected flow, PR is ready for review

@rbreeze rbreeze requested a review from alexmt October 20, 2022 15:50
@crenshaw-dev crenshaw-dev changed the title handle stale BeforeHookCreation resources as progressing fix: handle stale BeforeHookCreation resources as progressing Dec 9, 2022
Copy link
Contributor

@leoluz leoluz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @nazarewk for this work!

Please make sure to create a PR in Argo CD repo pointing to this change so we can validate that all integration and e2e tests are passing. It would be great to make sure that there is an existing integration or e2e test that validates this scenario. If such test is not yet implemented it would be awesome to add it in Argo CD test suite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: cache does not get invalidated on resource deletion?
3 participants