-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats #85
Open
skunkworker
wants to merge
6
commits into
digitalocean:master
Choose a base branch
from
mxenabled:add_heartbeat_lash_pushed_at
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
50ce269
add heartbeat last pushed at to Stats
skunkworker f19a461
readd expiring workers
skunkworker 1b607c5
remove inline worker heartbeats
skunkworker 308ad15
fix err var
skunkworker 1bd2fb9
revert more changes
skunkworker ecd6179
more cleanup
skunkworker File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, what should be done with the code that was commented out? Do you recommend to move that into an new
afterheartbeathook
? Without that code work in progress from workers with expired heartbeats will never be enqueued back for processing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to separate out the visual heartbeat ui code from the recoverable heartbeat (eg, what a node is working on) .
From what I've observed in the Sidekiq ruby behavior. (At least v6)
Heartbeats are for visual presentation of what a node is working on and utilize redis TTL expiration, and AFAIK are never read for consumption and resumption. Utilizing this in a mixed ruby/sidekiq environment is very useful for at-a-glance observability.
A specially named queue is created by each node to contain the in-progress work that is looked for upon boot, in order to properly recover a killed process. BRPOPLPUSH now BLMOVE, should be utilized to move the item from the queue into the specific worker's queue, after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted (along with the base assumption that all jobs are idempotent). This is where the additional logic that was attached to the heartbeat code could live. I will need to reread what the goals of the heartbeat hook logic are intended for (as my current mental model is more event based then poll based).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking into Sidekiq's behavior. When you mention "after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted," does this mean that work will resume when the corresponding manager of the queue is restarted? We have a use case where we have a cluster of managers, and if a given manager goes down with work in its active queue for a prolonged period of time, we would like for that work that to be picked up by another manager.
For what it's worth, our current stance is we are okay with deviating from an exact port of Sidekiq, and this is why we introduced a use case for reading the heartbeat. That said, if there are things we can do to maintain compatibility with the Sidekiq UI, we agree that's preferable. Is there anything with the new heartbeat changes, such as expiring via polling instead of TTL, which breaks compatibility with the UI somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand of how sidekiq works:
The current jobs that a given process is processing should be atomically moved into another data structure (not around heartbeat) that can be read upon the process being restarted.
Having the heartbeat separated from the manager's queue allows you to check for orphaned jobs easier, because if the heartbeat has expired, but a cron process discovers a
LIST
of jobs in memory, you can assume that those jobs should be restarted by another manager.This also has the added benefit of maintaining compatibility with the sidekiq UI, as current jobs for a given process do not live in the same key as the heartbeat and current job details.
Looking into the docs around
super_fetch
there is logic around orphaned job discovery and reprocessing. I believe it will do aSCAN
looking for job sets for any host where the accompanying heartbeat has expired (by ttl), as this indicates that there are orphaned jobs to be picked up.https://github.com/sidekiq/sidekiq/wiki/Reliability#recovering-jobs
I can refactor this to a more middle ground approach, but my end goal is to allow for discovery of orphaned jobs from persistent hosts (same name upon restart) along with dynamically named hosts (eg: replicas in k8s).