Handle scenarios where registered processes disappear outside regular flow #337
Conversation
Force-pushed from 3a46abf to 29f297c
```diff
@@ -53,7 +53,10 @@ def stop_heartbeat
   end

   def heartbeat
-    process.heartbeat
+    process.with_lock(&:heartbeat)
```
Looks like `with_lock` doesn't pass the object to the block. Seeing a `SolidQueue-0.8.2 Error in thread (0.0ms)` log entry with an `ArgumentError: no receiver given` error.

Suggested change:

```ruby
process.with_lock { process.heartbeat }
```
D'oh, I'm dumb, seriously. This was working on Sunday and I decided to break it yesterday without testing 😅 I'll fix it.
Haha no worries!
Huh, no, that was correct 😕 I was wondering why the tests were passing... and this is how `with_lock` is supposed to be used. The error must come from something else; I'll investigate and fix it.
Ah, no, no, your version is the correct one 😆 And it doesn't break the tests because the error doesn't happen
there! The first time the heartbeat happens, the process is gone so the block is never called. Ahh, I need a holiday soon 😅 😅
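The failure mode above can be shown with a minimal standalone sketch (the `with_lock` here is just a stand-in that yields nothing, like Rails' row-locking version does; it is not Solid Queue's actual code). `Symbol#to_proc` needs the receiver passed in as a block argument, so a block that is yielded no arguments has no receiver to call `heartbeat` on:

```ruby
# Stand-in for ActiveRecord's with_lock: like the real one, it yields
# no arguments to the block.
class RegisteredProcess
  def with_lock
    yield
  end

  def heartbeat
    "thump"
  end
end

process = RegisteredProcess.new

# Explicit receiver inside the block: works fine.
process.with_lock { process.heartbeat }

# Symbol#to_proc expects the receiver as the block's first argument,
# but with_lock yields nothing, so there is no receiver:
begin
  process.with_lock(&:heartbeat)
rescue ArgumentError => e
  puts e.message # "no receiver given"
end
```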
Ahhh! Yes, I was only considering the case where processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you're running multiple supervisors and one of them deregisters the other, but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if the server running your jobs suddenly decides to go to sleep) 🤔
That's a good point. I was just impatient and decided to nuke the records instead of waiting for my Mac to hibernate. Going the hibernation route, this never gets raised, so I think you can ignore me here.
For example, when coming back from being suspended and having its heartbeat expired.
As these are all common between all processes.
If, for some reason, the process failed a heartbeat and the supervisor pruned it, we shouldn't continue running. Just stop as if we had received a signal. This could be used in the future from Mission Control to stop a worker.
To guard against race conditions of the record being deleted precisely then.
Thanks to @npezza93 for catching this ^_^U
Force-pushed from fdb7bc0 to 6d7bc6f
Left-over from something I was rewriting.
We're now running this in production in HEY to make sure everything is ok.
This is a simple way to mitigate the effects of a computer going to sleep and then coming back to life, with processes whose heartbeats have expired but that are, nonetheless, alive. It came up as part of the quest to reproduce another, unrelated error in #324.

It's not trivial for the supervisor to know whether a process is actually alive or not, because the pid might have been reassigned by the OS to a different process. It can't rely on its registered supervisees either, because we might be running multiple supervisors. We could possibly check its current forks and, if the `pid` matches, skip the pruning, but I'm wary about introducing complicated logic there because the risk is ending up with dead registered processes and locked jobs.

Since this scenario should happen only in development, I opted for keeping the pruning logic the same (except for preventing the supervisor from pruning itself, as in that case the supervisor running the pruning is very much alive and running) and just having each process check whether its registered process record is gone when it heartbeats. If it's gone, it just stops as if a `TERM` or `INT` signal had been received. If that process was a supervised fork, its supervisor will replace it as soon as it realises it's gone.

This might be handy in the future as well, to stop a given worker from Mission Control.
Big thanks to @npezza93 for catching this scenario.
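The behaviour described above can be sketched with a simplified in-memory registry (an assumption for illustration only; Solid Queue's real process records are ActiveRecord-backed and the heartbeat happens under a row lock):

```ruby
# Simplified in-memory stand-in for the registered-processes table.
class Registry
  def initialize
    @records = {}
  end

  def register(pid)
    @records[pid] = Time.now
  end

  def deregister(pid)
    @records.delete(pid)
  end

  # Raises if the record is gone, mimicking a pruned process record.
  def heartbeat(pid)
    raise KeyError, "process #{pid} is not registered" unless @records.key?(pid)
    @records[pid] = Time.now
  end
end

class SupervisedProcess
  attr_reader :stopped

  def initialize(registry, pid)
    @registry = registry
    @pid = pid
    @stopped = false
    registry.register(pid)
  end

  # On each heartbeat, check whether our record still exists; if it was
  # pruned, stop as if a TERM or INT signal had been received.
  def heartbeat
    @registry.heartbeat(@pid)
  rescue KeyError
    stop
  end

  def stop
    @stopped = true
  end
end

registry = Registry.new
process = SupervisedProcess.new(registry, 42)

process.heartbeat
puts process.stopped # false: record still present, heartbeat succeeded

registry.deregister(42) # e.g. the supervisor pruned an expired heartbeat
process.heartbeat
puts process.stopped # true: the process winds down instead of running on
```

If the process was a supervised fork, stopping here is safe: its supervisor will notice it's gone and replace it, as the description above notes.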