Handle scenarios where registered processes disappear outside regular flow #337
Conversation
Force-pushed from 3a46abf to 29f297c
```diff
@@ -53,7 +53,10 @@ def stop_heartbeat
   end

   def heartbeat
-    process.heartbeat
+    process.with_lock(&:heartbeat)
```
Looks like `with_lock` doesn't pass the object to the block. Seeing a `SolidQueue-0.8.2 Error in thread (0.0ms)` log entry with an `ArgumentError: no receiver given` error.

Suggested change:

```ruby
process.with_lock { process.heartbeat }
```
D'oh, I'm dumb, seriously. This was working on Sunday and I decided to break it yesterday without testing 😅 I'll fix it.
Haha no worries!
Huh, no, that was correct 😕 I was wondering why the tests were passing... and this is how `with_lock` is supposed to be used. The error must come from something else; I'll investigate and fix it.
Ah, no, no, your version is the correct one 😆 And it doesn't break the tests because the error doesn't happen
there! The first time the heartbeat happens, the process is gone so the block is never called. Ahh, I need a holiday soon 😅 😅
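The failure mode above can be shown with a minimal standalone sketch (the `with_lock` here is just a stand-in that yields nothing, like Rails' row-locking version does; it is not Solid Queue's actual code). `Symbol#to_proc` needs the receiver passed in as a block argument, so a block that is yielded no arguments has no receiver to call `heartbeat` on:

```ruby
# Stand-in for ActiveRecord's with_lock: like the real one, it yields
# no arguments to the block.
class RegisteredProcess
  def with_lock
    yield
  end

  def heartbeat
    "thump"
  end
end

process = RegisteredProcess.new

# Explicit receiver inside the block: works fine.
process.with_lock { process.heartbeat }

# Symbol#to_proc expects the receiver as the block's first argument,
# but with_lock yields nothing, so there is no receiver:
begin
  process.with_lock(&:heartbeat)
rescue ArgumentError => e
  puts e.message # "no receiver given"
end
```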
Ahhh! Yes, I was only considering the case where processes get deregistered because they lost their heartbeats, not the case where you go and delete them all. I made sure the supervisor would never get deregistered by itself, so it should work... it might be the case that you're running multiple supervisors and one of them deregisters the other, but I think you wouldn't do that in development (which is the case I'm handling here, as you'd expect things to break in production if the server running your jobs suddenly decides to go to sleep) 🤔
That's a good point. I was just impatient and decided to nuke the records instead of waiting for my Mac to hibernate. Going the hibernation route, this never gets raised, so I think you can ignore me here.
For example, when coming back from being suspended and having its heartbeat expired.
As these are all common between all processes.
If, for some reason, the process failed a heartbeat and the supervisor pruned it, we shouldn't continue running. Just stop as if we had received a signal. This could be used in the future from Mission Control to stop a worker.
To guard against race conditions of the record being deleted precisely then.
Thanks to @npezza93 for catching this ^_^U
Force-pushed from fdb7bc0 to 6d7bc6f
Left-over from something I was rewriting.
We're now running this in production in HEY to make sure everything is ok.
This is a simple way to mitigate the effects of a computer going to sleep and then coming back to life, with processes whose heartbeats have expired but that are, nonetheless, alive. It came up as part of the quest to reproduce another, unrelated error in #324.

It's not trivial for the supervisor to know whether a process is actually alive or not, because the pid might have been reassigned by the OS to a different process. It can't rely on its registered supervisees either, because we might be running multiple supervisors. We could possibly check its current forks and, if the `pid` matches, skip the pruning, but I'm wary about introducing complicated logic there because the risk is ending up with dead registered processes and locked jobs.

Since this scenario should happen only in development, I opted for keeping the pruning logic the same (except for preventing the supervisor from pruning itself, as in that case the supervisor running the pruning is very much alive and running) and just having each process check whether its registered process record is gone when it heartbeats. If it's gone, it just stops as if a `TERM` or `INT` signal had been received. If that process was a supervised fork, its supervisor will replace it as soon as it realises it's gone.

This might be handy in the future as well, to stop a given worker from Mission Control.
Big thanks to @npezza93 for catching this scenario.
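The behaviour described above can be sketched with a simplified in-memory registry (an assumption for illustration only; Solid Queue's real process records are ActiveRecord-backed and the heartbeat happens under a row lock):

```ruby
# Simplified in-memory stand-in for the registered-processes table.
class Registry
  def initialize
    @records = {}
  end

  def register(pid)
    @records[pid] = Time.now
  end

  def deregister(pid)
    @records.delete(pid)
  end

  # Raises if the record is gone, mimicking a pruned process record.
  def heartbeat(pid)
    raise KeyError, "process #{pid} is not registered" unless @records.key?(pid)
    @records[pid] = Time.now
  end
end

class SupervisedProcess
  attr_reader :stopped

  def initialize(registry, pid)
    @registry = registry
    @pid = pid
    @stopped = false
    registry.register(pid)
  end

  # On each heartbeat, check whether our record still exists; if it was
  # pruned, stop as if a TERM or INT signal had been received.
  def heartbeat
    @registry.heartbeat(@pid)
  rescue KeyError
    stop
  end

  def stop
    @stopped = true
  end
end

registry = Registry.new
process = SupervisedProcess.new(registry, 42)

process.heartbeat
puts process.stopped # false: record still present, heartbeat succeeded

registry.deregister(42) # e.g. the supervisor pruned an expired heartbeat
process.heartbeat
puts process.stopped # true: the process winds down instead of running on
```

If the process was a supervised fork, stopping here is safe: its supervisor will notice it's gone and replace it, as the description above notes.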