Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats #85

skunkworker · 2024-04-18T20:13:06Z

This MR does two things:

Fixes compatibility with the Sidekiq web UI. Heartbeats were not auto-expiring, and the jobs for an individual worker were not being written to the correct key.
Adds HeartBeatLastPushedAt to the manager stats so you can build a liveness probe in case the worker process crashes.

Note:
I've commented out some code that reads the heartbeats. From my observations of sidekiq.rb's heartbeat code, the heartbeat is never read, only pushed to. If the recently added code that reads heartbeats is necessary, I would recommend moving it into a different data structure.

skunkworker · 2024-04-18T22:12:38Z

I successfully implemented this check. The one drawback is that you are required to do an IsZero() check: the field starts out as the zero time, and it would be incorrect to set it to Now() upon initialization.

// Check fails when the last heartbeat push is more than 30 seconds old.
// It needs the context, errors, fmt, log, and time packages.
func (p *sidekiqChecker) Check(ctx context.Context) error {
	stats, err := p.manager.GetStats()
	if err != nil {
		return err
	}
	// A zero timestamp means no heartbeat has been pushed yet, so skip the check.
	if last := stats.HeartbeatLastPushedAt; !last.IsZero() && time.Since(last) > 30*time.Second {
		msg := fmt.Sprintf("heartbeat last pushed at time is too old, and probably lost redis connection. heartBeatLastPushedAt %s", last)
		log.Println(msg)
		return errors.New(msg)
	}
	return nil
}

stefannegrea · 2024-05-03T14:08:15Z

Thank you for the changes, this is a really solid update. However, I have a question about the commented-out code.

stefannegrea · 2024-05-03T14:08:24Z

manager.go

-				}
-			}
+
+			//expireTS := heartbeatTime.Add(-m.opts.Heartbeat.HeartbeatTTL).Unix()


What should be done with the code that was commented out? Do you recommend moving it into a new afterheartbeathook? Without that code, work in progress from workers with expired heartbeats will never be enqueued back for processing.

skunkworker · 2024-05-06

I think we need to separate the visual heartbeat UI code from the recoverable heartbeat (i.e., what a node is working on).

From what I've observed of Sidekiq's Ruby behavior (at least v6):

Heartbeats are for visual presentation of what a node is working on and rely on Redis TTL expiration. AFAIK they are never read for consumption or resumption. Using them in a mixed Ruby/Sidekiq environment is very useful for at-a-glance observability.

Each node creates a specially named queue to hold its in-progress work, and that queue is checked on boot in order to properly recover a killed process. BRPOPLPUSH (now BLMOVE) should be used to move the item from the main queue into the specific worker's queue. If the process is then killed, that queue is read on boot, and any items present are assumed to be incomplete and are restarted (with the base assumption that all jobs are idempotent). This is where the additional logic that was attached to the heartbeat code could live. I will need to reread what the heartbeat hook logic is intended to achieve (my current mental model is more event-based than poll-based).
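
The fetch, acknowledge, and recover-on-boot flow described above can be sketched with an in-memory stand-in for Redis lists. Everything here (the store type, its methods, and the key names) is hypothetical and only models the idea. With a real Redis client the move step would be a single atomic BLMOVE rather than a mutex-guarded slice operation.

```go
package main

import "sync"

// store is an in-memory stand-in for Redis lists, used only to illustrate the flow.
type store struct {
	mu    sync.Mutex
	lists map[string][]string
}

func newStore() *store { return &store{lists: map[string][]string{}} }

// push appends a job to the tail of a list.
func (s *store) push(key, job string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lists[key] = append(s.lists[key], job)
}

// move atomically pops the head of src and appends it to dst (the BLMOVE analogue).
// ok is false when src is empty.
func (s *store) move(src, dst string) (job string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	q := s.lists[src]
	if len(q) == 0 {
		return "", false
	}
	job = q[0]
	s.lists[src] = q[1:]
	s.lists[dst] = append(s.lists[dst], job)
	return job, true
}

// ack removes a finished job from the worker's in-progress list.
func (s *store) ack(inProgress, job string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	q := s.lists[inProgress]
	for i, j := range q {
		if j == job {
			s.lists[inProgress] = append(append([]string{}, q[:i]...), q[i+1:]...)
			return
		}
	}
}

// recoverOnBoot moves everything left in the in-progress list back to the
// main queue and returns how many jobs were requeued.
func (s *store) recoverOnBoot(inProgress, queue string) int {
	n := 0
	for {
		if _, ok := s.move(inProgress, queue); !ok {
			return n
		}
		n++
	}
}
```

A worker would call move to fetch, ack after a job finishes, and recoverOnBoot once at startup. Any job that was fetched but never acknowledged is run again, which is why the idempotency assumption matters.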

keitabyte · 2024-06-05

Thank you for looking into Sidekiq's behavior. When you mention "after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted," does this mean that work will resume only when the corresponding manager of the queue is restarted? We have a use case with a cluster of managers: if a given manager goes down for a prolonged period with work in its active queue, we would like that work to be picked up by another manager.

For what it's worth, our current stance is that we are okay with deviating from an exact port of Sidekiq, which is why we introduced a use case for reading the heartbeat. That said, if there are things we can do to maintain compatibility with the Sidekiq UI, we agree that's preferable. Is there anything in the new heartbeat changes, such as expiring via polling instead of TTL, that breaks compatibility with the UI?

skunkworker · 2024-08-07

From what I understand of how Sidekiq works:

The jobs that a given process is currently working on should be atomically moved into another data structure (separate from the heartbeat) that can be read when the process is restarted.
Having the heartbeat separated from the manager's queue makes it easier to check for orphaned jobs: if the heartbeat has expired but a cron process discovers a LIST of jobs in memory, you can assume that those jobs should be restarted by another manager.

This also has the added benefit of maintaining compatibility with the Sidekiq UI, as current jobs for a given process do not live in the same key as the heartbeat and current job details.

Looking into the docs for super_fetch, there is logic for orphaned job discovery and reprocessing. I believe it does a SCAN looking for job sets for any host whose accompanying heartbeat has expired (by TTL), as this indicates that there are orphaned jobs to be picked up.

https://github.com/sidekiq/sidekiq/wiki/Reliability#recovering-jobs

I can refactor this to a more middle-ground approach, but my end goal is to allow discovery of orphaned jobs from persistent hosts (same name upon restart) as well as dynamically named hosts (e.g., replicas in k8s).

add heartbeat last pushed at to Stats

50ce269

skunkworker changed the title ~~add heartbeat last pushed at to Stats~~ Add HeartBeatLastPushedAt to Manager Stats Apr 18, 2024

readd expiring workers

f19a461

skunkworker changed the title ~~Add HeartBeatLastPushedAt to Manager Stats~~ Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats Apr 19, 2024

skunkworker added 4 commits April 19, 2024 00:35

remove inline worker heartbeats

1b607c5

fix err var

308ad15

revert more changes

1bd2fb9

more cleanup

ecd6179

stefannegrea reviewed May 3, 2024

skunkworker mentioned this pull request Aug 7, 2024

Draft: Add unique enqueue #89

Open

skunkworker May 6, 2024

keitabyte Jun 5, 2024

skunkworker Aug 7, 2024
