Add cron job to do HTTP PUT /jobtimeoutcheck that checks for timeout conditions and performs corrective action #80

Open
4 tasks
Tracked by #70
ericpassmore opened this issue Dec 16, 2023 · 2 comments

Comments

Collaborator

ericpassmore commented Dec 16, 2023

Timeout check needs to

Tasks

@ericpassmore ericpassmore mentioned this issue Dec 16, 2023
3 tasks
@ericpassmore ericpassmore added this to the Leap v6.0.0 milestone Dec 16, 2023
@ericpassmore
Collaborator Author

To support AWS spot instances, we need to recover orphaned jobs. This is accomplished via a timeout check. Each job's status and update time are refreshed every few minutes with the most recent block, so every WORKING job should have a recent update time.
Add a cron job that issues HTTP PUT /jobtimeoutcheck, which checks for timeout conditions and performs corrective action.
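
A minimal sketch of the check described above, not the project's actual implementation. It assumes a job record carries a status and a last-update timestamp; the `Job` fields, the 15-minute threshold, the `COMPLETE` status name, and the base URL passed to `trigger_timeout_check` are all hypothetical. The corrective action is left out because the issue does not say what it should be.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
import urllib.request

# Assumption: the issue gives no threshold; 15 minutes is illustrative only.
TIMEOUT = timedelta(minutes=15)


@dataclass
class Job:
    job_id: int             # hypothetical field name
    status: str             # "WORKING" comes from the issue; other values are assumed
    last_updated: datetime  # time of the most recent status/block update


def find_timed_out_jobs(jobs, now, timeout=TIMEOUT):
    """Return WORKING jobs whose last update is older than `timeout`."""
    return [
        job for job in jobs
        if job.status == "WORKING" and now - job.last_updated > timeout
    ]


def trigger_timeout_check(base_url):
    """What a cron-invoked script might do: issue PUT /jobtimeoutcheck.

    `base_url` is an assumption; the issue only names the path and method.
    """
    request = urllib.request.Request(f"{base_url}/jobtimeoutcheck", method="PUT")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


# Example data: job 2 has not reported for 45 minutes and looks orphaned.
now = datetime(2023, 12, 16, 12, 0, tzinfo=timezone.utc)
jobs = [
    Job(1, "WORKING", now - timedelta(minutes=3)),
    Job(2, "WORKING", now - timedelta(minutes=45)),
    Job(3, "COMPLETE", now - timedelta(hours=2)),
]
orphaned = find_timed_out_jobs(jobs, now)
print([job.job_id for job in orphaned])  # prints [2]
```

Passing `now` in explicitly keeps the check deterministic and easy to test. The threshold should sit comfortably above the "every few minutes" update interval so a healthy job is not flagged.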

@ericpassmore
Collaborator Author

Tagged with faster-replay because spot instances should allow 2x more nodes.
