Some "TIMED OUT" jobs not restarting properly #425

Open
brendon-cavainolo opened this issue Jul 19, 2023 · 0 comments


Hello,

I'm seeing a bug when Maestro tries to restart jobs.

Say I've launched one study with 8 jobs. When these 8 jobs time out, roughly half of them restart successfully (meaning they are resubmitted to Slurm, the "TIMED OUT" status changes back to "RUNNING", and the number of restarts goes up).

The other half of the jobs never register as "TIMED OUT" and are never resubmitted to Slurm. The maestro status command still shows them as "RUNNING", but the number of restarts does not increase. These jobs also stop appearing in the study.log file.

Something to note is that the initial runs of these jobs typically all end within a few minutes of each other.
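
To check whether Slurm itself recorded a timeout for the jobs that Maestro still shows as "RUNNING", their states can be compared against Slurm's accounting output. The following is only a minimal sketch and not part of Maestro: the helper names are hypothetical, the job IDs have to be supplied by hand, and it assumes `sacct` is available on the cluster and reports timed-out jobs with the state `TIMEOUT`.

```python
import subprocess


def parse_sacct(output):
    """Map job ID -> state from `sacct --noheader --parsable2 --format=JobID,State` output."""
    states = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        # Skip job steps such as "101.batch" or "101.0"; keep only the allocation.
        if "." in job_id:
            continue
        # Some states carry a suffix (e.g. "CANCELLED by 1000"); keep the first word.
        states[job_id] = state.split()[0] if state else ""
    return states


def timed_out_jobs(states):
    """Return the sorted job IDs that Slurm reports as TIMEOUT."""
    return sorted(job for job, state in states.items() if state == "TIMEOUT")


def query_sacct(job_ids):
    """Run sacct for the given job IDs (requires a Slurm installation)."""
    cmd = [
        "sacct", "--noheader", "--parsable2",
        "--format=JobID,State", "-j", ",".join(job_ids),
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `timed_out_jobs(parse_sacct(query_sacct([...])))` with the Slurm job IDs of the stuck steps would show which of them Slurm considers timed out while maestro status still reports "RUNNING".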

Hopefully I've provided enough information here to help figure this out.
