I'm having a weird bug happen when Maestro tries to restart jobs.
Say I've launched one study with 8 jobs. When these 8 jobs time out, roughly half of them restart successfully (meaning they are resubmitted to Slurm, the "TIMED OUT" status changes back to "RUNNING", and the number of restarts goes up).
The other half of the jobs never register as "TIMED OUT" and are never resubmitted to Slurm. The `maestro status` command still shows them as "RUNNING", but their number of restarts is not incremented. These jobs also no longer show up in the `study.log` file.
Something to note is that the initial runs of these jobs typically all end within a few minutes of each other.
Hopefully I've provided enough information here to help figure this out.