Our scr_run.py script currently launches the user job with the launcher process via subprocess.Popen. There are a few challenges with this:
Currently, we buffer all stdout and stderr and print them only at the end. Users will want us to print this output as the job runs, or at least more frequently, since people want to monitor their output while the job is running.
In some cases, users may also need to forward stdin?
Running with profilers/debuggers may be complicated, since those need to wrap the launcher, as in totalview srun -a ...
It would be good to look into solutions for the above.
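One possible direction for the first and third points, as a rough sketch rather than what scr_run.py does today: read the child's output line by line and echo it immediately instead of collecting it until exit. The function name `run_streaming` and its `wrapper` parameter are hypothetical. Because `stdin` is not redirected here, the child inherits the parent's stdin, which may be enough for simple forwarding.

```python
import subprocess
import sys


def run_streaming(argv, wrapper=None):
    """Launch argv and echo its output line by line as it is produced.

    wrapper is a hypothetical optional prefix such as ["totalview"], so that
    a debugger or profiler can wrap the launcher command.
    Returns (exit code, all captured output).
    """
    cmd = list(wrapper or []) + list(argv)
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering simple
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    captured = []
    for line in proc.stdout:
        sys.stdout.write(line)  # print right away instead of at the end
        sys.stdout.flush()
        captured.append(line)
    return proc.wait(), "".join(captured)
```

Note that the child process may still block-buffer its own output when it detects that it is writing to a pipe rather than a terminal, so this only removes the buffering on our side.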
As a fallback, and perhaps as the recommended approach, we should also ensure that people can continue to use their existing job scripts and just add a few additional commands to integrate with SCR. At the least, I think we want to allow users to invoke:
scr_prerun - to prepare the allocation for SCR
scr_list_down_nodes - to rely on SCR to test for node health and return a list of down or healthy nodes. Leave it to the user to then incorporate that list into a relaunch command. Documentation here can help, e.g., pointing users to srun -x <downnodes> as a way to avoid certain nodes with srun.
scr_should_exit - to determine whether to stop the run. This will check that there are enough healthy nodes, enough time, and verify that an SCR halt condition has not been set.
scr_postrun - to check for and scavenge any cached datasets
For users with bash job scripts, we want these commands to return 0/1 exit codes. Output like the node list should be printed to stdout, and it should be formatted in a way to make it easy for the user to integrate, e.g., potentially format the down node list differently for srun vs jsrun.
For users with python job scripts, we get bonus points if they can import and use SCR modules. For the first pass, let's just stick with requiring the user's python job script to invoke these as commands like the bash job scripts do.
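To make the first-pass Python case concrete, here is a rough sketch of a job script that calls the four commands through `subprocess`. Several things are assumptions, not settled behavior: the commands are invoked with no arguments, `scr_list_down_nodes` prints a list that can be passed directly to `srun -x`, and `scr_should_exit` returning 0 means "stop the run". The names `run_cmd`, `job_loop`, `max_runs`, and `./my_app` are hypothetical.

```python
import subprocess


def run_cmd(argv):
    """Run a command and return (exit code, stripped stdout)."""
    done = subprocess.run(argv, stdout=subprocess.PIPE, text=True)
    return done.returncode, done.stdout.strip()


def job_loop(app_argv, run=run_cmd, max_runs=3):
    """Launch app_argv up to max_runs times; return the launch commands used."""
    launched = []
    rc, _ = run(["scr_prerun"])  # prepare the allocation for SCR
    if rc != 0:
        return launched
    for _ in range(max_runs):
        # ASSUMPTION: stdout is a node list that srun -x accepts as-is.
        _, down = run(["scr_list_down_nodes"])
        launch = ["srun"] + (["-x", down] if down else []) + list(app_argv)
        launched.append(launch)
        run(launch)
        # ASSUMPTION: exit code 0 means "yes, stop the run".
        rc, _ = run(["scr_should_exit"])
        if rc == 0:
            break
    run(["scr_postrun"])  # check for and scavenge any cached datasets
    return launched
```

A script would call something like `job_loop(["./my_app"])`. The `run` parameter exists only so the loop can be exercised with a stand-in runner on a machine without SCR or srun installed.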