After an async flush has started, an application must make another SCR call to finalize that flush. Even after the async flush has copied all files, the output set is not valid until it has been finalized. The calls that finalize async flushes are: SCR_Start_output, SCR_Complete_output, and SCR_Finalize.
If an application using async flush does not write checkpoints frequently, a failure may well occur after all files have been copied but before the flush has been finalized. In this case, SCR will roll back to an earlier checkpoint when restarting the application. This is a shame, since the hard work of copying all of the files is already done.
It would be nice to extend SCR_Init so that SCR can detect an async flush that has finished copying but has not yet been marked as complete. To do this, we could check the size of each file (if we trust POSIX semantics), or we could have each rank write an additional "done" flag to the file system. On restart, SCR_Init could look for these markers and update the status of the checkpoint if it finds that all files were successfully copied. Something similar could be added to scavenge.
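To make the two detection options concrete, here is a minimal sketch in Python. It is not SCR code: the function names, the `.done.<rank>` marker naming, and the `expected_sizes` mapping are all hypothetical, chosen only to illustrate the idea of checking file sizes or per-rank "done" flags on restart.

```python
import os
import tempfile


def write_done_marker(dataset_dir, rank):
    """Hypothetical: a rank calls this once all of its files are copied."""
    path = os.path.join(dataset_dir, f".done.{rank}")
    with open(path, "w") as f:
        f.write("done\n")
        f.flush()
        os.fsync(f.fileno())  # ask the OS to persist the marker
    return path


def all_markers_present(dataset_dir, num_ranks):
    """Option 2: every rank left a 'done' flag in the dataset directory."""
    return all(
        os.path.exists(os.path.join(dataset_dir, f".done.{rank}"))
        for rank in range(num_ranks)
    )


def sizes_match(dataset_dir, expected_sizes):
    """Option 1: every file exists and has its expected size in bytes.

    expected_sizes is an assumed mapping of file name to byte count,
    e.g. recorded when the flush was started.
    """
    for name, size in expected_sizes.items():
        try:
            if os.path.getsize(os.path.join(dataset_dir, name)) != size:
                return False
        except OSError:  # file is missing or unreadable
            return False
    return True
```

On restart, a check like this could run before deciding to roll back: if either test passes for the most recent output set, its status could be updated to complete. The size check needs no extra files but relies on the file system reporting sizes accurately, while the marker approach costs one small write per rank.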
In the meantime, it could be useful to add checks to calls like SCR_Need_checkpoint and SCR_Should_exit, which an application may call more frequently. In that case, one might need to configure how often SCR checks, since polling for completion may be expensive on some systems. For example, if time steps are short compared to the polling cost, we would not want to poll after every time step.
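The configurable check interval could look something like the following sketch, again in Python and not taken from SCR. `PollThrottle` and `min_interval_secs` are hypothetical names; the idea is simply that a frequently called function asks the throttle whether enough time has passed before doing the expensive completion poll.

```python
import time


class PollThrottle:
    """Allow an expensive poll at most once per min_interval_secs."""

    def __init__(self, min_interval_secs, clock=time.monotonic):
        self.min_interval_secs = min_interval_secs
        self.clock = clock      # injectable so the logic can be tested
        self.last_poll = None   # time of the most recent allowed poll

    def should_poll(self):
        now = self.clock()
        if self.last_poll is None or now - self.last_poll >= self.min_interval_secs:
            self.last_poll = now
            return True
        return False
```

With an interval of, say, 10 seconds, an application whose time steps take 1 second would trigger the completion poll on roughly every tenth call rather than on every call.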