Investigate fire-and-forget mode for async flush #531

adammoody · 2023-02-14T22:37:45Z

After an async flush has started, an application must make another SCR call to finalize that flush. Even after the async flush has copied all files, the output set is not valid until it has been finalized. The calls that finalize async flushes are: SCR_Start_output, SCR_Complete_output, and SCR_Finalize.

If an application using async flush writes checkpoints infrequently, there is a real chance that a failure occurs after all files have been copied but before the flush has been finalized. In this case, SCR will roll back to an earlier checkpoint when restarting the application. This is a shame, since the hard work of copying all of the files is already done.
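To make the rollback concrete, here is a toy model of the situation. It is not SCR code: the `Checkpoint` record, its fields, and `pick_restart` are hypothetical names used only for illustration. It assumes restart selects the newest checkpoint that has been finalized.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    # Hypothetical record; these fields are not SCR's actual metadata.
    ckpt_id: int
    files_copied: bool  # async flush finished copying every file
    finalized: bool     # a later SCR call marked the flush complete


def pick_restart(checkpoints):
    """Return the newest finalized checkpoint, or None if there is none."""
    valid = [c for c in checkpoints if c.finalized]
    return max(valid, key=lambda c: c.ckpt_id, default=None)


# Checkpoint 2 was fully copied, but the job failed before it was finalized.
history = [
    Checkpoint(ckpt_id=1, files_copied=True, finalized=True),
    Checkpoint(ckpt_id=2, files_copied=True, finalized=False),
]
chosen = pick_restart(history)
```

In this model `chosen` is checkpoint 1, even though every file of checkpoint 2 is already on the file system.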

It would be nice to extend SCR_Init so that SCR can detect that an async flush has finished copying but has not yet been marked as complete. To do this, we could use the file size of each file (if we trust POSIX semantics), or we could have each rank write an additional "done" flag to the file system. On restart, SCR_Init could look for these markers and update the status of the checkpoint if it finds that all files have been successfully copied. Something similar could be added to scavenge.
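A rough sketch of the two detection options, written in Python for brevity rather than as SCR code. Everything here is an assumption for illustration: the function names, the `expected_sizes` mapping (relative path to byte count), and the `rank_<N>.done` marker naming are not part of SCR.

```python
import os


def sizes_match(dataset_dir, expected_sizes):
    """Option 1: every expected file exists and has its expected byte size.

    expected_sizes maps a path relative to dataset_dir to a size in bytes.
    """
    for rel_path, size in expected_sizes.items():
        try:
            actual = os.stat(os.path.join(dataset_dir, rel_path)).st_size
        except FileNotFoundError:
            return False
        if actual != size:
            return False
    return True


def done_markers_present(dataset_dir, num_ranks):
    """Option 2: every rank left a 'done' marker (hypothetical naming)."""
    return all(
        os.path.exists(os.path.join(dataset_dir, f"rank_{rank}.done"))
        for rank in range(num_ranks)
    )
```

The size check needs no extra files but is only as trustworthy as the file system's size reporting. The marker check costs one extra small file per rank, written after that rank's copy has finished.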

In the meantime, it could be useful to add checks to calls like SCR_Need_checkpoint and SCR_Should_exit, which an application may call more frequently. In that case, one might need to configure how often SCR checks, since polling for completion may be expensive on some systems. For example, if time steps are short compared to the polling cost, we would not want to poll after every time step.
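One possible shape for the configurable check interval, again as a Python sketch rather than SCR code. `ThrottledPoller`, `poll_fn`, and `min_interval_secs` are hypothetical names: `poll_fn` stands in for whatever expensive test reports that the async flush is complete.

```python
import time


class ThrottledPoller:
    """Call an expensive completion check at most once per interval."""

    def __init__(self, poll_fn, min_interval_secs, clock=time.monotonic):
        self.poll_fn = poll_fn
        self.min_interval_secs = min_interval_secs
        self.clock = clock
        self.last_poll = None  # time of the most recent real poll
        self.done = False

    def check(self):
        """Return True once the flush is known complete; cheap when throttled."""
        if self.done:
            return True
        now = self.clock()
        if self.last_poll is not None and now - self.last_poll < self.min_interval_secs:
            return False  # too soon: skip the expensive poll
        self.last_poll = now
        self.done = bool(self.poll_fn())
        return self.done
```

An application that calls `check()` after every short time step then pays the polling cost at most once per `min_interval_secs`, and nothing further once completion has been observed.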

adammoody mentioned this issue Feb 14, 2023: check async flush progress during need_checkpoint and should_exit #532 (Open)
