Divide and conquer with workaround for batch submission #717

scottkosty · 2024-03-06T18:25:29Z

scottkosty
Mar 6, 2024

From what I understand, when working with computer clusters, the first approach is to try to use batchtools via future.batchtools. The system my coauthor is working with is HTCondor, which unfortunately isn't currently supported (see HenrikBengtsson/future.batchtools#29 and mllg/batchtools#68).

I don't have access to HTCondor myself, so I won't be able to work on adding support. I thus find myself trying to come up with a workaround. There are a few details specific to our workflow and use of future:

My code does use the RNG, so I set the future.seed argument of future_lapply() to TRUE.
My code does not need to share data, and the iterations are independent. These are Monte Carlo simulations, so the main function just needs to call a function, dgp(), to generate the simulated data set and then a function, statistics(), to evaluate the generated data set. dgp() uses randomness, and statistics() may use randomness as well (e.g., to bootstrap).
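To make that structure concrete, here is a minimal sketch of a single replication. The bodies of dgp() and statistics() are hypothetical stand-ins (a normal sample and a bootstrap standard error), not the actual montetools code, and one_sim() is a name made up for illustration.

```r
## Hypothetical stand-in: generate one simulated data set
dgp <- function(n = 50) rnorm(n)

## Hypothetical stand-in: evaluate the data set, using randomness for a bootstrap
statistics <- function(dat, n_boot = 200) {
  boot_means <- replicate(n_boot, mean(sample(dat, replace = TRUE)))
  c(estimate = mean(dat), boot_se = sd(boot_means))
}

## One independent replication; i is the replication index (unused here)
one_sim <- function(i) statistics(dgp())

set.seed(1)
res <- one_sim(1)
```

Both steps draw random numbers, so the RNG state at the start of each replication determines the whole result.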

From what I understand (I'm new to computing clusters and HPC), it's common to request many jobs with 1 core each from the batch system. Various sources recommend this (e.g., https://jepusto.github.io/Designing-Simulations-in-R/parallel-processing.html). The challenge, then, is how to set up the seed on each instance.

The goal is to achieve numeric reproducibility, using future, no matter how the code is run.

I believe I need to generate the seeds, using future, and each instance would fast-forward to the appropriate seed. For this, perhaps I can use future:::next_random_seed()? For example, the instance might know "OK I'm working on the 57th simulation" and it forwards the seed 56 times and then runs the code. I'm concerned that next_random_seed is not exported, so I should not be relying on it. Is there a better way?
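One way to picture the fast-forward idea without relying on unexported functions is base R's parallel::nextRNGStream(), which advances a L'Ecuyer-CMRG seed to the next independent stream. The sketch below is only an illustration under assumptions: seed_for_sim() and run_one() are hypothetical helpers, the master seed is arbitrary, and the seeds it produces are not claimed to match the ones future generates internally.

```r
library(parallel)  # ships with R; provides nextRNGStream()

## Hypothetical helper: L'Ecuyer-CMRG seed for the k-th simulation.
## Note: this reseeds the global RNG as a side effect.
seed_for_sim <- function(k, master_seed = 0xBEEF) {
  old_kind <- RNGkind("L'Ecuyer-CMRG")  # switch generator, remember previous kinds
  on.exit(RNGkind(old_kind[1], old_kind[2], old_kind[3]), add = TRUE)
  set.seed(master_seed)
  s <- .Random.seed
  ## Advance k - 1 streams, e.g. 56 times for the 57th simulation
  for (i in seq_len(k - 1L)) s <- nextRNGStream(s)
  s
}

## Hypothetical worker: install the seed for simulation k, then draw
run_one <- function(k) {
  assign(".Random.seed", seed_for_sim(k), envir = globalenv())
  rnorm(1)
}

a <- run_one(57)
b <- run_one(57)
```

Because each instance derives its seed from the simulation index alone, the draw for simulation 57 does not depend on which other simulations ran in the same job.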

After figuring out the seed issue, I just need to retrieve the saved .Rds files, import them, and merge them together. In theory, this should lead to numeric reproducibility.
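The retrieve-and-merge step could look roughly like the following. The directory, the sim_<k>.rds naming scheme, and the one-row data frames are all assumptions made up for this sketch; the loop that writes files merely stands in for what the cluster jobs would have produced.

```r
## Hypothetical layout: job k wrote its result to <out_dir>/sim_<k>.rds
out_dir <- file.path(tempdir(), "sim_results")
dir.create(out_dir, showWarnings = FALSE)

## Stand-in for the files written by the batch jobs (deliberately out of order)
for (k in c(3, 1, 2)) {
  saveRDS(data.frame(sim = k, estimate = k / 10),
          file.path(out_dir, sprintf("sim_%04d.rds", k)))
}

## Zero-padded names make alphabetical order equal simulation order
files <- sort(list.files(out_dir, pattern = "^sim_[0-9]+\\.rds$", full.names = TRUE))

## Import every file and stack the rows into one data frame
results <- do.call(rbind, lapply(files, readRDS))
```

Sorting the file names before merging keeps the combined result independent of the order in which jobs happened to finish.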

Has anyone done something like this? Is the strategy at least reasonable (given the constraints)?

I'm working on a general Monte Carlo R package, which relies on future. The above details are sufficient, I believe, but for completeness the full package is here: montetools.

scottkosty · 2024-04-17T14:46:14Z

scottkosty
Apr 17, 2024
Author

@HenrikBengtsson No problem if you don't have time to look at the details of this, but I'd be curious if the approach I suggest at least sounds reasonable to you.

HenrikBengtsson · 2024-04-17T15:08:02Z

HenrikBengtsson
Apr 17, 2024
Maintainer

Using:

set.seed(0xBEEF)
y <- future_lapply(X, FUN = my_fcn, future.seed = TRUE)

should be 100% reproducible, i.e. no need to orchestrate the initial random seeds (.Random.seed) yourself.
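A quick way to check this claim locally is to run the same call under two different backends and compare. In this sketch my_fcn() and X are placeholders, and multisession with two workers is just an arbitrary second backend chosen for illustration.

```r
library(future.apply)  # future_lapply()
library(future)        # plan()

## Placeholder task that consumes random numbers
my_fcn <- function(x) rnorm(1, mean = x)
X <- 1:5

## Run once sequentially
plan(sequential)
set.seed(0xBEEF)
y1 <- future_lapply(X, FUN = my_fcn, future.seed = TRUE)

## Run again on two background R sessions
plan(multisession, workers = 2)
set.seed(0xBEEF)
y2 <- future_lapply(X, FUN = my_fcn, future.seed = TRUE)
plan(sequential)  # shut the workers down again

same <- identical(y1, y2)
```

If same is TRUE, the results do not depend on how the work was distributed.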

If you're concerned about some tasks failing and not wanting to have to rerun everything from scratch, you can use memoization for my_fcn(). The gist:

my_fcn <- function(x) {
  file <- x_to_rds(x)

  ## Already processed?
  if (already_exists(file)) return(file)

  ## Otherwise, run the analysis
  file <- full_run(x)

  file
}

Yes, this would be a bit wasteful on the job scheduler, because you're requesting jobs for steps that will be skipped. Right now, we don't have a mechanism to avoid this. If we could run if (already_exists(file)) return(file) before launching the job, that would solve it, but such a feature is currently only on the drawing board (and has been for several years).
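For a concrete version of the gist above, the placeholders x_to_rds(), already_exists(), and full_run() could be filled in with base R along these lines. The output directory, the file naming, and the "analysis" itself are assumptions for this sketch; writing to a temporary file and renaming is one way to avoid leaving a half-written result behind if a job is killed.

```r
## Hypothetical output location for per-task results
out_dir <- file.path(tempdir(), "memo")
dir.create(out_dir, showWarnings = FALSE)

my_fcn <- function(x) {
  file <- file.path(out_dir, sprintf("result_%04d.rds", x))

  ## Already processed? Then skip the expensive part
  if (file.exists(file)) return(file)

  ## Otherwise, run the analysis (placeholder computation)
  result <- mean(rnorm(100, mean = x))

  ## Write to a temporary name first, then rename into place
  tmp <- paste0(file, ".tmp")
  saveRDS(result, tmp)
  file.rename(tmp, file)

  file
}

f1 <- my_fcn(1)
v1 <- readRDS(f1)
f2 <- my_fcn(1)  # second call finds the file and returns immediately
```

The function returns the file path rather than the value, so the caller decides when to read results back in.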

If you really want to pre-generate your own list of .Random.seed values, you could do something like:

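library(future)        # plan()
library(future.apply)  # future_lapply()

old_plan <- plan(sequential)
set.seed(0xBEEF)
## Record the .Random.seed in effect for each element of X
seeds <- future_lapply(X, FUN = function(x) get(".Random.seed", envir = globalenv()), future.seed = TRUE)
plan(old_plan)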

You can then use these seeds in your calls as:

y <- future_lapply(X, FUN = my_fcn, future.seed = seeds)

and if you only want to process a subset of X, you can use:

idxs <- c(1,3,8)
y[idxs] <- future_lapply(X[idxs], FUN = my_fcn, future.seed = seeds[idxs])
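Putting the pieces together, the following sketch checks that rerunning a subset with its pre-generated seeds gives the same values as the full run. X and my_fcn() are placeholders; only the future_lapply() calls come from the snippets above.

```r
library(future)        # plan()
library(future.apply)  # future_lapply()

plan(sequential)
X <- 1:10
my_fcn <- function(x) rnorm(1, mean = x)  # placeholder task

## Pre-generate one seed per element of X
set.seed(0xBEEF)
seeds <- future_lapply(X, FUN = function(x) get(".Random.seed", envir = globalenv()),
                       future.seed = TRUE)

## Full run with explicit seeds
y_full <- future_lapply(X, FUN = my_fcn, future.seed = seeds)

## Rerun only elements 1, 3 and 8 with their own seeds
idxs <- c(1, 3, 8)
y_sub <- future_lapply(X[idxs], FUN = my_fcn, future.seed = seeds[idxs])

same <- identical(y_full[idxs], y_sub)
```

In a one-job-per-index setup, the seeds list could be saved once with saveRDS() and each job would load it and pass seeds[k] along with X[k]; that arrangement is a suggestion based on the snippets above, not something tested on HTCondor.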

1 reply

scottkosty Apr 17, 2024
Author

This looks very useful! I will study it. Thanks!
