[Doc]: xCDAT Practical Parallel Notebook #663

Open
pochedls opened this issue Jun 8, 2024 · 0 comments
Labels
type: docs Updates to documentation

Comments

pochedls (Collaborator) commented Jun 8, 2024

Describe your documentation update

This issue arises from the existing parallel computing guide, which is meant to provide general guidance on parallel computing with dask/xarray/xcdat.

It would be helpful to create some specific documentation on how to parallelize in the xcdat context. The idea would be to show more xcdat-oriented practical examples (without the complications of the dask cluster stuff). I imagine it would do things like this:

  • Download a dataset that is a few GBs to disk (e.g., a piControl file via wget)
  • Show how chunking in time-versus-space affects performance for a given operation (e.g., spatial averaging probably needs time-chunks, but temporal averaging might do fine with space chunks?)
  • Walk through how you might decide on a chunk size (e.g., this is a 4 GB dataset with 100 years or 1200 time points, so breaking it into decades [120 months] would give me pretty manageable 400 MB chunks to work on with 5 workers).
  • Maybe show a dask.delayed (or joblib) example, e.g.,
    • First download a number of netCDF files (e.g., a CMIP historical tas simulation broken into ~12 files)
    • Do a glob to get the file list
    • Create a function that opens a file, computes the spatial average, and returns it
    • Run results = dask.delayed(...)
    • Use xr.concat to combine the results into one dataset
    • Compare dask.delayed to serial performance
  • Talk about some xCDAT-specific parallelization considerations (e.g., the FAQs in the existing parallel notebook)
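
For the chunk-size bullet above, the back-of-the-envelope arithmetic could be sketched roughly like this. It is only a sketch, not xcdat code: the helper name `plan_time_chunks` and its parameters are hypothetical, and it assumes the file's bytes are spread evenly along the time axis (so it ignores compression, coordinate variables, and any intermediate arrays a computation creates).

```python
import math


def plan_time_chunks(dataset_bytes, n_time, steps_per_chunk):
    """Estimate the chunk count and chunk size when chunking along time only.

    Hypothetical helper: assumes every time step takes the same number of bytes.
    """
    n_chunks = math.ceil(n_time / steps_per_chunk)
    chunk_bytes = dataset_bytes * steps_per_chunk // n_time
    return n_chunks, chunk_bytes


# Numbers from the example: 4 GB, 100 years of monthly data, decade-long chunks.
dataset_bytes = 4_000_000_000
n_time = 1200
steps_per_chunk = 120
workers = 5

n_chunks, chunk_bytes = plan_time_chunks(dataset_bytes, n_time, steps_per_chunk)

# If each worker holds one chunk at a time, this is how many passes are needed
# and roughly how much input data is in memory at once (a lower bound).
rounds = math.ceil(n_chunks / workers)
in_flight_bytes = workers * chunk_bytes

print(f"{n_chunks} chunks of ~{chunk_bytes / 1e6:.0f} MB each")
print(f"{rounds} rounds on {workers} workers, ~{in_flight_bytes / 1e9:.0f} GB in flight")
```

The chunk length that comes out of this (120 time steps here) is what would then be handed to the open call, e.g. `chunks={"time": 120}`, assuming the open function forwards that keyword to xarray.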

pochedls added the type: docs label on Jun 8, 2024