Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add partition generation, modify find commands to output run info correctly #155

Merged
merged 6 commits into from
May 13, 2024

Conversation

JoshCu
Copy link
Collaborator

@JoshCu JoshCu commented Apr 30, 2024

Changes

Functional

  • adds a nexus first partition generator python script

  • makes auto file finding generic *.gpkg instead of datastream.gpkg

  • increase output file cleaning depth to 2 so output/ngen/ output/troute/ subfolders get cleaned too

  • set procs to the number of partitions (min(number_of_nexuses, core_count)

  • set procs to the number of partitions (min(number_of_catchments, core_count)

  • now retains /var /opt /tmp in NGIAB Image size is unchanged but allows dnf packages to be installed for debug

QoL

  • removes sleep 3 before prompt to pipe to dev/null
  • update find commands to properly display number of files in config output forcings
  • updated auto file finder to only print geopackage once

Note

The changes to partition generation and parallel runs only apply if the /ngen/ngen/data/.partition_by_nexus file is present to reduce unexpected behavior
Auto mode is unchanged
Serial mode is unchanged

Matching Data package Generation branch

latest data preprocessor / map tool
It still has a bug or two in it, but it won't take me long to fix and is the shortest path to something user friendly and compliant with the NGIAB data package layout.

@JordanLaserGit
Copy link
Collaborator

Point by point thoughts:

  • adds a nexus first partition generator python script
    Seems reasonable in the case of very few catchments, @ZacharyWills @hellkite500 any reason why not to do this? Only thing I would advocate for is editing the partition generator itself, if there is an edge case that is not being handled well.
  • makes auto file finding generic *.gpkg instead of datastream.gpkg
    This makes good sense. Many tools may use NGIAB and we shouldn't impose anything datastream on other tools.
  • update find commands to properly display number of files in config output forcings
    Looks fine
  • increase output file cleaning depth to 2 so output/ngen/ output/troute/ subfolders get cleaned too
    Looks fine
  • set procs to the number of partitions (min(number_of_nexus, core_count)
    I think I need clarity here. I read the code as procs is set to $(nprocs) in parallel mode, but if .parition_by_nexus file is detected, nprocs is overridden based on the number of partitions. I'm a little unclear on how .parition_by_nexus is created. Is this from the preprocessor tool? If so this introduces a bit of a dependency. Not strongly opposed, but there may be a better way on how to go about this. Perhaps the partition generator itself should have this logic built into it? That way the procs variable can be set to min(paritions, core_count, requested_procs)

@arpita0911patel
Copy link
Member

@ZacharyWills please take alook.

Copy link
Collaborator

@hellkite500 hellkite500 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The partitioning isn't correct for general parallel processing. What is the intent of this partitioning scheme?

docker/partition_gen.py Outdated Show resolved Hide resolved
@arpita0911patel
Copy link
Member

@JoshCu , Ben has made couple of changes to latest DockerFile, so please merge those changes in this PR. We will test this PR after that.

@benlee0423
Copy link

@JoshCu 156 and 157 need to be merged.

@JoshCu
Copy link
Collaborator Author

JoshCu commented May 11, 2024

After further testing and reviewing comments:

@JoshCu JoshCu force-pushed the devcon_prep branch 3 times, most recently from 6c4709e to 51015db Compare May 11, 2024 00:38
procs=2 # Temporary fixed value
num_catchments=$(find forcings -name *.csv | wc -l)
if [ $num_catchments -lt $procs ]; then
procs=$num_catchments
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you are running on a few hundred to thousand cores, I would suspect this isn't going to actually give any significant benefit and I would just default to running the serial code or setting procs=1. This will likely actually under perform in this case as is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version of ngen in NGIAB reads the catchment config in serial and is completely cpu bound, when I was testing 6600 catchments the speedup on the config reading section of execution was near linear. It's not huge portion of the execution, but enough that a two month 673 catchment run using cfe and noah-owp-m took 3m15 serially and 52s in 19x parallel. I originally looked into it because it was taking around 20 minutes to read config for a 6600 catchment run which dropped to ~15 seconds running in parallel on 56 cores on one machine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my testing has been either on a 128gb 56 core or 64gb 20 core machine though. I'm not sure what the average laptop at devcon will look like, so I don't know if the performance gain I see will translate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

( @hellkite500 if you want to re-approve, we don't have to bypass branch protections. 🙂 )

I was wondering if we should move the partitioning system into a separate development line to allow for handling the DMOD/datastream-type partitioning in a different way from what might be optimal for desktop-based small basin experiments. (Not recommending this urgently, though.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could have clarified a bit haha, the number of processors and the scaling is definitely a function of the domain size. As I think about this a bit more given the context provided I actually think this should just be a user argument with a sensible default (probably nprocs?) and maybe a user message with some guidance? That message could try to count the number of catchments and give some hints about picking a good value. It would be more useful in general to have the user provide info if we have scales of a few cores to 50-100 as the common use cases.

Copy link
Member

@jameshalgren jameshalgren May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a sensible default

I think that's the intent of what's here -- limiting the nprocs to the number of catchments as a maximum feels like a reasonable first attempt at a default (and it avoids crashing/hanging ngen). Is it safe to say there is never a case where someone would want more processors than there can be partitions in the network? Probably not -- someday, there could be ensemble or overlay models.

We can work on additional customizations/user input in future PRs.

Copy link
Member

@arpita0911patel arpita0911patel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging these changes.

@arpita0911patel arpita0911patel merged commit 1e9a376 into CIROH-UA:main May 13, 2024
9 checks passed
@JoshCu JoshCu deleted the devcon_prep branch July 12, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants