add partition generation, modify find commands to output run info correctly #155
Conversation
Point-by-point thoughts:

@ZacharyWills please take a look.
The partitioning isn't correct for general parallel processing. What is the intent of this partitioning scheme?
@JoshCu, Ben has made a couple of changes to the latest Dockerfile, so please merge those changes into this PR. We will test this PR after that.
After further testing and reviewing comments:
Force-pushed from 6c4709e to 51015db.
```bash
procs=2  # Temporary fixed value
# One forcing CSV per catchment; quote the glob so the shell doesn't expand it first.
num_catchments=$(find forcings -name "*.csv" | wc -l)
# Never request more processes than there are catchments.
if [ "$num_catchments" -lt "$procs" ]; then
    procs=$num_catchments
fi
```
Unless you are running on a few hundred to a thousand cores, I suspect this isn't going to give any significant benefit, and I would just default to running the serial code or setting procs=1. As is, this will likely underperform in that case.
The version of ngen in NGIAB reads the catchment configs in serial and is completely CPU bound. When I was testing 6600 catchments, the speedup on the config-reading section of execution was near linear. It's not a huge portion of the execution, but enough that a two-month, 673-catchment run using cfe and noah-owp-m took 3m15s serially and 52s at 19x parallel. I originally looked into it because config reading was taking around 20 minutes for a 6600-catchment run, which dropped to ~15 seconds running in parallel on 56 cores on one machine.
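For anyone skimming why this scales: each catchment's config is independent, so when the run is split across N ranks, each rank only parses its own partition's configs. A minimal sketch of that pattern (illustrative only; ngen's reader is C++, and the `config/cat-*.json` layout here is a hypothetical stand-in):

```python
# Illustrative sketch: parsing per-catchment configs is embarrassingly parallel,
# since no file depends on any other. With N workers, each parses ~1/N of them.
from multiprocessing import Pool
from pathlib import Path
import json

def parse_config(path: Path) -> dict:
    # CPU-bound parse of one catchment's config; no shared state between files.
    return json.loads(path.read_text())

if __name__ == "__main__":
    paths = sorted(Path("config").glob("cat-*.json"))  # hypothetical layout
    with Pool() as pool:  # defaults to os.cpu_count() workers
        configs = pool.map(parse_config, paths)
    print(f"parsed {len(configs)} catchment configs")
```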
Most of my testing has been on either a 128 GB, 56-core or a 64 GB, 20-core machine, though. I'm not sure what the average laptop at DevCon will look like, so I don't know if the performance gain I see will translate.
(@hellkite500 if you want to re-approve, we don't have to bypass branch protections. 🙂)
I was wondering if we should move the partitioning system into a separate development line to allow for handling the DMOD/datastream-type partitioning in a different way from what might be optimal for desktop-based small basin experiments. (Not recommending this urgently, though.)
I guess I could have clarified a bit, haha. The number of processors and the scaling is definitely a function of the domain size. Thinking about this more given the context provided, I actually think this should just be a user argument with a sensible default (probably nprocs?), and maybe a user message with some guidance. That message could try to count the number of catchments and give some hints about picking a good value. Having the user provide this info would be more useful in general if the common use cases range from a few cores up to 50-100.
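A minimal sketch of what that could look like, assuming a --procs flag and the forcings/*.csv layout from the snippet above (the flag name and paths are illustrative, not final):

```python
# Hypothetical sketch of the suggested user argument plus a guidance message.
import argparse
import os
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--procs", type=int, default=os.cpu_count(),
                    help="processes for the ngen run (default: machine core count)")
args = parser.parse_args()

# One forcing CSV per catchment, mirroring the shell snippet above.
num_catchments = len(list(Path("forcings").glob("*.csv")))
procs = min(args.procs, max(num_catchments, 1))  # never more procs than catchments
if procs != args.procs:
    print(f"note: only {num_catchments} catchments found; using --procs {procs}")
```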
> a sensible default

I think that's the intent of what's here -- limiting nprocs to the number of catchments as a maximum feels like a reasonable first attempt at a default (and it avoids crashing/hanging ngen). Is it safe to say there is never a case where someone would want more processors than there can be partitions in the network? Probably not -- someday there could be ensemble or overlay models.

We can work on additional customizations/user input in future PRs.
Merging these changes.
Changes
Functional
- adds a nexus-first partition generator Python script (see the sketch after this list)
- makes automatic file finding generic: *.gpkg instead of datastream.gpkg
- increases output file cleaning depth to 2 so the output/ngen/ and output/troute/ subfolders get cleaned too
- sets procs to the number of partitions: min(number_of_catchments, core_count)
- now retains /var, /opt, and /tmp in NGIAB; image size is unchanged, but dnf packages can now be installed for debugging
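A rough sketch of the nexus-first idea, not the script's actual code: it assumes you already have a catchment-to-downstream-nexus mapping (e.g. read from the geopackage's network layer) and emits a partition list shaped like an ngen partition file, with remote connections left empty for illustration:

```python
# Hypothetical nexus-first partitioner: keep each nexus and all of its
# contributing catchments in one partition, then spread whole nexus groups
# round-robin across the requested number of partitions.
from collections import defaultdict
import json

def partition_by_nexus(cat_to_nex, n_parts):
    by_nexus = defaultdict(list)
    for cat, nex in cat_to_nex.items():
        by_nexus[nex].append(cat)
    parts = [{"id": i, "cat-ids": [], "nex-ids": [], "remote-connections": []}
             for i in range(n_parts)]
    for i, (nex, cats) in enumerate(sorted(by_nexus.items())):
        parts[i % n_parts]["nex-ids"].append(nex)
        parts[i % n_parts]["cat-ids"].extend(cats)
    return parts

if __name__ == "__main__":
    # Toy mapping; in practice this would come from the *.gpkg network layer.
    mapping = {"cat-1": "nex-10", "cat-2": "nex-10", "cat-3": "nex-11"}
    print(json.dumps({"partitions": partition_by_nexus(mapping, 2)}, indent=2))
```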
QoL
Note
- The changes to partition generation and parallel runs only apply if the /ngen/ngen/data/.partition_by_nexus file is present, to reduce unexpected behavior
- Auto mode is unchanged
- Serial mode is unchanged
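A minimal sketch of that opt-in gate, assuming the flag file only needs to exist (its contents are ignored here):

```python
# Sketch of the opt-in behavior described in the note: parallel partitioning
# only engages when the flag file is present; otherwise nothing changes.
from pathlib import Path

FLAG = Path("/ngen/ngen/data/.partition_by_nexus")

def run_mode(num_catchments: int) -> str:
    if FLAG.exists() and num_catchments > 1:
        return "parallel"  # generate partitions and launch the MPI run
    return "serial"        # existing, unchanged behavior

print(run_mode(673))
```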
Matching data package generation branch
- latest data preprocessor / map tool

It still has a bug or two, but they won't take me long to fix, and this is the shortest path to something user-friendly and compliant with the NGIAB data package layout.