add partition generation, modify find commands to output run info correctly #155
Conversation
Point-by-point thoughts:

@ZacharyWills please take a look.
The partitioning isn't correct for general parallel processing. What is the intent of this partitioning scheme?
@JoshCu, Ben has made a couple of changes to the latest Dockerfile, so please merge those changes into this PR. We will test this PR after that.
After further testing and reviewing comments:
Force-pushed from 6c4709e to 51015db.
```bash
procs=2  # Temporary fixed value
# One forcing CSV per catchment; quote the glob so the shell doesn't expand it first.
num_catchments=$(find forcings -name "*.csv" | wc -l)
# Never request more processes than there are catchments.
if [ "$num_catchments" -lt "$procs" ]; then
    procs=$num_catchments
fi
```
Unless you are running on a few hundred to a thousand cores, I suspect this isn't going to give any significant benefit, and I would just default to running the serial code or setting procs=1. As is, this will likely underperform in that case.
The version of ngen in NGIAB reads the catchment configs in serial and is completely CPU bound. When I was testing 6600 catchments, the speedup on the config-reading section of execution was near linear. It's not a huge portion of the execution, but enough that a two-month, 673-catchment run using cfe and noah-owp-m took 3m15s serially and 52s at 19x parallel. I originally looked into it because config reading was taking around 20 minutes for a 6600-catchment run, which dropped to ~15 seconds running in parallel on 56 cores on one machine.
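For anyone skimming why this scales: each catchment's config is independent, so when the run is split across N ranks, each rank only parses its own partition's configs. A minimal sketch of that pattern (illustrative only; ngen's reader is C++, and the `config/cat-*.json` layout here is a hypothetical stand-in):

```python
# Illustrative sketch: parsing per-catchment configs is embarrassingly parallel,
# since no file depends on any other. With N workers, each parses ~1/N of them.
from multiprocessing import Pool
from pathlib import Path
import json

def parse_config(path: Path) -> dict:
    # CPU-bound parse of one catchment's config; no shared state between files.
    return json.loads(path.read_text())

if __name__ == "__main__":
    paths = sorted(Path("config").glob("cat-*.json"))  # hypothetical layout
    with Pool() as pool:  # defaults to os.cpu_count() workers
        configs = pool.map(parse_config, paths)
    print(f"parsed {len(configs)} catchment configs")
```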
Most of my testing has been on either a 128 GB, 56-core or a 64 GB, 20-core machine, though. I'm not sure what the average laptop at DevCon will look like, so I don't know if the performance gain I see will translate.
(@hellkite500 if you want to re-approve, we don't have to bypass branch protections. 🙂)
I was wondering if we should move the partitioning system into a separate development line to allow for handling the DMOD/datastream-type partitioning in a different way from what might be optimal for desktop-based small basin experiments. (Not recommending this urgently, though.)
I guess I could have clarified a bit, haha. The number of processors and the scaling is definitely a function of the domain size. Thinking about this more given the context provided, I actually think this should just be a user argument with a sensible default (probably nprocs?), and maybe a user message with some guidance. That message could try to count the number of catchments and give some hints about picking a good value. Having the user provide this info would be more useful in general if the common use cases range from a few cores up to 50-100.
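A minimal sketch of what that could look like, assuming a --procs flag and the forcings/*.csv layout from the snippet above (the flag name and paths are illustrative, not final):

```python
# Hypothetical sketch of the suggested user argument plus a guidance message.
import argparse
import os
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--procs", type=int, default=os.cpu_count(),
                    help="processes for the ngen run (default: machine core count)")
args = parser.parse_args()

# One forcing CSV per catchment, mirroring the shell snippet above.
num_catchments = len(list(Path("forcings").glob("*.csv")))
procs = min(args.procs, max(num_catchments, 1))  # never more procs than catchments
if procs != args.procs:
    print(f"note: only {num_catchments} catchments found; using --procs {procs}")
```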
> a sensible default

I think that's the intent of what's here -- limiting nprocs to the number of catchments as a maximum feels like a reasonable first attempt at a default (and it avoids crashing/hanging ngen). Is it safe to say there is never a case where someone would want more processors than there can be partitions in the network? Probably not -- someday there could be ensemble or overlay models.

We can work on additional customizations/user input in future PRs.
Merging these changes.
Changes
Functional
- adds a nexus-first partition generator Python script (see the sketch after this list)
- makes automatic file finding generic: *.gpkg instead of datastream.gpkg
- increases output file cleaning depth to 2 so the output/ngen/ and output/troute/ subfolders get cleaned too
- sets procs to the number of partitions: min(number_of_catchments, core_count)
- now retains /var, /opt, and /tmp in NGIAB; image size is unchanged, but dnf packages can now be installed for debugging
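A rough sketch of the nexus-first idea, not the script's actual code: it assumes you already have a catchment-to-downstream-nexus mapping (e.g. read from the geopackage's network layer) and emits a partition list shaped like an ngen partition file, with remote connections left empty for illustration:

```python
# Hypothetical nexus-first partitioner: keep each nexus and all of its
# contributing catchments in one partition, then spread whole nexus groups
# round-robin across the requested number of partitions.
from collections import defaultdict
import json

def partition_by_nexus(cat_to_nex, n_parts):
    by_nexus = defaultdict(list)
    for cat, nex in cat_to_nex.items():
        by_nexus[nex].append(cat)
    parts = [{"id": i, "cat-ids": [], "nex-ids": [], "remote-connections": []}
             for i in range(n_parts)]
    for i, (nex, cats) in enumerate(sorted(by_nexus.items())):
        parts[i % n_parts]["nex-ids"].append(nex)
        parts[i % n_parts]["cat-ids"].extend(cats)
    return parts

if __name__ == "__main__":
    # Toy mapping; in practice this would come from the *.gpkg network layer.
    mapping = {"cat-1": "nex-10", "cat-2": "nex-10", "cat-3": "nex-11"}
    print(json.dumps({"partitions": partition_by_nexus(mapping, 2)}, indent=2))
```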
QoL
Note
- The changes to partition generation and parallel runs only apply if the /ngen/ngen/data/.partition_by_nexus file is present, to reduce unexpected behavior
- Auto mode is unchanged
- Serial mode is unchanged
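A minimal sketch of that opt-in gate, assuming the flag file only needs to exist (its contents are ignored here):

```python
# Sketch of the opt-in behavior described in the note: parallel partitioning
# only engages when the flag file is present; otherwise nothing changes.
from pathlib import Path

FLAG = Path("/ngen/ngen/data/.partition_by_nexus")

def run_mode(num_catchments: int) -> str:
    if FLAG.exists() and num_catchments > 1:
        return "parallel"  # generate partitions and launch the MPI run
    return "serial"        # existing, unchanged behavior

print(run_mode(673))
```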
Matching data package generation branch
- latest data preprocessor / map tool

It still has a bug or two, but they won't take me long to fix, and this is the shortest path to something user-friendly and compliant with the NGIAB data package layout.