Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add partition generation, modify find commands to output run info correctly #155

Merged
merged 6 commits into from
May 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docker/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ RUN echo "/dmod/shared_libs/" >> /etc/ld.so.conf.d/ngen.conf && ldconfig -v
COPY HelloNGEN.sh /ngen/HelloNGEN.sh
RUN chown -R root /dmod /ngen /root
RUN chmod a+x /dmod/bin/* /ngen/HelloNGEN.sh /root
RUN rm -rf tmp var opt
JoshCu marked this conversation as resolved.
Show resolved Hide resolved

WORKDIR /ngen

ENTRYPOINT ["./HelloNGEN.sh"]
Expand Down
13 changes: 7 additions & 6 deletions docker/HelloNGEN.sh
Original file line number Diff line number Diff line change
Expand Up @@ -28,8 +28,8 @@ auto_select_file() {
}

# Finding files
HYDRO_FABRIC_CATCHMENTS=$(find . -name "*datastream*.gpkg")
HYDRO_FABRIC_NEXUS=$(find . -name "*datastream*.gpkg")
HYDRO_FABRIC_CATCHMENTS=$(find . -name "*.gpkg")
HYDRO_FABRIC_NEXUS=$(find . -name "*.gpkg")
NGEN_REALIZATIONS=$(find . -name "*realization*.json")

# Auto-selecting files if only one is found
Expand Down Expand Up @@ -71,8 +71,7 @@ if [ "$2" == "auto" ]
echo "Run completed successfully, exiting, have a nice day!"
exit 0
else
echo "Entering Interactive Mode"
continue
echo "Entering Interactive Mode"
fi

echo -e "${YELLOW}Select an option (type a number): ${RESET}"
Expand All @@ -89,15 +88,17 @@ select option in "${options[@]}"; do

if [ "$option" == "Run NextGen model framework in parallel mode" ]; then
procs=$(nproc)
procs=2 # Temporary fixed value
num_catchments=$(find forcings -name *.csv | wc -l)
if [ $num_catchments -lt $procs ]; then
procs=$num_catchments
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you are running on a few hundred to thousand cores, I would suspect this isn't going to actually give any significant benefit and I would just default to running the serial code or setting procs=1. This will likely actually under perform in this case as is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version of ngen in NGIAB reads the catchment config in serial and is completely cpu bound, when I was testing 6600 catchments the speedup on the config reading section of execution was near linear. It's not huge portion of the execution, but enough that a two month 673 catchment run using cfe and noah-owp-m took 3m15 serially and 52s in 19x parallel. I originally looked into it because it was taking around 20 minutes to read config for a 6600 catchment run which dropped to ~15 seconds running in parallel on 56 cores on one machine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of my testing has been either on a 128gb 56 core or 64gb 20 core machine though. I'm not sure what the average laptop at devcon will look like, so I don't know if the performance gain I see will translate

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

( @hellkite500 if you want to re-approve, we don't have to bypass branch protections. 🙂 )

I was wondering if we should move the partitioning system into a separate development line to allow for handling the DMOD/datastream-type partitioning in a different way from what might be optimal for desktop-based small basin experiments. (Not recommending this urgently, though.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could have clarified a bit haha, the number of processors and the scaling is definitely a function of the domain size. As I think about this a bit more given the context provided I actually think this should just be a user argument with a sensible default (probably nprocs?) and maybe a user message with some guidance? That message could try to count the number of catchments and give some hints about picking a good value. It would be more useful in general to have the user provide info if we have scales of a few cores to 50-100 as the common use cases.

Copy link
Member

@jameshalgren jameshalgren May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a sensible default

I think that's the intent of what's here -- limiting the nprocs to the number of catchments as a maximum feels like a reasonable first attempt at a default (and it avoids crashing/hanging ngen). Is it safe to say there is never a case where someone would want more processors than there can be partitions in the network? Probably not -- someday, there could be ensemble or overlay models.

We can work on additional customizations/user input in future PRs.

fi
generate_partition "$n1" "$n2" "$procs"
run_command="mpirun -n $procs /dmod/bin/ngen-parallel $n1 all $n2 all $n3 $(pwd)/partitions_$procs.json"
else
run_command="/dmod/bin/ngen-serial $n1 all $n2 all $n3"
fi

echo -e "${YELLOW}Your NGEN run command is $run_command${RESET}"
sleep 3
break
;;
"Run Bash shell")
Expand Down
22 changes: 11 additions & 11 deletions guide.sh
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ validate_directory() {
local color=$3

if [ -d "$dir" ]; then
local count=$(ls "$dir" | wc -l)
local count=$(find $dir -type f | uniq | wc -l)
echo -e "${color}${name}${Color_Off} exists. $count ${name} found."
else
echo -e "Error: Directory $dir does not exist."
Expand All @@ -81,7 +81,7 @@ cleanup_folder() {
local folder_name="$3"

# Construct the find command
local find_cmd="find \"$folder_path\" -maxdepth 1 -type f \( $file_types \)"
local find_cmd="find \"$folder_path\" -maxdepth 2 -type f \( $file_types \)"

# Execute the find command and count the results
local file_count=$(eval "$find_cmd" 2> /dev/null | wc -l)
Expand All @@ -105,7 +105,7 @@ choose_option() {
echo "Cleaning folder for fresh run"

# Construct the find delete command
local find_delete_cmd="find \"$folder_path\" -maxdepth 1 -type f \( $file_types \) -delete"
local find_delete_cmd="find \"$folder_path\" -maxdepth 2 -type f \( $file_types \) -delete"

# Execute the find delete command
eval "$find_delete_cmd"
Expand All @@ -129,25 +129,26 @@ choose_option() {
# Cleanup Process for Outputs Folder
cleanup_folder "$HOST_DATA_PATH/outputs/" "-name '*' " "Outputs"

# Cleanup Process for ngen/data Folder
cleanup_folder "$HOST_DATA_PATH/" "-name '*.parquet' -o -name '*.csv' -o -name '*.cn'" "ngen/data"
# Cleanup Process for restarts Folder
cleanup_folder "$HOST_DATA_PATH/restarts/" "-name '*' " "Restarts"



# File discovery
echo -e "\nLooking in the provided directory gives us:"
find_files() {
local path=$1
local name=$2
local color=$3
local regex=$3
local color=$4

local files=$(find "$path" -iname "*$name*.*")
local files=$(find "$path" -iname "$regex")
echo -e "${color}Found these $name files:${Color_Off}"
echo "$files" || echo "No $name files found."
}

find_files "$HOST_DATA_PATH" "datastream" "$UGreen"
find_files "$HOST_DATA_PATH" "datastream" "$UGreen"
find_files "$HOST_DATA_PATH" "realization" "$UGreen"
find_files "$HOST_DATA_PATH" "hydrofabric" "*.gpkg" "$UGreen"
find_files "$HOST_DATA_PATH" "realization" "realization.json" "$UGreen"

# Detect Arch and Docker
echo -e "\nDetected ISA = $(uname -a)"
Expand All @@ -163,7 +164,6 @@ else
IMAGE_NAME=awiciroh/ciroh-ngen-image:latest-x86
fi


# Model run options
echo -e "${UYellow}Select an option (type a number): ${Color_Off}"
options=("Run NextGen using existing local docker image" "Run NextGen after updating to latest docker image" "Exit")
Expand Down
Loading