Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1 #1138

Open
mkavulich opened this issue Oct 14, 2024 · 3 comments · May be fixed by #1136
Open

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1 #1138

mkavulich opened this issue Oct 14, 2024 · 3 comments · May be fixed by #1136

Comments

@mkavulich
Copy link
Collaborator

mkavulich commented Oct 14, 2024

Expected behavior

SRW contains the config variable OMP_NUM_THREADS_RUN_FCST which should activate threading when set greater than 1 (default in SRW is 2, so should be every run unless specified otherwise).

Current behavior

Threading is not activated for the weather model, no matter what OMP_NUM_THREADS_RUN_FCST is set to. You can observe this by opening the run_fcst log file for any SRW run; here is just one example from the WE2E test grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0:

* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
     PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
     STARTING DATE-TIME  OCT 14,2024  17:32:13.156  288  MON   2460598
...
...
atmosphere_init: current_time_seconds =       0.0
Using n_split from the namelist: 005
 Off center implicit scheme param=   1.000000
  p_fac=  0.1000000
Using n_sponge : 024
Using non_ortho :       T
 Starting PEs :           25
 Starting Threads :            1
 Cubed-sphere, single face domain decomposition
whalo =    3, ehalo =    3, shalo =    3, nhalo =    3
...

Note the line "Starting Threads : 1", which indicates only a single thread is being used, despite us requesting 2.

Machines affected

Confirmed on Orion, Hera, and Derecho, but this should affect all machines.

Steps To Reproduce

  1. Run any SRW test that contains a run_fcst task
  2. View the run_fcst log file, and confirm that threading is not being used, despite OMP_NUM_THREADS_RUN_FCST=2 being the default

Detailed Description of Fix

I honestly don't know how to fix this. I am confused by the latest instructions on how to set OMP_NUM_THREADS for the weather model. Currently, the app sets the threading value in the model_configure file (atmos_nthreads: 1, but this variable has been deprecated for years (as a side note, so has PE_MEMBER01, which is also still being set) so the setting has no effect. However, if I follow the instructions in the weather model users guide (setting ATM_omp_num_threads: 2 in ufs.configure), I get behavior that I don't understand from the model: it seems to be using way fewer MPI tasks than expected, and so fails with a corresponding error message:

* . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
     PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
     STARTING DATE-TIME  OCT 14,2024  17:23:00.856  288  MON   2460598
...
...
atmosphere_init: current_time_seconds =       0.0
Using n_split from the namelist: 005
 Off center implicit scheme param=   1.000000
  p_fac=  0.1000000
Using n_sponge : 024
Using non_ortho :       T
 Starting PEs :           11
 Starting Threads :            2

FATAL from PE     3: mpp_domains_define.inc: not all the pe_end are in the pelist


FATAL from PE     1: mpp_domains_define.inc: not all the pe_end are in the pelist


FATAL from PE     2: mpp_domains_define.inc: not all the pe_end are in the pelist


FATAL from PE     4: mpp_domains_define.inc: not all the pe_end are in the pelist
...
...

Note the entry Starting PEs : 11, which doesn't make any sense to me with the settings provided (identical to the first test except for setting ATM_omp_num_threads: 2 in ufs.configure) and with 30 cores allocated.

Clearly there is something I am missing, but I don't have the time to dig deeper into this. However, this is something that should be solved sooner rather than later, as it likely represents a huge waste of resources due to not using threads that we expect to use (and likely impacts any external groups using recent SRW code to run the weather model).

Additional Information (optional)

I believe this bug was introduced in #1050, but I can not confirm that that is the case.

@MichaelLueken
Copy link
Collaborator

@mkavulich -

After seeing this issue, I checked the results of my fundamental tests and saw the same behavior, atmos_nthreads: 2 is in the settings, but the model only uses a single thread, Starting Threads : 1. Changing to use the pelist and atm_omp_num_threads, I encountered the same FATAL from PE 3: mpp_domains_define.inc: not all the pe_end are in the pelist error messages that you had. Looking in weather model documentation, in order to use threading, you need to set the total number of tasks to OMP_NUM_THREADS_RUN_FCST * (LAYOUT_Y * LAYOUT_X + WRTCMP_write_groups * WRTCMP_write_tasks_per_group) (in other words, you need to multiply the number of tasks by the number of threads). With this change, I was able to successfully run the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 WE2E test with 2 threads.

The erroneous single thread test ran with the following settings:

 Starting PEs :           10
 Starting Threads :            1

while the corrected 2 thread test ran with:

 Starting PEs :           10
 Starting Threads :            2

I will go ahead and get this work merged into my current feature/hash_update branch so that it can be included in PR #1136.

@MichaelLueken
Copy link
Collaborator

@mkavulich -

The necessary changes have been added to PR #1136 at hash 77e2452.

@MichaelLueken
Copy link
Collaborator

@mkavulich -

Working with Rusty and Dusan, I do have an answer to the reason why the starting PEs section isn't using your prescribed 30 cores. Currently, the SRW is using ESMF managed threading. While ESMF managed threading is turned on, the starting PEs will be set to the product of LAYOUT_X and LAYOUT_Y. The various mpp_domains_define.inc:-based error messages are due to the the value not being the product of LAYOUT_X and LAYOUT_Y.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants