Add PnetCDF Non-blocking APIs to Perform I/O #2058

Open: wants to merge 9 commits into base: develop

@yzanhua yzanhua commented Jun 4, 2024

TYPE: new feature

KEYWORDS: parallel I/O, PnetCDF, non-blocking APIs, requests aggregation

SOURCE: Zanhua Huang (Northwestern University), Wei-keng Liao (Northwestern University, @wkliao)

DESCRIPTION OF CHANGES:
Problem:
We found that using the PnetCDF non-blocking APIs can noticeably improve parallel I/O performance over the blocking APIs originally used in WRF. The performance of the PnetCDF non-blocking APIs with WRF is discussed in the paper "I/O in WRF: A Case Study in Modern Parallel I/O Techniques".

Solution:
We added a new I/O option to enable the PnetCDF non-blocking APIs. If users set enable_pnetcdf_bput to .true. in the namelist and also select PnetCDF as the I/O form for the history and/or restart file (io_form = 11), the PnetCDF non-blocking APIs will be used.

When the PnetCDF non-blocking APIs are enabled, writes of WRF variables are first buffered in memory and then flushed to the file at the end of each time step.

LIST OF MODIFIED FILES:

  1. Registry/registry.io_boilerplate
  2. external/io_pnetcdf/ext_pnc_put_dom_ti.code
  3. external/io_pnetcdf/field_routines.F90
  4. external/io_pnetcdf/wrf_io.F90
  5. frame/module_io.F
  6. share/output_wrf.F

TESTS CONDUCTED:

  1. Do mods fix problem? How can that be demonstrated, and was that test conducted?
    We conducted the performance evaluation on a WRF single-domain benchmark with a grid size of 1900x1300 on Cori at NERSC. The performance improvement of using PnetCDF non-blocking APIs is presented in the paper mentioned above.
  2. Are the Jenkins tests all passing? Not tested

RELEASE NOTE: Support PnetCDF non-blocking APIs to increase parallel I/O performance of PnetCDF. Zanhua Huang, Kaiyuan Hou, Ankit Agrawal, Alok Choudhary, Robert Ross, and Wei-Keng Liao. 2023. I/O in WRF: A Case Study in Modern Parallel I/O Techniques. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). Association for Computing Machinery, New York, NY, USA, Article 94, 1–13. https://doi.org/10.1145/3581784.3613216

yzanhua added 8 commits May 28, 2024 12:57
1. Users can choose to use/enable PnetCDF non-blocking APIs by setting the
"enable_pnetcdf_bput" option to ".true." in the namelist.
2. If "enable_pnetcdf_bput" option is set to ".false.", the blocking APIs
will be used.
3. The "enable_pnetcdf_bput" option only affects the PnetCDF method (i.e.
io_form=11).

This commit adds a subroutine "BputCalcBufferSize". It calculates the
buffer size needed by PnetCDF non-blocking APIs (bput calls). The
returned buffer size equals the total amount of data written (in bytes)
in a particular time step, counting only the data section and excluding
PnetCDF headers.

PnetCDF bput calls need additional buffer space because they cache write
requests in memory and flush them to the file system all together later
on. The additional buffer space is used for this caching.

----------

The implementation of this new subroutine is explained below.

The implementation code is borrowed from
share/output_wrf.F: subroutine "output_wrf".

The subroutine "output_wrf" writes all variables to file.
Its implementation logic is:

  Iterate over all variables (do while loop):
      Some "if" conditions:
          write/output the variable (of size amountX) to file
      End IF
  End loop.

The new subroutine "BputCalcBufferSize" returns the buffer size
(i.e. the total amount of writes). The logic here is to replace
"write/output the variable to file" in the original logic with
"increment total size":

  totalSize = 0
  Iterate over all variables (do while loop):
      Some "if" conditions:
          totalSize = totalSize + amountX
      End IF
  End loop.
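
The counting logic above can be illustrated with a minimal Python sketch. This is a toy model, not the Fortran code in share/output_wrf.F: the `Variable` record, the `is_output` flag (standing in for the "if" conditions), and the sample variables and shapes are all hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Variable:
    name: str         # hypothetical variable name
    shape: tuple      # dimensions of the data written in one time step
    elem_bytes: int   # bytes per element (4 for a 32-bit real)
    is_output: bool   # stands in for the "if" conditions in output_wrf


def bput_calc_buffer_size(variables):
    """Sum the bytes of every variable that would be written in one time step."""
    total_size = 0
    for var in variables:
        if var.is_output:
            # Replace "write the variable" with "increment total size".
            total_size += prod(var.shape) * var.elem_bytes
    return total_size


variables = [
    Variable("T2", (4, 5), 4, True),        # 20 elements * 4 bytes
    Variable("U", (3, 4, 6), 4, True),      # 72 elements * 4 bytes
    Variable("SCRATCH", (4, 5), 4, False),  # not written, so not counted
]
buffer_size = bput_calc_buffer_size(variables)
```

Because the loop visits the same variables under the same conditions as the write loop, the result matches the amount of data the write loop would produce for that time step.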

PNC-nb APIs refer to PnetCDF non-blocking APIs.

This commit provides a subroutine to tell PnetCDF the size of the buffer
to be used by PnetCDF non-blocking bput calls.

PNC-nb refers to PnetCDF non-blocking APIs.

With the non-blocking APIs, multiple write requests are coalesced and
then flushed when an explicit "flush" is called. This commit provides
subroutines to perform the explicit flush. The newly added subroutines
essentially call the nfmpi_wait_all API to perform the flush.

PNC-nb refers to PnetCDF non-blocking APIs.

In this commit, we
1) call BputCalcBufferSize to calculate the buffer size needed by PNC-nb
2) call BputSetBufferSize to tell the PnetCDF library the buffer size
3) call BputWait to flush pending write requests
4) tell the PnetCDF library to detach the buffer at file close.

The call flow is:

  For each time step:
      if buffer_size is not calculated:
          buffer_size = BputCalcBufferSize(...)
      if not yet told PnetCDF the needed buffer_size:
          BputSetBufferSize(buffer_size, ...)

      For each variable:
          post write reqs (but not actually flushed)

      call BputWait(...) to flush pending write reqs

  When closing the file:
      tell PnetCDF library to detach the buffer
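
The call flow above can be sketched as a toy Python model. It does not call PnetCDF: `ToyBputFile` and its methods are hypothetical stand-ins for BputSetBufferSize, the bput calls, BputWait, and the buffer detach at file close, and the "file" is just an in-memory list.

```python
class ToyBputFile:
    """Toy stand-in for a file written through buffered non-blocking puts."""

    def __init__(self):
        self.buffer_size = None  # None means no buffer is attached
        self.pending = []        # requests posted but not yet flushed
        self.written = []        # requests that have reached the "file"

    def set_buffer_size(self, nbytes):
        # Stand-in for BputSetBufferSize: takes effect at most once per file.
        if self.buffer_size is None:
            self.buffer_size = nbytes

    def bput(self, name, data):
        # Post a write request: the data is only cached in memory here.
        used = sum(len(d) for _, d in self.pending)
        if used + len(data) > self.buffer_size:
            raise MemoryError("attached buffer is too small")
        self.pending.append((name, bytes(data)))

    def wait(self):
        # Stand-in for BputWait: flush all pending requests together.
        self.written.extend(self.pending)
        self.pending.clear()

    def close(self):
        # Detach the buffer when the file is closed.
        if self.pending:
            raise RuntimeError("pending requests must be flushed before close")
        self.buffer_size = None


def run(num_steps, variables):
    """Drive the toy file through the per-time-step call flow."""
    f = ToyBputFile()
    buffer_size = None
    for _ in range(num_steps):
        if buffer_size is None:
            # Computed once; the variables are the same in every time step.
            buffer_size = sum(len(d) for d in variables.values())
        f.set_buffer_size(buffer_size)
        for name, data in variables.items():
            f.bput(name, data)  # posted, not yet flushed
        f.wait()                # flush at the end of the time step
    f.close()
    return f


toy = run(2, {"T2": b"\x00" * 8, "U": b"\x00" * 16})
```

In this model the buffer only has to hold one time step's worth of data, because every time step ends with a flush that empties the pending list.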

Here are some important facts:
0) Posting write requests is not yet implemented. The next commit will
   address this.
1) Currently only restart and history files can enable PnetCDF
   non-blocking APIs. They are enabled only if the namelist sets
   "enable_pnetcdf_bput" to ".true.", and io_form_history,
   io_form_restart, or both are set to 11.
2) Buffer size calculation and setting the buffer size happen before
   any write requests are posted.
3) Buffer size calculation happens at most once for all history files.
   This is because the variables are the same across different time
   steps and different history files, so calculating once is enough.
   The same holds for restart files.
4) BputSetBufferSize(...) is called at most once per file.
5) We flush pending/cached requests to file at the end of each time step.
6) The PnetCDF library will only detach the buffer for non-blocking
   APIs during file close.

PNC-nb refers to PnetCDF non-blocking APIs.
In this commit, we call NFMPI_BPUT instead of NFMPI_PUT if the
non-blocking APIs are enabled.

This is an optimization that applies to both blocking and non-blocking
PnetCDF APIs.

In PnetCDF, an application creating a file will first enter
"define mode", in which it can describe all attributes, dimensions,
types and structures of variables. The program will then exit
"define mode" and enter data mode, in which it actually performs I/O.

Previously, WRF entered and exited define mode several times, although
once is sufficient. This commit removes the redundant entering and
exiting of define mode.

This is an optimization for PnetCDF (both blocking and non-blocking
APIs). In WRF, there are two types of variables, partitioned and
non-partitioned. Non-partitioned variables are small variables for which
all processes write the entire variable, with the same values, to the
file. Partitioned variables are large variables; each process is
responsible for a (non-overlapping) sub-region of such a variable.

When writing a non-partitioned variable, there's no need for every
process to make the PnetCDF write call.
1) If non-blocking APIs are used, only rank 0 will post write requests
   using nfmpi_bput. Other processes do not need to call nfmpi_bput
   because nfmpi_bput is not a collective call.
2) If blocking APIs are used, all processes still need to call
   nfmpi_put_all (which is a collective call). But only the process
   with rank 0 has request size > 0; all other processes have request
   size = 0 (i.e. vcount = 0).
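
The two cases above can be summarized in a small Python sketch. It is only a toy model of the decision each process makes for a non-partitioned variable. `write_plan` and its arguments are hypothetical and do not correspond to names in the WRF source.

```python
def write_plan(rank, nonblocking, var_len):
    """Return (makes_call, count) for one process and one non-partitioned variable."""
    count = var_len if rank == 0 else 0  # only rank 0 contributes data
    if nonblocking:
        # The bput call is independent, so other ranks can skip it entirely.
        return (rank == 0, count)
    # The blocking put is collective: every rank calls, possibly with count 0.
    return (True, count)


plans_nonblocking = [write_plan(r, True, 10) for r in range(4)]
plans_blocking = [write_plan(r, False, 10) for r in range(4)]
```

In both cases the variable reaches the file once, from rank 0, instead of being written redundantly by every process.
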
@yzanhua yzanhua requested review from a team as code owners June 4, 2024 16:18
@weiwangncar
Collaborator

@yzanhua One of the compilations has failed; it is the WRFDA build. See the attached output file.
output_0.gz

If you need instructions to build WRFDA, see here.

@wkliao
Contributor

wkliao commented Jun 5, 2024

@yzanhua

The error messages are extracted below.

/opt/rh/devtoolset-9/root/usr/libexec/gcc/x86_64-redhat-linux/9/ld: /wrf/WRFPLUS/main/libwrflib.a(output_wrf.o): in function `bputcalcbuffersize_':
output_wrf.f90:(.text+0x0): multiple definition of `bputcalcbuffersize_'; ./libwrfvar.a(output_wrf.o):output_wrf.f90:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

@wkliao wkliao requested a review from a team as a code owner June 6, 2024 19:42
@weiwangncar
Collaborator

The regression tests have passed:

Test Type              | Expected  | Received |  Failed
= = = = = = = = = = = = = = = = = = = = = = = =  = = = =
Number of Tests        : 23           24
Number of Builds       : 60           57
Number of Simulations  : 158           150        0
Number of Comparisons  : 95           86        0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

@wkliao
Contributor

wkliao commented Jun 6, 2024

Hi, @weiwangncar

Thanks. Can you also modify the regression test to add a test with this
feature enabled? This feature is optional for users and can be enabled
by setting enable_pnetcdf_bput to .true. in the namelist, with io_form = 11.

If it improves the write time significantly, as shown in our SC paper, maybe it
can become a default option in the future.

@weiwangncar weiwangncar changed the base branch from master to develop June 6, 2024 23:31
@weiwangncar
Collaborator

@wkliao We do not test pnetcdf io in the regression test at the moment. But we will test the code on our system. Thanks for contributing this to the community code!
