Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SWMR in HDF5 (updated) #2653

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Conversation

ZedThree
Copy link
Contributor

@ZedThree ZedThree commented Mar 7, 2023

I've taken the work @eivindlm started in #1448, and done the following:

  • merged in recent main and fixed conflicts
  • removed use of deleted macro
  • added --enable-hdf5-swmr option to configure.ac
  • changed the use of add_definition in CMake to defining a macro in config.h, to be consistent with autotools version
  • fixed a bug where H5Pset_libver_bounds wasn't set when opening a file
  • replaced calls to H5Pstart_swmr_write with file access flags
    • this was required to get the tests to pass

This feature needs some documentation, as it's not quite straightforward to use. Where's best to put that?

Should this also have a feature flag set in netcdf_meta.h, and in nc-config?

Given that this feature needs explicitly enabling at configure time, should it be added to an existing/new CI build?

eivindlm and others added 14 commits July 10, 2019 21:16
* main: (1776 commits)
  Invert solution as discussed at Unidata#2618
  Correct a potential null pointer dereference.
  Fix a logic error that was resulting in an easy-to-miss error when running configure.
  Fix issue with dangling test file not getting cleaned up.
  Turn nczarr zip support off by default in cmake, add a status line indicating whether nczarr-zip-support is available, in libnetcdf.settings.
  Update the version of the cache action used by github action from v2 to v3.
  Explicit cast to unsigned char.
  More issues returned by sanitizer, related to attempts to assign MAX_UNSIGNED_CHAR (255) to a variable of type char.
  Fixed an issue where memcpy was potentially passed a null pointer.
  Correct another uninitialized issue.
  Correct undefined variable error.
  Fixing issues uncovered by compiling with '-fsanitize=undefined' and running nc_test/nc_test, in support of Unidata#1983, amongst others.  The cast as was being used was undefined behavior, and had to be worked around in a style approximating C++'s 'reinterpret_cast'
  Remove a stray character at the head file.
  Fix a distcheck failure with nczarr_test/run_interop.sh
  Turn benchmarks off by default. They require netcdf4, additional logic is required in order to have them on by default.
  Add execute bit to test scripts
  Fix missing endif statement
  Add generated parallel tests for nc_perf, cmake-based build system.
  Correct typo in CMakeLists.txt
  Wiring performance benchmarks into cmake, cleaned up serial compression performance test dependency on MPI.
  ...
@edwardhartnett
Copy link
Contributor

Awesome work!

Definitely add a flag to netcdf_meta/nc-config for this!

@ZedThree
Copy link
Contributor Author

ZedThree commented Mar 7, 2023

Ok, I've added NC_HAS_HDF5_SWMR to netcdf_meta.h and --has-hdf5-swmr to nc-config

@DennisHeimbigner
Copy link
Collaborator

HDF5 has lots of features that are not supported in netcdf-c.
In order to include such features, we need evidence that it is
high demand.

@edwardhartnett
Copy link
Contributor

@DennisHeimbigner SWMR is a very useful feature for HPC users. It is frequently the case that a model is writing results and other processes want to see those results. To calculate the next timestep, all processors need some data from the previous timestep. The simplest solution would be for those reading processes to read the file and get what they want, but without SWMR it will not work.

A NOAA group just asked about this recently, so I believe it is a feature that will be used by multiple modeling groups, There's also a good chance that PIO would allow access to this features, and PIO is used in several important models at NOAA and NCAR.

@edwardhartnett
Copy link
Contributor

@gsjaardema what do you think about SWMR?

@ZedThree
Copy link
Contributor Author

ZedThree commented Mar 9, 2023

The inability to watch netCDF4 files while they are being written is the main complaint I get from researchers when upgrading applications and libraries. Using SWMR removes that limitation. This is a pretty basic QoL feature that would benefit a lot of users.

It would be even better it this feature could be enabled and used completely automatically, but it's not clear if that's even possible.

@gsjaardema
Copy link
Contributor

@gsjaardema what do you think about SWMR?

I agree with @ZedThree that many users want to be able to query the HDF5 file while it is being written and the inability to do that is a big complaint. Especially when it works for netcdf-3 and netcdf-5 but not netcdf-4. Explaining that to users is confusing.

At one time, SWMR did not work with parallel files; I don't remember if that has changed yet. We would at a minimum need to support parallel writing with serial read; parallel write and parallel read would be nice. Serial write/serial read which SWMR definitely supports is also useful for us though.

There is also a question of overhead -- does use of SWMR entail any performance hit on the writing. If no or minimal, then I could see using it at all times; if there is a performance hit, then we would probably need it to be turned on/off on a file-by-file basis.

@ZedThree
Copy link
Contributor Author

Just to be clear, in case there's any confusion, this is a completely opt-in feature currently, and requires netCDF to be configured and built with explicit support, as well as a runtime flag when opening the file.

I think SWMR is incompatible with parallel files, and only works in serial. I have seen some references to "VFD SWMR", "full SWMR", and "MWMR" (multiple writer, multiple reader), but I'm not sure of the availability of these features.

@Mitch3007
Copy link

Mitch3007 commented Nov 14, 2023

Hi all, great stuff! I was just wondering if there had been any updates on this work regarding building into the main branch or any other initiatives in NetCDF over the past six months that do something similar? Very keen to be able to read NetCDF4 model result files (in QGIS for example) while a hydrodynamic model is still running. The ability to review results 'on-the-fly' as a model is running is an important part of flood and coastal modelling workflow, and this functionality will provide great value to researchers and practitioners.

@DennisHeimbigner
Copy link
Collaborator

DennisHeimbigner commented Nov 14, 2023

replaced calls to H5Pstart_swmr_write with file access flags; this was required to get the tests to pass

Where is this change; I do not see it in the files changed for this PR.

@ZedThree
Copy link
Contributor Author

I was just wondering if there had been any updates on this work regarding building into the main branch or any other initiatives in NetCDF over the past six months that do something similar?

It looks like this branch has fallen out of date a little bit, I'll try and fix that up. I don't think there has been any other work on this though.

If you're able, it would be nice to see if you could try out this branch and see if there's any major issues I've missed!

Where is this change; I do not see it in the files changed for this PR.

This is just a difference in implementing this feature between #1448 and this PR, so it doesn't appear in the diff between this branch and main.

In libhdf5/hdf5open.c in #1448 there was:

    if (mode & NC_WRITE && mode & NC_HDF5_SWMR) {
      if ((retval = H5Fstart_swmr_write(h5->hdfid))) {
        BAIL(retval);
      }
    }

and this PR I instead changed the file access flags:

      if((mode & NC_HDF5_SWMR)) {
        flags |= H5F_ACC_SWMR_WRITE;
      }

The former starts SWMR mode after the file as been opened; the latter opens the file immediately in SWMR mode. I don't recall exactly why this made any difference, except that it was required to get the tests to pass.

@Mitch3007
Copy link

Mitch3007 commented Nov 20, 2023

Hi @ZedThree, (and an early disclaimer here about me being a hack), I've been testing the past few days and have compiled #2652 on top of HDF5 (have completed tests on both HDF5 1.10.11 and 1.14.1-2). For the most part it compiles successfully however tst_filter appears to fall over. See outputs at the bottom of this comments.Any advice is certainly welcome.

Despite tst_filter I've pushed on. Our program uses netcdf-fortran to write results, so to try and test your commits i've built v4.6.1 (https://github.com/Unidata/netcdf-fortran/tree/v4.6.1) on top of #2653. Sadly, I appear to be still getting file lock issues (no doubt user error here...).

Are you aware of anything within the netcdf-fortran libs what will also need to be updated to enabled swmr, or potentially if there are additional flags I should be setting when first opening our result files via nf90_create. Happy to send more information through. Many thanks for your help. Cheers, Mitch.

image

_=======================================================
   netCDF 4.9.3-development: nc_test4/test-suite.log
=======================================================

# TOTAL: 83
# PASS:  82
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: tst_filter
================

findplugin.sh loaded
final HDF5_PLUGIN_DIR=/home/mjs/dev/tuflowfv_netcdf/tuflowfv/tuflow_shared_external/netcdf-c/plugins/.libs
filter szip not available
*** Testing dynamic filters using API

*** Testing API: bzip2 compression.
show parameters for bzip2: level=9
show chunks: chunks=4,4,4,4

*** Testing API: bzip2 decompression.
data comparison: |array|=256
no data errors
*** Pass: API dynamic filter

*** Testing dynamic filters parameter passing
test1: compression.
test: nparams=14: params= 1 239 23 65511 27 77 93 1145389056 3287505826 1097305129 1 2147483648 4294967295 4294967295
dimsizes=4,4,4,4
chunksizes=4,4,4,4
>>> mismatch: tint64
>>> mismatch: tuint64
>>> mismatch: tfloat64
test1: decompression.
>>> mismatch: tint64
>>> mismatch: tuint64
>>> mismatch: tfloat64
data comparison: |array|=256
no data errors
test2: dimsize % chunksize != 0: compress.
test: nparams=14: params= 2 239 23 65511 27 77 93 1145389056 3287505826 1097305129 1 2147483648 4294967295 4294967295
dimsizes=4,4,4,4
chunksizes=4,4,4,4
>>> nbytes = 1024 chunk size = 1024
test2: dimsize % chunksize != 0: decompress.
>>> nbytes = 1024 chunk size = 1024
data comparison: |array|=256
no data errors
test3: dimsize % chunksize != 0: compress.
test: nparams=14: params= 3 239 23 65511 27 77 93 1145389056 3287505826 1097305129 1 2147483648 4294967295 4294967295
dimsizes=4,4,4,4
chunksizes=4,4,4,4
test3: error code = 0
test3: dimsize % chunksize != 0: decompress.
data comparison: |array|=256
no data errors
*** Pass: parameter passing
*** Testing dynamic filters using ncgen
*** Pass: ncgen dynamic filter
*** Testing dynamic filters using nccopy
	*** Testing simple filter application
	*** Pass: nccopy simple filter
	*** Testing '*' filter application
	*** Pass: nccopy '*' filter
	*** Testing 'v&v' filter application
	*** Pass: nccopy 'v|v' filter
	*** Testing pass-thru of filters
	*** Pass: pass-thru of filters
	*** Testing -F none
	*** Pass: -F none
	*** Testing -F var,none 
	*** Pass: -F var,none
*** Pass: all nccopy filter tests
*** Testing dynamic filters using ncgen with -lc
*** Pass: ncgen dynamic filter
*** Testing multiple filters

*** Testing Multi-filter application: filter set = bzip2 deflate noop
filters verified
show chunks: chunks=4,4,4,4
direction=compress id=40000 cd_nelmts=0 cd_values=

*** Testing Multi-filters.
filters verified
direction=decompress id=40000 cd_nelmts=0 cd_values=
data comparison: |array|=256
no data errors
*** nccopy -F with multiple filters
direction=compress id=40000 cd_nelmts=0 cd_values=
*** ncgen with multiple filters
*** Pass: multiple filters
*** Testing filter re-definition invocation
*** Testing multiple filter order of invocation on create
7a8,9
> direction=compress id=40000 cd_nelmts=1 cd_values= 0
> direction=compress id=40001 cd_nelmts=1 cd_values= 1
11,12d12
< direction=compress id=40000 cd_nelmts=1 cd_values= 0
< direction=compress id=40001 cd_nelmts=1 cd_values= 1_

include/netcdf.h Outdated Show resolved Hide resolved
@ZedThree
Copy link
Contributor Author

@Mitch3007 Thanks for taking a look! I'm not sure about the test errors -- it passes everything in CI, and this feature shouldn't interact with the filters at all, so it's possible there's an issue with your environment.

Are you aware of anything within the netcdf-fortran libs what will also need to be updated to enabled swmr, or potentially if there are additional flags I should be setting when first opening our result files via nf90_create.

Ah yes the Fortran API will need the NC_HDF5_SWMR flag exposing. Currently it uses the same value as the deprecated MPIIO flag, so you should use NF90_MPIIO. There will need to be a corresponding NF90_HDF5_SWMR if this gets merged.

There's currently no docs for this feature, but to use it properly, you should first open the file normally and create all the variables and attributes. Then, close and reopen the file in SWMR mode (nc_open(FILE_NAME, NC_WRITE|NC_HDF5_SWMR, &ncid)) to append your data. At this point, you should be able to open the file for reading in a different process, also with the SWMR flag.

Reading through the HDF5 docs, I realise this PR is actually missing a crucial part, a call to H5Drefresh. This needs to be called from the reading process to refresh its metadata of a dataset. I'll add this to NC4_get_vars.

CMakeLists.txt Outdated Show resolved Hide resolved
* main:
  Fixed various UBSan warnings about working with NULL pointers
  Fixed misaligned memory access flagged by UBSan
  disable test that depend on ncpathcvt in cmake build w/o utilities
  Update RELEASENOTES
  Remove the execinfo capability
  fix memory leak
  ckp
  remove conflicts
  Fixed various UBSan warnings about invalid bit shifting
  Add fenceposting so DLL flags are only introduced wwhen we are compiling DLL-based shared libraries.
  Working to get the proposed change working with Visual studio
  CMake: Use helper libraries for nczarr tests
  Fix most float conversion warnings
  Update release notes
  Update internal tinyxml2 code to the latest version
  Update release notes
  Improve fetch performance of DAP4
Previously, setting some HDF5 CMake config flags directly would skip
both the check for HDF5 1.10 and setting the `HDF5_HAS_SWMR` flag
Copy link
Contributor

@magnusuMET magnusuMET left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be mistaken but I think the version handling code is wrong in places. You'll always want the highest version possible, but limit downwards to 1.10 (when SWMR was introduced) when using this feature.

libhdf5/hdf5create.c Outdated Show resolved Hide resolved
libhdf5/hdf5create.c Outdated Show resolved Hide resolved
libhdf5/hdf5open.c Outdated Show resolved Hide resolved
libhdf5/hdf5open.c Outdated Show resolved Hide resolved
@ZedThree
Copy link
Contributor Author

@magnusuMET Yes, I think you're right. Actually, I'm not sure why we can't just sidestep the issue and only call H5Pset_libver_bounds when enabling SWMR mode.

@ZedThree
Copy link
Contributor Author

@magnusuMET You prompted me to look into this properly -- it turns out that since the original PR (#1448) that this one is based on, a new function hdf5set_format_compatibility was added which sets the libver bounds, and that was being called after the new call to H5Pset_libver_bounds here with the result that the values would get clobbered anyway.

The bounds that are currently set (in main) have H5F_LIBVER_LATEST for the high bound, so we don't have to do anything special for SWMR: simply creating a file in SWMR mode sets the file format to the HDF5 1.10 format (superblock 3).

Libver bounds already taken care of elsewhere, and merely opening the
file in SWMR mode will correctly set format to at least v110
@ZedThree
Copy link
Contributor Author

I've added nc_test4/test_hdf5_swmr.sh that now properly exercises the SWMR capability. Running this test creates separate processes for simultaneously writing to and reading from the same file. This was very useful because it flagged that we need to refresh the metadata when reading dimension lengths. With that implemented, I think this is now the minimally useful version of the feature.

Things that are now missing:

  • testing in CI
  • some docs on how to actually use this
  • maybe ncwatch, the equivalent of h5watch for watching files (basically ncdump but in SWMR mode)

This is currently gated behind a build-time flag (as well as the runtime flag). Does that still make sense? I notice that the CI now only runs on HDF5 1.10.8+, so if 1.8.x is no longer supported, this feature should always be available (though not always turned on of course!). It would also mean not needing an additional CI job.

@WardF @DennisHeimbigner thoughts on how to proceed from here?

@DennisHeimbigner
Copy link
Collaborator

Has anyone tested this under Windows (Visual Studio)?

@ZedThree
Copy link
Contributor Author

I don't have access to a Windows dev environment myself unfortunately. When the Windows tests are running on CI, could use that?

@Mitch3007
Copy link

Mitch3007 commented Dec 17, 2024

I don't have access to a Windows dev environment myself unfortunately. When the Windows tests are running on CI, could use that?

Hi @ZedThree,

I have access to both Linux (using CentOS7) and Windows 10 VS Studio 2019 build environments so am happy to undertake some Visual Studio tests once I've got things up and running properly on Linux. I've had some time to look at this again over the past week, however I'm struggling to build the swmr tests within netcdf-c/nc_test4/CMakeLists.txt:

if (ENABLE_HDF5_SWMR)
BUILD_BIN_TEST(test_hdf5_swmr_writer)
BUILD_BIN_TEST(test_hdf5_swmr_reader)
ADD_SH_TEST(nc_test4 test_hdf5_swmr)
endif()

The terminal output when running the tests (which doesn't include test_hdf5_swmr) is provided as follows.

_-----------------------------
make  check-TESTS
make[2]: Entering directory `/home/mjs/dev/tuflowfv_netcdf/tuflowfv/tuflow_shared_external/netcdf-c/nc_test4'
make[3]: Entering directory `/home/mjs/dev/tuflowfv_netcdf/tuflowfv/tuflow_shared_external/netcdf-c/nc_test4'
PASS: tst_dims
PASS: tst_dims2
PASS: tst_dims3
PASS: tst_files
PASS: tst_files4
PASS: tst_vars
PASS: tst_varms
PASS: tst_unlim_vars
PASS: tst_converts
PASS: tst_converts2
PASS: tst_grps
PASS: tst_grps2
PASS: tst_compounds
PASS: tst_compounds2
PASS: tst_compounds3
PASS: tst_opaques
PASS: tst_strings
PASS: tst_strings2
PASS: tst_interops
PASS: tst_interops4
PASS: tst_interops5
PASS: tst_interops6
PASS: tst_interops_dims
PASS: tst_enums
PASS: tst_coords
PASS: tst_coords2
PASS: tst_coords3
PASS: tst_vars3
PASS: tst_vars4
PASS: tst_chunks
PASS: tst_chunks2
PASS: tst_utf8
PASS: tst_fills
PASS: tst_fills2
PASS: tst_fillbug
PASS: tst_xplatform
PASS: tst_xplatform2
PASS: tst_endian_fill
PASS: tst_atts
PASS: t_type
PASS: cdm_sea_soundings
PASS: tst_camrun
PASS: tst_vl
PASS: tst_atts1
PASS: tst_atts2
PASS: tst_vars2
PASS: tst_files5
PASS: tst_files6
PASS: tst_sync
PASS: tst_h_scalar
PASS: tst_rename
PASS: tst_rename2
PASS: tst_rename3
PASS: tst_h5_endians
PASS: tst_atts_string_rewrite
PASS: tst_hdf5_file_compat
PASS: tst_fill_attr_vanish
PASS: tst_rehash
PASS: tst_filterparser
PASS: tst_bug324
PASS: tst_types
PASS: tst_atts3
PASS: tst_put_vars
PASS: tst_elatefill
PASS: tst_udf
PASS: tst_put_vars_two_unlim_dim
PASS: tst_bug1442
PASS: tst_quantize
PASS: tst_h_transient_types
PASS: tst_alignment
PASS: tst_h_strbug
PASS: tst_h_refs
PASS: run_empty_vlen_test.sh
PASS: tst_v2
PASS: run_grp_rename.sh
PASS: tst_misc.sh
PASS: test_fillonly.sh
PASS: tst_fixedstring.sh
PASS: tst_filter.sh
PASS: tst_specific_filters.sh
PASS: tst_bloscfail.sh
PASS: tst_filter_vlen.sh
PASS: tst_filter_misc.sh
make[4]: Entering directory `/home/mjs/dev/tuflowfv_netcdf/tuflowfv/tuflow_shared_external/netcdf-c/nc_test4'
make[4]: Nothing to be done for `all'.
make[4]: Leaving directory `/home/mjs/dev/tuflowfv_netcdf/tuflowfv/tuflow_shared_external/netcdf-c/nc_test4'
============================================================================
Testsuite summary for netCDF 4.9.3-development
============================================================================
# TOTAL: 83
# PASS:  83
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================

-----------------------------_

Is there anything additional I need to add to my build environment to add the nc_test4 list?

As a test, I've also tried to commenting out the ENABLE_HDF5_SWMR if as follows:

#if (ENABLE_HDF5_SWMR)
BUILD_BIN_TEST(test_hdf5_swmr_writer)
BUILD_BIN_TEST(test_hdf5_swmr_reader)
ADD_SH_TEST(nc_test4 test_hdf5_swmr)
#endif()

However this still doesn't seem to add the tests. Any thoughts on what I might be doing wrong or need to add? Very much appreciate any help you can provide.

I've attached the build script I'm using to build netcdf-c, which I hope provides some further context on how I'm building. build_netcdf.txt. I've renamed from .sh to .txt here to hopefully avoid any security issues.

The HDF version I'm building on is 1.14.1-2.

Cheers, Mitch.

@WardF
Copy link
Member

WardF commented Dec 17, 2024

@Mitch3007 When you run cmake, there should be a line in the output that either confirms 'Found bash:orbash shell not found.` Can you take a look and see what's being reported? Similarly, I'll test this in my environment where I know I have bash, and then I will see if I can get the conflicts sorted. Thanks!

@Mitch3007
Copy link

Hi @WardF, thanks for taking a look at this :)

I've reviewed the terminal output when building netcdf-c. I don't seem to see any 'Found bash:' or 'bash shell not found output' however the build environment looks to be using -bash: BASH_VERSION = 4.2.46(2)-release. Does that look ok to you, or is there any specific log/output file I can take a look at further? I've attached the full terminal output: build_netcdf_all_terminal_output.txt if it's of any use as reference.

For further context the build process I'm using is as follows:

  1. Setup Intel OneAPI environment via call to source set_dev_environment.sh (have attached as set_dev_environment.txt
  2. Run build_all.sh to build netcdf-c (attached as build_all.txt). In the attached terminal output hdf5 had already been built previously, hence the --no_hdf5 flag. This script calls build_netcdf.sh (build_netcdf.txt). Note, I've commented out the netcdf-fortran build for the moment as I'll cross that bridge once swmr is confirmed to be working ok on my C Linux and Windows build environments.

Also of interest, I was expecting test_hdf5_swmr_reader.c and test_hdf5_swmr_writer.c to produce some .o files in netcdf-c/nc_test4 but I don't appear to be building them.
swmr_nc_test4

Thanks again, Mitch.

@Mitch3007
Copy link

Mitch3007 commented Dec 23, 2024

Hi @WardF ,

Despite the nc_test4 build issues mentioned above I've been able successfully build pull request commit ff7bde5 on both Linux and Windows build environments. To test swmr, I'm using fortran 90 and have put together two small fortran prototype programs.

The first opens the file in swmr mode, declares the file then closes it and opens it again for writing. It writes out 20 time steps to a CF-compilant NetCDF grid on long lat coordinates.

The second program opens the NetCDF in swmr read mode, reports some information on the length of the time dimension, then closes the file. There is a short pause (0.1 seconds) and then it reopens and repeats the process 200 times.

To get it working on fortran I've pulled down netcdf-fortran 138a6b7 and have locally added to netcdf-fortran/fortran/netcdf_constants.F90:

! Flags for SWMR
integer, parameter, public :: nf90_hdf5_swmr = 65536

So @DennisHeimbigner and @ZedThree, swmr appears to be working on Windows and Linux. Keen to hear where we need go from here. Happy to upload my test programs if you think it will be useful. Please let me know if you need anything further from me to help progress this one.

@WardF
Copy link
Member

WardF commented Jan 3, 2025

The next step will be to figure out these conflicts, I will take a stab at that. Any additional thoughts @DennisHeimbigner ?

@DennisHeimbigner
Copy link
Collaborator

The only thing that comes to mind is the threading issue.
Remember that while HDF5 has locking, netcdf does not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants