-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature supporting of different charsets #184
Feature supporting of different charsets #184
Conversation
…charset. MOD H5iosp, H5headerNew, H4iosp, H4header, N3iosp, N3header: added logic for defining charset for reading netcdf files. This definition is made by sending charset as a message to the IOSP object. If charset was not set the default UTF-8 charset will be used as before. The definition of charset is needed because there are tools that create netcdf-files without considering of usage of UTF-8 charset and just use the local charset to write text. Added logic for simple creation of new iosp/header for different netcdf-3 file format: Uniplot CDH format that is netcdf3 with "CDH" magic, little endian byte order and ISO 88591-1 encoding.
… of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown.
… fixed formatting of java-doc comments.
Thank you for your contribution @V-F! I just want to make sure I have a good understanding before proceeding with feedback. From what I can tell, this basically enables IOSPs to read strings from files that potentially contain non-UTF-8 charsets strings (default still UTF-8). Then, the HDF4, HFD5, and netCDF3 IOSPs are modified to allow for the presence of non-UTF-8 charset strings (with the default remaining UTF-8). In this case, it looks like consideration is only given to data and attribute values, but not object names. The control of the charset used is done via the sendIospMessage() method. All of this is to enable new IOSPs in support of netCDF- and HDF-based files where strings were not encoded using UTF-8. Overall I think this makes sense. Finally, as a bonus, a bug fix when using the new builder-based API (thank you!). Does that sound correct? |
Absolutely. Thank you for your correction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments (some repeated across the various classes). Have you had a chance to look at the CLA? It looks like github isn't able to link the email address in your commits with your github account, so signing electronically will take a few additional steps. Once these things are taken care of, we'll be able to merge this in.
cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf4/H4header.java
Outdated
Show resolved
Hide resolved
cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java
Outdated
Show resolved
Hide resolved
cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5iospNew.java
Outdated
Show resolved
Hide resolved
cdm/core/src/main/java/ucar/nc2/internal/iosp/netcdf3/N3iospNew.java
Outdated
Show resolved
Hide resolved
Does the Charset defined here only apply to data values or does it also affect netCDF variable, dimension, attribute, or group names? NetCDF object names are restricted to ASCII and UTF-8 (with some additional restrictions on particular characters) as described in the NUG section on "Characters in NetCDF Names". This caught my eye because there is a conversation about characters allowed in netCDF object names going on in the CF Conventions repo, CF Issue #237. |
…etValueCharset method. MOD N3iospNew, H4iosp, H5iospNew: added @nullable annotation for the parameter of the new added setValueCharset method.
… renamed class member valueCharset to the charset since it is used by reading not only values, but for each string.
Initially we implemented a minimally invasive solution considering the charset only for values but not for names since we found no example files with names containing special characters like umlauts. So we couldn't decide whether the generator used UTF-8 or ISO-8859-1 to encode names. |
…ions (spotlessApply).
Thank you for the update @V-F! I'm wondering, could there be a possible case where the library in charge of writing enforced encoding of object names (say, |
Here is another relevant issue @ethanrd (Unidata/netcdf-c#402) |
@lesserwhirls It might help if one can be sure the object names are encoded in |
The specification for netCDF Classic format is pretty clear that netCDF dimension, variable, and attribute names should be UTF-8 encoded strings. (I say should rather than must because I'm not sure how strong the enforcement is in various implementations, though I believe the name strings go through some kind of Unicode Normalization before the dataset is written, and that process I would guess makes some UTF-8 assumptions.) HDF5 also assumes ASCII or UTF-8 though it doesn't sound like it is necessarily enforced. Here is a comment in another CF discussion describing how HDF handles strings. The overall discussion is about variable and attribute values rather than their names so not actually sure if HDF5 treats variable names and such the same as the data values. @ajelenak-thg Can you tell us if the string handling you describe in the CF comment linked above applies to variable names as well as values? |
Yes, it does. By default, HDF5 library assumes ASCII for attribute and link names (HDF5 links are what gives names to HDF5 objects in a file). The other option is UTF-8, and that is set with the |
Pull request
# Conflicts: # cdm/core/src/main/java/ucar/nc2/internal/iosp/hdf5/H5headerNew.java
To summarize the discussion: I need to change the implementation and use the "charset" as a "valueCharset" i.e. only for values, as required by the specification netCDF dimension, variable and attribute names should be UTF-8 encoded strings. Right? |
Hello @V-F! Yes, I believe that is where things stand. I'm thinking that would mean we would want to keep everything up to commit |
Hello @lesserwhirls! |
…ince it is applied only by reading attribute values. MOD H4iosp, H4header: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values, and also text of the TagText, TagAnnotate and TagTextN. MOD H5iospNew, H5headerNew: renamed "charset" to the "valueCharset" since it is applied only by reading attribute values.
I've renamed member variable "charset" to the "valueCharset". |
Hello @lesserwhirls! P.S. I have errors with 2 local tests: "testThreading1" and "testThreadingN": |
Greetings @V-F! No, those errors are not related to your changes, and as I found when moving from Travis CI to GitHub Actions, are actually dependent on the order the tests are run (fixed in #214). I will give your PR a run on our Jenkins instance, which runs the full suite of tests (takes over an hour or so). |
Jenkins looks good (no new failures), so I think we've got it. Thank you @V-F and @JSchnabel for seeing this through! |
Thank you. |
Sorry, Im just looking at this feature now.
|
Also, how does the C library handle non standard charsets? |
Added supporting of definition custom charset for reading netcdf files since there are tools that create netcdf-files without considering of usage UTF-8 charset and just use the local charset to write text.
The definition of the custom charset is made by sending charset as a message to the IOSP object. If charset was not set the default UTF-8 charset will be used as before.
Added logic for simplifying creation of new iosp/header for custom netcdf-3 file format: Uniplot CDH format use netcdf3 file format with "CDH" magic, little endian byte order and ISO 88591-1 encoding.
These changes partially duplicate the changes from this PR.
Fixed bug by creation of Dimension in H5headerNew::addDimension(definition of unlimited flag before definition of length. Otherwise for unlimited dimension with length == 0 IllegalArgumentException will be thrown).