Move ndarray conversion to a Converter #1537
Roman regression tests showing no new errors (19 are failing due to unrelated roman issues).
Running the benchmarks on this PR vs main, I see only one major difference: for 'large' files (these have 26 x 26 = 676 tree nodes), both …
The changes in 52f7bf8 and 77fb0e6 improve the performance of these operations by: …
The JWST regression tests finished with 74 fewer errors than the previous run, which suggests other changes are confounding the result: …
Comparing the jwst CI tests to a run on a PR that includes only documentation changes shows the same failures as the run here: …
So far I've been unable to replicate the 2 failed jwst regression tests.
@eslavich, while doing some testing with reading and writing to (a fake) s3, I encountered 2 new issues that are related to this PR. All of the tests so far have been with a fake s3, so I need to verify that this is the same for real s3.

The first issue is that reading from s3 (using s3fs) produces a file-like object that has no valid fileno, which means that …

The second issue is that writing to s3 (using s3fs) produces a file-like object that is not seekable but that in this case returns valid results for tell. This differs from #1542, where it appears that os.pipe produces a file that is not seekable and calling … (a small probe of these capabilities follows below).

Would you like me to open a separate issue (and follow-up PR) for the second issue above, or update this PR to fix the issue?
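To make the two failure modes concrete, here's a small probe (the helper name and the exact checks are mine, not code from this PR):

import io

def describe_file_like(fd):
    """Report the capabilities that differ between s3fs and os.pipe objects.

    An s3fs read handle raises on fileno() (so it cannot be memory mapped),
    while an s3fs write handle reports seekable() == False yet still returns
    valid offsets from tell().
    """
    try:
        fileno = fd.fileno()
    except (OSError, io.UnsupportedOperation):
        fileno = None  # no usable OS-level descriptor
    return {"fileno": fileno, "seekable": fd.seekable(), "tell": fd.tell()}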
I didn't read this message until now, and I expect to have the review done by the end of the weekend, so I think it will be fine to update this PR. Kind of you to ask :) |
I wasn't able to finish the review, but I also don't want to make you wait any longer for feedback, so here's the first batch of comments...
asdf/_block/external.py (outdated):

from asdf import generic_io, util


class UseInternal:
Not sure if it matters, but there's precedent for this kind of constant in the Python standard library, and it seems to be implemented more like this:

class NotImplementedType:
    pass

NotImplemented = NotImplementedType()
Thanks! I updated the usage here:
228e022
asdf/_block/external.py
Outdated
from asdf import open as asdf_open | ||
|
||
with asdf_open(resolved_uri, lazy_load=False, copy_arrays=True) as af: | ||
self._cache[key] = af._blocks.blocks[0].cached_data |
Well, I'll be! Standard says:
The ability to reference block data in an external ASDF file is intentionally limited to the first block in the external ASDF file...
(no problem here, just documenting my voyage of discovery)
asdf/_block/external.py (outdated):

class ExternalBlockCache:
    def __init__(self):
        self._cache = {}
Should this be an LRU cache? Since we're forcing arrays to be copied into memory, a large exploded file might lead to problems.
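For concreteness, the sort of bounded cache I have in mind (just a sketch; the maxsize is arbitrary):

from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used mapping: evicts the oldest entry
    once maxsize is exceeded."""

    def __init__(self, maxsize=32):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def __contains__(self, key):
        return key in self._data

    def __getitem__(self, key):
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict least recently used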
I'm not quite sure about this one. The old code called AsdfFile.open_external, which caches every opened AsdfFile (in a dictionary). This means that a repeat call to open_external (and then accessing the first block) always returns the same array (as Block caches the data). This behavior is relied upon for matching arrays that share an external block, as tested in test_explode_then_implode:
asdf/asdf/_tests/commands/tests/test_exploded.py (lines 11 to 51 at b0c9a50):

def test_explode_then_implode(tmpdir):
    x = np.arange(0, 10, dtype=float)
    tree = {
        "science_data": x,
        "subset": x[3:-3],
        "skipping": x[::2],
        "not_shared": np.arange(10, 0, -1, dtype=np.uint8),
    }
    path = os.path.join(str(tmpdir), "original.asdf")
    ff = AsdfFile(tree)
    # Since we're testing with small arrays, force all arrays to be stored
    # in internal blocks rather than letting some of them be automatically put
    # inline.
    ff.write_to(path, all_array_storage="internal")
    with asdf.open(path) as af:
        assert len(af._blocks._internal_blocks) == 2
    result = main.main_from_args(["explode", path])
    assert result == 0
    files = get_file_sizes(str(tmpdir))
    assert "original.asdf" in files
    assert "original_exploded.asdf" in files
    assert "original_exploded0000.asdf" in files
    assert "original_exploded0001.asdf" in files
    assert "original_exploded0002.asdf" not in files
    assert files["original.asdf"] > files["original_exploded.asdf"]
    path = os.path.join(str(tmpdir), "original_exploded.asdf")
    result = main.main_from_args(["implode", path])
    assert result == 0
    with asdf.open(str(tmpdir.join("original_exploded_all.asdf"))) as af:
        assert_tree_match(af.tree, tree)
        assert len(af._blocks) == 2
Changing ExternalBlockCache to an LRU cache would mean that when the cache overflows, a call to load would return an array with a different id, which could interfere with settings assignment and block sharing.

I'm inclined to view the memory mapping of external blocks as a bug (and fix it in this PR by not memory mapping), but that does create an issue for large exploded files, as you pointed out. Since ExternalBlockCache is used by the Manager, which knows whether blocks should be lazy loaded and/or memory mapped, another option would be to see how complicated it would be to pass these options to ExternalBlockCache.load and give the user some control over how the external blocks are loaded. What do you think about this as an option for addressing the concern about large exploded files? A sketch of what I mean follows.
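The signature here is illustrative, not a final API; the body mirrors the hunk quoted above:

from asdf import open as asdf_open

class ExternalBlockCache:
    def __init__(self):
        self._cache = {}

    def load(self, resolved_uri, validate_checksums=False):
        if resolved_uri not in self._cache:
            # Data is copied while the external file is open, so the file
            # can be closed before the array is handed back to the caller.
            with asdf_open(
                resolved_uri,
                lazy_load=False,
                copy_arrays=True,
                validate_checksums=validate_checksums,
            ) as af:
                self._cache[resolved_uri] = af._blocks.blocks[0].cached_data
        return self._cache[resolved_uri]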
I spent a bit of time looking at this today and I think there may be a few issues with passing along open options.

lazy_load is not an issue (and is currently supported in this PR: the ndarray converter sets up lazy loading for the external block, then forces its loading if lazy_load was disabled).

validate_checksums also should not be an issue, but it is not currently being passed on in this PR.

copy_arrays is where things get complicated. The main complication is when the external file is closed. For a non-memmapped external block, the external file can be opened, the block data read, and the file closed before the block data is returned. This is not true for memmapped blocks, where the external file needs to be held open until the AsdfFile instance is closed. This was one of the reasons the previous BlockManager held a reference to the AsdfFile (so that it could use AsdfFile.open_external, which can keep the external files open and close them when the AsdfFile is closed).
I could see a few options to make memory mapping external blocks possible, but I'm not sure the complexity outweighs the benefits.

1. Duplicate the AsdfFile.open_external functionality in _block.Manager. asdf prior to this PR called close on the block manager, and we will need to re-add that call to allow the block manager to close the external files.
2. Allow generating pure numpy.memmap objects (which don't guarantee file closing) for external blocks.

I'm going to think about this one some more, but any suggestions on how to handle this are appreciated. (A rough sketch of option 2 is below.)
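For option 2, something along these lines (a sketch; the offset and size would come from the parsed block header, and the names are illustrative):

import numpy as np

def memmap_external_block(path, data_offset, data_size):
    """Map an external block's data region directly with numpy.memmap.

    The mapping owns its own file handle, so asdf never has to hold the
    external file open -- but it also cannot control when that handle
    is closed.
    """
    return np.memmap(path, dtype=np.uint8, mode="r",
                     offset=data_offset, shape=(data_size,))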
I tried 2 above (allowing generation of a pure numpy.memmap) and it came together pretty cleanly. See: a488690

Note that this does not support a file open mode (so memory maps are always read-only). The open mode is currently not passed down to the block manager, so more work would be required to get it to the numpy.memmap call.
This got me thinking about other structures that might need to be cleaned up on AsdfFile.close. It seems like it is safe to clear (set to None) all cached_data for internal blocks. I made that change here: d4498e4
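Roughly this shape (a sketch of the idea, not the exact diff in d4498e4):

def close(self):
    # Dropping cached block data on AsdfFile.close lets large arrays be
    # garbage collected even if something still references the AsdfFile.
    for blk in self._blocks.blocks:
        blk.cached_data = None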
Thanks for the detailed review @eslavich :)
- add test to make sure a failed write doesn't modify the tree or the version
- tests that updated compressed block data does not cause a failed update
🎉
I updated and released sphinx-asdf and updated pyproject.toml (setting a lower pin and removing the dev requirement), so now the docs build with the PyPI version of sphinx-asdf. I've opened PRs for: …
This PR moves ndarray handling to a converter.
This required extensive changes to the internal block API, which had several knock-on effects.
Public changes
AsdfConfig.convert_unknown_ndarray_subclasses
Moving ndarray to a converter means that subclasses are no longer automatically converted. This change would break stdatamodels (where FITS_rec is automatically converted to ndarray) and breaks one test in asdf-astropy (because of NDArrayMixIn). It seems likely there are other instances 'in the wild', so an AsdfConfig option, convert_unknown_ndarray_subclasses, was added (along with some internal changes) to allow the new ndarray converter to continue handling subclasses. If an instance of an ndarray subclass that is not explicitly handled by a converter is encountered during serialization and the convert_unknown_ndarray_subclasses option is enabled (as it is by default), the instance will be coerced into an ndarray and a warning will be issued (this warning means that the test in asdf-astropy will continue to fail, as warnings are converted to errors). A note was added to the docs describing that the default setting will be changed to False and the option (and special handling) removed in a future version of asdf.
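For example, downstream code can opt in to the planned future behavior now by disabling the option (a sketch; like other AsdfConfig settings, it is reached through asdf.get_config()):

import asdf
import numpy as np

class MyArray(np.ndarray):
    """An ndarray subclass with no converter of its own."""

arr = np.arange(10).view(MyArray)

# With the option enabled (the default), serializing this subclass coerces
# it to a plain ndarray and issues a warning.
af = asdf.AsdfFile({"data": arr})

# Opting in to the planned future behavior: unknown ndarray subclasses are
# no longer automatically converted.
asdf.get_config().convert_unknown_ndarray_subclasses = False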
asdf.Stream moved to asdf.tags.core.Stream and asdf.stream deprecated

This is more consistent with other special tag-related objects (asdf.tags.core.NDArrayType, asdf.tags.core.IntegerType, etc.). As an added benefit, the eventual removal of asdf.stream should allow the docs to build for development installs on case-insensitive systems (by removing the ambiguous asdf.Stream and asdf.stream).

Changes to Converter and SerializationContext methods for block access

This changes some of the unreleased public API added in #1508.
The updates to block management needed to move ndarray to a converter meant that some of the complexity added in #1508 could be removed. With the changes in this PR, the SerializationContext is now aware of whether a serialization or deserialization is being performed and which object is being handled (and can automatically generate and assign keys for most use cases). This means that asdf.util.BlockKey is no longer needed (extension code still needs to generate keys if an object reads from multiple blocks; by generating keys with SerializationContext.generate_block_key, asdf can automatically associate them with the correct object). Key-related changes to SerializationContext are:

- find_block_index(key, data_callback=None) was renamed to find_available_block_index(data_callback, key=None)
- get_block_data_callback(index) became get_block_data_callback(index, key=None)
- assign_block_key was removed (all assignment is now automatic)

These changes also mean that Converter.reserve_blocks is no longer needed and was removed.
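To illustrate the renamed methods, a converter that stores its data in a block might now look roughly like this (BlockData, its payload attribute, and the tag URI are invented for the example):

from asdf.extension import Converter

class BlockData:
    """Toy object whose bytes live in an ASDF block (hypothetical)."""

    def __init__(self, payload):
        self.payload = payload

class BlockDataConverter(Converter):
    tags = ["asdf://example.com/tags/block_data-1.0.0"]  # illustrative
    types = [BlockData]

    def to_yaml_tree(self, obj, tag, ctx):
        # The context knows which object is being serialized, so no
        # explicit key is needed for this single-block case.
        index = ctx.find_available_block_index(lambda: obj.payload)
        return {"block_index": index}

    def from_yaml_tree(self, node, tag, ctx):
        # Returns a callable that reads the block (lazily, when invoked).
        data_callback = ctx.get_block_data_callback(node["block_index"])
        return BlockData(data_callback())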
array settings copied between AsdfFile instances

To fix #1524 and remove an undocumented behavior, creating an AsdfFile from an existing AsdfFile (AsdfFile(af)) no longer attempts to copy block options from the original AsdfFile. This was previously done inconsistently and could result in files that could not be opened (see #1524).

external blocks are no longer memmapped

In working towards a fix for #1525, memmapping of external blocks was disabled. With this PR, external blocks are always lazy loaded and never memmapped. If these are sensible settings, I propose we document the behavior and close the issue. If instead further changes are needed to allow control over lazy loading, memmapping, or read/write mode, then a separate PR will be needed.

AsdfFile.update block organization change

Instead of attempting to limit the number of blocks moved or rewritten (which added significant complexity), AsdfFile.update now writes all new blocks to the end of the file, then moves those blocks to just after the end of the new tree and updates the associations between data callbacks and blocks so that objects that lazily load blocks read the correct block. This new strategy allows updating with compressed blocks (where the size is unknown until the data is written) and accommodates any change in tree size (even one that completely overwrites all previous blocks). The major downside is that a very large file can temporarily almost double in size during an update. However, the final file size after an update with this PR should be equivalent or even smaller (Windows support for truncating was added in this PR to allow the final file to be a sensible size without excessive padding). This seems like a reasonable trade-off (exchanging some storage inefficiency for simplicity in code) to fix the numerous update-related errors found during work on this PR.

Private API changes
Removal of block.py and replacement with _block submodules

The Block and BlockManager classes were removed and replaced with a number of new classes and functions. Documentation was added to the new _block submodules and should serve as the primary description of this new code (and should be updated based on questions and comments on the PR).

The changes to the internal block API mean that sphinx-asdf is no longer compatible with this PR. A PR was opened for sphinx-asdf to make it compatible with both the changes in this PR and previous versions of ASDF: asdf-format/sphinx-asdf#64
To allow the docs to build for this PR, the requirements were temporarily changed to install sphinx-asdf from the source branch of the sphinx-asdf PR linked above. Once the private block API changes in this PR are settled, the sphinx-asdf PR can be finalized, merged, and released, and the requirements updated here (either before or after merging, but before the release of asdf 3.0).
Test structure

Numerous issues were discovered during this rewrite, including:

- seekable #1545
- SerializationContext missing from docs #1549

Fixes or work towards fixes for many issues are included in this PR:

- SerializationContext should have the same public set_array_storage methods as AsdfFile; then this issue could be solved
- generic_io …

To make it more obvious which tests correspond to issues, which are unit tests, and which are now 'legacy' tests, a new test organization was started in this PR.

'legacy' tests were mostly untouched (left in the same location and only updated to reflect changes to any internal API used and changed in this PR).

Tests that correspond to a specific issue are organized in _tests/_issues. Each issue gets its own file (test_1520.py) with a function (test_1520) whose docstring links to the GitHub issue and whose code closely reflects any minimal test case in the issue (see the sketch below).

New unit tests (like those added for _block) are organized in a directory structure that matches the code layout (asdf._block.key tests live in asdf._tests._block.key). The goal is to have 100% test coverage for a submodule using only the tests for that module. Measurement of this goal is not currently done in the CI (and has to be checked manually), and this PR does not achieve that goal for all changes (especially asdf._block.manager, which attempts to mimic the old BlockManager).