-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Saving views results in saving the base array #1669
Comments
@perrygreenfield what do you think about dropping saving the 'base' array. @eslavich mentioned that you might have some thoughts on this. One 'pro' is the above issue would be fixed. One 'con' is that files with views of other arrays in the same file would contain duplicate data. I have not yet started investigating how much work would be required to drop this feature (and what effect it would have on downstream packages). I am envisioning a result/solution that would:
I haven't constructed a case for this yet but I imagine we might see the above issue if, for instance, a user opens a roman file, takes a cut-out, and attempts to save this cut-out to an asdf file (they would currently end up with a file that is likely larger than the original file). |
FWIW, in the context of Roman we had accidentally saved the entire L1 file in the L2 files because we wanted to save the 4 reference pixels and were bringing those in as a view. If we keep this feature, it feels worthwhile to throw some kind of warning if the views are much smaller than the image. Brett was able to rapidly diagnose this, but I was not---I looked at all the arrays and the dtypes and saw they were correct but I could not find the "missing memory," which required looking at the block manager et al.. |
If I'm not mistaken, you should be able to tell from the yaml definition for the small arrays that they are views based on the metadata necessary to present a view into a larger array. And now I see a long ago question I need to answer. |
It's true that I remember seeing words like 'strides' in there, but not having ever stared at the yaml definitions before, for me it was not enough to trigger alarms. |
Regarding the old question. If there is only one view of an array (and the base array is not referred to the tree) it should be possible to determine that and thus only save the data in the view. It becomes more complex if there are multiple views (again without a reference to the base array in the tree), but still in principle possible to determine the subarray needed to contain all views. But this suggests some pre-write analysis to do such optimizations (it has to be held until then since the base array may be referenced later) . Also, if views don't share common areas, they could also be made separate arrays. But I would be against losing the shared aspect of views since that has meaning and duplication would break. |
Thanks for the discussion. I don't think the approach of looking for views that overlap would solve the roman issue. There are 4 row/cols of reference pixels on all edges (top, bottom, left, right) of the detector array and views are generated by slicing out these rows/cols. I think minimally we should provide a way to disable this behavior because as seen above there are some cases where we know we don't want to save the base array. I think this could be accomplished by extending the current block options (although those are stored using the base array). I started a draft PR taking this approach: #1753 |
And two cents, the default should be to warn if you're putting a view in a file, or at least provide an easier mechanism to see the size of the blocks (e.g., in asdf.info? or something?). We were definitely making files 3x too big for 6 months without noticing! |
As far as the warning goes, sure until the behavior changes. But when people make slices that they don't care that they share some identical data, I think the onus is on the developer to copy the views before saving. |
I was curious how the current asdf behavior (saving the base array) is impacting roman file sizes (beyond the issue mentioned above). I modified the
I then ran the regression tests (https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FRoman-Developers-Pull-Requests/detail/Roman-Developers-Pull-Requests/601/tests) where each test failure is due to a difference in file size between the 2 conditions. The table below shows the file sizes for the 22 test files. Only 2 had a smaller "shared" size (for these difference was <1%). Many files are significantly smaller for the "not-shared" case (not saving the base array). Just looking at the mean of the ratios, disabling saving the base array will reduce the roman file sizes by ~22%.
|
xref: #1005 |
When asdf stores an array, it stores the base array. This can result in unexpectedly large files like in the following example:
Tested with current main and 2.8.0 (a randomly selected old asdf). For the above example, the block contains 16000000 bytes even though only 400 are needed for the view.
This is related to asdf sharing array data between blocks. One fix for the above issue would be to disable this feature.
The text was updated successfully, but these errors were encountered: