Multiple ASDF block indices in file? #989
Strange... I haven't seen this before. The extra block index string does appear in the block data:

```python
In [22]: b"ASDF BLOCK INDEX" in af.blocks.get_block(0).data.tobytes()
Out[22]: True
```

but it also occurs in the compressed data in the file, so blosc must not be modifying those bytes? The checksum of the decompressed data matches:

```python
In [24]: af.blocks.get_block(0).validate_checksum()
Out[24]: True
```

which implies that the block index was in the data before the file was written?
Okay, I figured it out. You put me on the right track by observing that the extra block index was in the data before the file was written. Consider this pattern:

```python
a = np.empty(10**5)
a[:10] = 1.  # initialize only the first 10 elements
af = asdf.AsdfFile(dict(arr=a[:10]))  # only the first 10 elements are in the tree, but all 10^5 elements are written
af.write_to('a.asdf')
```

Executing a pattern like that multiple times in a single process runs the risk of recycling a region of memory that used to hold a block index, and this can produce un-openable ASDF files if the second index is within one file system page of the first. Here's an example: https://drive.google.com/file/d/1DJr2qkdC2Qm3tos1cigmbUK8ngFzha5K/view?usp=sharing

While this is a pretty wacky bug, I also think the mechanism is pretty generic. The ASDF block index detection could be made more robust, perhaps by ensuring the block index it is reading is the last one, which could be as simple as changing this line to an `rfind`.

It is also worth considering that technically a block index is not required (if I understand correctly), but a file without a block index might still contain something that looks like one due to uninitialized data. That seems harder to recover from, although requiring the index to be at the end of the file would make it much less likely to occur. Another option is ensuring that the index is never read from inside a block (as determined by that block's self-reported size).

There's also the question of whether ASDF could simply write out only the views referenced in the tree, rather than the whole underlying array. I assume the current behavior exists so that multiple views into the same array don't duplicate data, but I think it is a bit surprising for non-overlapping views. As a simple fix to catch most cases, maybe the parent arrays could be checked for multiple views, and if there is only one view, just that view's data could be written to disk?

On why blosc didn't touch the data, I think it must be because the data size fell below the block size we were using, so it left it uncompressed.
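A minimal sketch of the mechanism described above, assuming the writing behavior discussed in this thread. It is not deterministic: whether a stale index marker actually reappears depends on how the allocator reuses memory, so the file names and the outcome here are illustrative only.

```python
import asdf
import numpy as np

def write_partial(path):
    # np.empty does not zero memory, so the uninitialized tail of this
    # array may reuse pages that previously held a serialized block index.
    a = np.empty(10**5)
    a[:10] = 1.0
    af = asdf.AsdfFile({"arr": a[:10]})  # tree holds a view; the full base array gets written
    af.write_to(path)

write_partial("first.asdf")
write_partial("second.asdf")  # memory from the first write may be recycled here

# Count how many index markers ended up in each file.
for path in ("first.asdf", "second.asdf"):
    with open(path, "rb") as f:
        print(path, f.read().count(b"#ASDF BLOCK INDEX"))
```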
Wow, nice find!! What a horrible yet satisfying conclusion.
An `rfind` would certainly help, but the situation is still uncomfortable. We had some discussion last year about changes that might be made to the block index. I would very much like:
What we would have to give up is the ability to write ASDF files to non-seekable destinations, so, e.g., if a user were posting a file to a website, it would need to be written to memory or to disk first. That's because the index can't be assembled until we know the compressed size of each block. We do, however, know the number of blocks ahead of time, and can leave sufficient space for a binary index and seek backwards to fill it in after the blocks have been written.
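A rough sketch of that reserve-then-backfill idea. The header layout here (one 8-byte offset per block) is invented for illustration and is not the ASDF block index format; the point is only that a fixed-size slot can be reserved up front and filled in once the compressed sizes are known.

```python
import io
import struct
import zlib

def write_with_binary_index(fh, blocks):
    """Write compressed blocks, then seek back and fill in a fixed-size index.

    `fh` must be seekable. The index format is made up for this sketch.
    """
    n = len(blocks)
    index_pos = fh.tell()
    fh.write(b"\x00" * (8 * n))        # reserve n 8-byte offset slots
    offsets = []
    for block in blocks:
        offsets.append(fh.tell())
        fh.write(zlib.compress(block))  # compressed size only known here
    end = fh.tell()
    fh.seek(index_pos)                  # backfill the reserved index
    fh.write(struct.pack(f"<{n}Q", *offsets))
    fh.seek(end)

buf = io.BytesIO()
write_with_binary_index(buf, [b"a" * 1000, b"b" * 2000])
```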
There's been some email discussion over here of this surprise, and @perrygreenfield made the same suggestion to only write out the full base array if multiple views are present. I'd prefer that we disable this feature by default and figure out a way to allow expert users to opt in. That seems like a simpler story to explain and implement.
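As a user-level workaround today (plain numpy, not an asdf option), copying the view before putting it in the tree detaches it from the large base array, so only the viewed data is serialized:

```python
import asdf
import numpy as np

a = np.empty(10**5)
a[:10] = 1.0

# np.array(...) makes an independent copy, so there is no base buffer
# for asdf to serialize alongside the 10-element array.
af = asdf.AsdfFile({"arr": np.array(a[:10])})
af.write_to("view_only.asdf")
```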
I'll go ahead and open a PR with this (one-character) change, just to make this a little more robust while thinking about long-term fixes.
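The one-character change amounts to searching from the end of the buffer rather than the beginning; a toy illustration (the buffer contents here are contrived):

```python
buf = b"...block data #ASDF BLOCK INDEX (stale copy inside a block)...\n#ASDF BLOCK INDEX\n- 42\n"

# find() lands on the stale marker embedded in block data;
# rfind() lands on the real index at the end of the file.
print(buf.find(b"#ASDF BLOCK INDEX"))   # first occurrence
print(buf.rfind(b"#ASDF BLOCK INDEX"))  # last occurrence
```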
I suppose the blocks could all be compressed in memory to determine their sizes. It's a small distinction from "writing the files to memory first", although the compressed data could be discarded on the fly if only the size is needed (basically what the …). If the blocks are not compressed in memory first, then at least one streaming block could still be allowed; its size could be inferred from the file size (and it could also be recorded at the tail for robustness).
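A sketch of computing a block's compressed size without retaining the compressed bytes; zlib is used here purely as a stand-in for whatever compressor is actually configured:

```python
import zlib

def compressed_size(data, chunk_size=1 << 20):
    """Stream `data` through a compressor, keeping only the output length."""
    comp = zlib.compressobj()
    total = 0
    for start in range(0, len(data), chunk_size):
        total += len(comp.compress(data[start:start + chunk_size]))
    total += len(comp.flush())
    return total

print(compressed_size(b"\x00" * 10_000_000))
```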
When you say "this feature", do you mean the feature of writing the base arrays, or the feature of just writing the views?
Oops, I mean, disable the feature of writing the full base array. And then instead of implementing a set of rules that asdf follows to choose whether to write the base array or just the viewed portion, we ask users to explicitly select the views that they'd like serialized that way.
I'll go ahead and close this issue; I think the only outstanding thread is the opt-in mechanism for writing the base arrays of views. Is this already being tracked somewhere, or should I open an issue?
I don't think there's an issue for that yet...
This is a bit of a long shot, but I am working with a collaborator whose pipeline appears to occasionally produce corrupted ASDF files, and I'm trying to figure out how this is occurring. The symptom is that the files have multiple `#ASDF BLOCK INDEX` sections, and so ASDF fails to load the files (I'm pretty sure that having multiple block indices in a single file should never happen, but please correct me if I'm wrong!). Has anyone seen this failure mode before? I think it's likely that the ASDF library itself is not at fault, but instead that it's some kind of bad interaction between software components and/or the file system. The failure does not seem deterministic, but we are still debugging.

Here's a link to an example file; if you grep for `ASDF BLOCK`, you'll find there are two occurrences: https://drive.google.com/file/d/1L8X9V_1oiODgP1V-ZhKRJ-UcDmv6ZHR-/view?usp=sharing
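For reference, a small check along the lines of that grep (the file name here is hypothetical):

```python
# Locate every block-index marker in a suspect file
# (roughly equivalent to `grep -ob "ASDF BLOCK INDEX" corrupted.asdf`).
with open("corrupted.asdf", "rb") as f:
    data = f.read()

offsets = []
pos = data.find(b"#ASDF BLOCK INDEX")
while pos != -1:
    offsets.append(pos)
    pos = data.find(b"#ASDF BLOCK INDEX", pos + 1)

print(f"found {len(offsets)} block index marker(s) at offsets {offsets}")
```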