
Multiple ASDF block indices in file? #989

Closed
lgarrison opened this issue May 4, 2021 · 7 comments

@lgarrison (Contributor)

This is a bit of a long shot, but I am working with a collaborator whose pipeline appears to occasionally produce corrupted ASDF files, and I'm trying to figure out how this is occurring. The symptom is that the files have multiple #ASDF BLOCK INDEX sections, and so ASDF fails to load the files (I'm pretty sure that having multiple block indices in a single file should never happen, but please correct me if I'm wrong!). Has anyone seen this failure mode before? I think it's likely that the ASDF library itself is not at fault, but instead it's some kind of bad interaction between software components and/or the file system. The failure does not seem deterministic, but we are still debugging.

Here's a link to an example file; if you grep for ASDF BLOCK, you'll find there are two occurrences: https://drive.google.com/file/d/1L8X9V_1oiODgP1V-ZhKRJ-UcDmv6ZHR-/view?usp=sharing

@eslavich (Contributor) commented May 5, 2021

Strange... I haven't seen this before. The extra #ASDF BLOCK INDEX occurs in the middle of the first block's decompressed data:

In [22]: b"ASDF BLOCK INDEX" in af.blocks.get_block(0).data.tobytes()
Out[22]: True

but it also occurs in the compressed data in the file? So blosc must not be modifying those bytes?

The checksum of the decompressed data matches:

In [24]: af.blocks.get_block(0).validate_checksum()
Out[24]: True

which implies that the block index was in the data before the file was written?

@lgarrison (Contributor, Author)

Okay, I figured it out. You put me on the right track by observing that the extra BLOCK INDEX must have been in the data before the file was written, given the matching checksum. The bug occurs when an uninitialized Numpy array contains a recycled region of memory that used to hold a block index. In this case, we were slicing the Numpy array up to the last initialized value before adding it to the tree, but because ASDF writes the whole underlying array, not just the view, the uninitialized region ended up in the file too. Like so:

import asdf
import numpy as np

a = np.empty(10**5)
a[:10] = 1.  # initialize only the first 10 elements
af = asdf.AsdfFile(dict(arr=a[:10]))  # only the first 10 elements are in the tree...
af.write_to('a.asdf')  # ...but all 10^5 elements, including uninitialized memory, are written
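
(As a workaround sketch until a fix lands: copying the slice gives it a fresh base buffer, so asdf has nothing extra to serialize. This relies only on NumPy's copy semantics, not on any asdf API.)

import asdf
import numpy as np

a = np.empty(10**5)
a[:10] = 1.
# .copy() allocates a new 10-element array with no oversized base,
# so only initialized data reaches the file.
af = asdf.AsdfFile(dict(arr=a[:10].copy()))
af.write_to('a.asdf')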

Executing the pattern in the first snippet multiple times in a single process runs the risk of recycling a region of memory that used to hold a block index, and this can produce un-openable ASDF files if the stale index lands within one file system page of the real one. Here's an example: https://drive.google.com/file/d/1DJr2qkdC2Qm3tos1cigmbUK8ngFzha5K/view?usp=sharing

While this is a pretty wacky bug, I also think the mechanism is pretty generic. I think the ASDF block index detection could be made more robust, perhaps by ensuring the block index it is reading is the last one, which could be as simple as changing this line to an rfind: https://github.com/asdf-format/asdf/blob/master/asdf/block.py#L458
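
A standalone sketch of that idea (hypothetical helper, not the actual block.py code): scanning from the end with rfind picks the real index over any stale copy embedded in block data.

def find_block_index_offset(buff: bytes) -> int:
    # rfind returns the offset of the *last* occurrence of the header,
    # so a stale "#ASDF BLOCK INDEX" recycled into earlier block data
    # is ignored; -1 means no index was found at all.
    return buff.rfind(b"#ASDF BLOCK INDEX")

data = b"stale #ASDF BLOCK INDEX inside a block" + b"\x00" * 16 + b"#ASDF BLOCK INDEX\n- 42"
assert find_block_index_offset(data) > data.find(b"#ASDF BLOCK INDEX")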

It is also worth considering that technically a block index is not required (if I understand correctly), yet a file written without one might still contain a stray index header from uninitialized data. That seems harder to recover from, although requiring the index to be at the end of the file would make it much less likely to occur. Another option is ensuring that the index is never read from inside a block (as determined by that block's self-reported size).
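
A hedged sketch of that last option, using a hypothetical stand-in for asdf's block records: reject any candidate index offset that falls before the end of a block's self-reported extent.

from collections import namedtuple

Block = namedtuple("Block", "offset size")  # hypothetical stand-in for a block record

def index_is_after_blocks(index_offset, blocks):
    # A genuine block index must begin past the end of every block,
    # as computed from each block's self-reported offset and size.
    return all(index_offset >= b.offset + b.size for b in blocks)

blocks = [Block(0, 1024), Block(1024, 2048)]
print(index_is_after_blocks(3072, blocks))  # True: starts past the last block
print(index_is_after_blocks(500, blocks))   # False: inside the first block's extent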

There's also the question of whether ASDF could write out only the views referenced in the tree, rather than the whole underlying array. I assume the current behavior exists so that multiple views into the same array don't duplicate data, but I think it is a bit surprising for non-overlapping views. As a simple fix to catch most cases, maybe the parent arrays could be checked for multiple views, and if there is only one view, just that view's data could be written to disk?
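
A sketch of how that check might look (hypothetical helper; not how asdf's serializer is actually structured): group candidate arrays by the identity of their base buffer, and fall back to writing the base only when it is shared.

import numpy as np

def choose_write_targets(arrays):
    # Group views by the identity of their underlying base array.
    groups = {}
    for arr in arrays:
        base = arr.base if arr.base is not None else arr
        groups.setdefault(id(base), (base, []))[1].append(arr)
    targets = []
    for base, views in groups.values():
        if len(views) == 1:
            targets.append(views[0])  # lone view: write just its data
        else:
            targets.append(base)      # shared base: write once to avoid duplication
    return targets

a = np.empty(10**5)
print(choose_write_targets([a[:10]])[0].size)  # 10, not 10**5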

As for why blosc didn't touch the data: I think the data size fell below the block size we were using, so blosc stored those bytes uncompressed.

@eslavich (Contributor) commented May 6, 2021

> The bug occurs when an uninitialized Numpy array contains a recycled region of memory that used to hold a block index.

Wow, nice find!! What a horrible yet satisfying conclusion.

> While this is a pretty wacky bug, I also think the mechanism is pretty generic. I think the ASDF block index detection could be made more robust, perhaps by ensuring the block index it is reading is the last one, which could be as simple as changing this line to an rfind: https://github.com/asdf-format/asdf/blob/master/asdf/block.py#L458

An rfind would certainly help, but the situation is still uncomfortable. We had some discussion last year on changes that might be made to the block index:

asdf-format/asdf-standard#266

I would very much like:

  • a binary block index instead of yaml, so parsing it isn't so slow and the size of the index is predictable
  • a block index header with character sequences illegal in yaml, so we can safely skip ahead to it without parsing the tree
  • to make the block index required so software can rely upon it
  • to position it ahead of the blocks so that there's no question that the header sequence is the correct one

What we would have to give up is the ability to write ASDF files to non-seekable destinations, so e.g. if a user were posting to a website, the file would need to be written to memory or to disk first. That's because the index can't be assembled until we know the compressed size of each block. We do, however, know the number of blocks ahead of time, and could leave sufficient space for a binary index and seek backwards to fill it in after the blocks have been written.
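
A minimal sketch of that reserve-and-backfill approach, assuming a made-up fixed-width binary index (one little-endian u64 offset per block; not the ASDF standard's actual layout):

import io
import struct

def write_blocks_with_index(fp, compressed_blocks):
    # The block count is known up front, so reserve a fixed-width
    # slot for the index even though compressed sizes are not yet known.
    index_pos = fp.tell()
    fp.write(b"\x00" * (8 * len(compressed_blocks)))  # one u64 offset per block
    offsets = []
    for block in compressed_blocks:
        offsets.append(fp.tell())
        fp.write(block)
    end = fp.tell()
    fp.seek(index_pos)  # seek backwards and fill in the real offsets
    for off in offsets:
        fp.write(struct.pack("<Q", off))
    fp.seek(end)

buf = io.BytesIO()
write_blocks_with_index(buf, [b"block-a", b"block-bb"])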

> There's also the question of whether ASDF could write out only the views referenced in the tree, rather than the whole underlying array. I assume the current behavior exists so that multiple views into the same array don't duplicate data, but I think it is a bit surprising for non-overlapping views.

There's been some email discussion of this surprise as well, and @perrygreenfield made the same suggestion to only write out the full base array if multiple views are present. I'd prefer that we disable this feature by default and figure out a way to allow expert users to opt in. That seems like a simpler story to explain and implement.

@lgarrison (Contributor, Author) commented May 6, 2021

> An rfind would certainly help, but the situation is still uncomfortable.

I'll go ahead and open a PR with this (one-character) change, just to make this a little more robust while thinking about long-term fixes.

> What we would have to give up is the ability to write ASDF files to non-seekable destinations, so e.g. if a user were posting to a website, the file would need to be written to memory or to disk first. That's because the index can't be assembled until we know the compressed size of each block.

I suppose the blocks could all be compressed in memory first to determine their sizes. It's a small distinction from "writing the files to memory first", since the compressed data could be discarded on the fly if only the size is needed (basically what the get_compressed_size() function does now). That would allow writing to non-seekable destinations as long as the blocks are already in memory.
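
A sketch of that size-only pass, assuming the python-blosc bindings (the chunking scheme here is made up, not asdf's):

import blosc

def compressed_size_only(data, chunk_size=1 << 22):
    # Compress chunk by chunk, keeping only a running total; each
    # chunk's compressed bytes are dropped as soon as they're measured.
    total = 0
    view = memoryview(data)
    for start in range(0, len(view), chunk_size):
        total += len(blosc.compress(bytes(view[start:start + chunk_size]), typesize=1))
    return total

print(compressed_size_only(b"\x00" * (10 * 1024 * 1024)))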

If the blocks are not already in memory, then at least one streaming block could still be allowed; its size could be inferred from the file size (and recorded at the tail for robustness).

> I'd prefer that we disable this feature by default and figure out a way to allow expert users to opt in.

When you say "this feature", do you mean the feature of writing the base arrays, or the feature of just writing the views?

lgarrison added a commit to lgarrison/asdf that referenced this issue May 6, 2021
@eslavich (Contributor) commented May 6, 2021

> When you say "this feature", do you mean the feature of writing the base arrays, or the feature of just writing the views?

Oops, I mean disable the feature of writing the full base array. Then, instead of implementing a set of rules that asdf follows to choose whether to write the base array or just the viewed portion, we'd ask users to explicitly select the views they'd like serialized that way.

@lgarrison (Contributor, Author)

I'll go ahead and close this issue; I think the only outstanding thread is the opt-in mechanism for writing the base arrays of views. Is this already being tracked somewhere, or should I open an issue?

@eslavich (Contributor) commented Jun 9, 2021

I don't think there's an issue for that yet...
