
squashfs: fix file reading #197

Merged: deitch merged 8 commits into diskfs:master from fix-read-files on Dec 19, 2023
Conversation

@ncw ncw (Contributor) commented Dec 15, 2023

This is a series of patches to fix up the reading in squashfs.

The first commit introduces some new tests, and then the rest attempt to make those tests pass!

Update: all tests are working now!

  • squashfs: add tests to check corner cases of reading files
  • squashfs: use the blocksize of squashfs not that supplied by the user for decompression
  • squashfs: a zero sized chunk is to be interpreted as a full block of zeros
  • squashfs: fix block size list parsing
  • squashfs: fix file reading
  • squashfs: close the gz compressor to make sure we read CRC errors
  • squashfs: add check for unexpected block size

@ncw ncw marked this pull request as draft December 15, 2023 17:02
@ncw ncw changed the title from "fix read files" to "squashfs: fix file reading" Dec 15, 2023
ncw added 8 commits December 16, 2023 16:34
@ncw ncw (Contributor, Author) commented Dec 16, 2023

Update: all the tests are fixed now; it turned out to be a very simple fix.

This could be merged now so I'll unset the draft.

@ncw ncw marked this pull request as ready for review December 16, 2023 16:37
@deitch deitch (Collaborator) commented Dec 17, 2023

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

@deitch deitch (Collaborator) commented Dec 17, 2023

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

@ncw ncw (Contributor, Author) commented Dec 17, 2023

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

I noticed when testing the code that it had trouble reading most files. In fact, the only files it could read reliably were very short ones stored entirely in the fragments, or files whose size was an exact multiple of the block size. Those happen to be the types of files in file.sqs!

I decided to try to fix this. The first thing I did was make a new read_test.sqs with files of different sizes which I knew would exercise the different parts of the code (I updated it as I carried on with the debugging process).

This let me track down the problems with the code.

I made one commit per fix to try and keep the code reviewable.

commit e608eb1

squashfs: fix parsing of extended file inodes

Before this change it was accidentally using the size of the basic
file inode as the offset of the blocklist.

This was just a copy-paste bug, I think: the 16 got copy-pasted from the basic inode and needed to be changed to 40.
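For reference, a hedged sketch of the layout behind those two numbers (offsets taken from the squashfs on-disk format; the constant names are mine, not the repository's):

```go
const (
	// basic file inode body before the block-size list:
	// blocks_start, frag_index, block_offset, file_size (4 x u32) = 16 bytes
	basicFileInodeFixedSize = 16
	// extended file inode body before the block-size list:
	// blocks_start, file_size, sparse (3 x u64) plus
	// link_count, frag_index, block_offset, xattr_idx (4 x u32) = 40 bytes
	extendedFileInodeFixedSize = 40
)
```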

commit 13ce837

squashfs: add check for unexpected block size

This commit isn't strictly necessary; it doesn't fix anything, but it makes the code return a sensible error when things aren't as expected, which was the case quite a lot of the time while debugging.

commit d105557

squashfs: close the gz compressor to make sure we read CRC errors

This is as recommended by the Go docs. It doesn't fix anything; it was just something I noticed while reviewing the code.
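For the reading side this amounts to the usual compress/gzip pattern; a minimal sketch, not the repository's actual decompressor code:

```go
func gunzip(r io.Reader) ([]byte, error) { // uses "compress/gzip" and "io"
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	data, err := io.ReadAll(zr) // checksum errors surface once the reader hits EOF
	if err != nil {
		return nil, err
	}
	if err := zr.Close(); err != nil { // don't silently drop errors reported by Close
		return nil, err
	}
	return data, nil
}
```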

commit 11f9a4c

squashfs: fix file reading

- Fix array bounds errors by taking more care reading the correct data out of the buffers
- Fix potential 32/64 bit problems
- Fix off by one termination condition in loop
- Error if a fragment is expected but we don't have one (shouldn't happen)
- Error if we ever read no data (shouldn't happen)

This was the big fix. The main problem here was that this code gave array index errors if you tried to read something whose size wasn't an exact multiple of the block size. It went through quite a few iterations trying to get all the test cases to pass, but I think it is correct now 🤞

The main fix was to be very careful with exactly how much data was read from the input block. The fragment reading needed the same logic so I factored it into a closure.

This introduces the pos variable, which keeps track of the file position while finding the correct blocks to read data from. This is what enables the blocks to be clipped correctly and fixes the index errors.

I also added some defensive programming, which was useful when trying to make it work before I realised that the blocklist was actually wrong (see above) and this code was being passed corrupted data.

I fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.
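To make the clipping idea concrete, here is a simplified sketch of it (names and structure are illustrative, not the actual file.go code; it assumes the decompressed blocks are already in hand, with sparse blocks expanded as sketched further below):

```go
// Read length bytes starting at file offset off from a run of decompressed
// data blocks, each of which covers blockSize bytes of the file.
func readRange(blocks [][]byte, blockSize, off, length int64) []byte {
	out := make([]byte, 0, length)
	pos := int64(0) // file offset at which the current block starts
	for _, block := range blocks {
		if int64(len(out)) >= length {
			break // we already have everything that was asked for
		}
		cur := off + int64(len(out)) // next file offset we still need
		start := cur - pos           // where that offset falls inside this block
		if start >= 0 && start < int64(len(block)) {
			end := int64(len(block))
			if want := length - int64(len(out)); end-start > want {
				end = start + want // clip to the requested length
			}
			out = append(out, block[start:end]...)
		}
		pos += blockSize
	}
	return out
}
```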

commit 5bb1d38

squashfs: fix block size list parsing

- make sure we don't overflow int when using large files
- fix reading 4x too few blocks
- correct inverted logic on whether a fragment exists or not
- pass the calculated block sizes into the fetching routine

This was the other very important fix.

There were two main things wrong here. The number of blocks was being calculated incorrectly (the fragment-exists logic was back to front), and the code was using byte indexing when it should have been using uint32 indexing. The fix does the block-count calculation once and passes it down to parseFileBlockSizes rather than having it calculate the count again (which it was also doing wrong, but in a different way!).

I also fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.
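A hedged sketch of those two rules (helper names are mine, not the repository's; uses "encoding/binary"):

```go
// With a tail-end fragment the file holds only complete blocks; without one,
// the tail is rounded up into a final partial block.
func blockCount(fileSize, blockSize uint64, hasFragment bool) uint64 {
	if hasFragment {
		return fileSize / blockSize
	}
	return (fileSize + blockSize - 1) / blockSize
}

// Each entry in the block-size list is a 4-byte little-endian uint32, so
// entry i lives at byte offset i*4 (uint32 indexing, not byte indexing).
func blockSizeEntry(list []byte, i int) uint32 {
	return binary.LittleEndian.Uint32(list[i*4 : i*4+4])
}
```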

commit 96796b1

squashfs: a zero sized chunk is to be interpreted as a full block of zeros

This is as specified in the specification.

This is mentioned in the spec: a zero-sized block should be treated as block_size worth of zeroes.

Sparse files are handled at block_size granularity. If an entire block is found to be full of zero bytes, the block isn't written to disk. Instead a size of zero is stored in the inode.
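On the reading side that rule boils down to something like this (a sketch with illustrative names, not the repository's code):

```go
// A stored block size of zero means the whole block was zeros and was never
// written, so the reader substitutes block_size bytes of zeros.
func expandBlock(data []byte, blockSize uint32) []byte {
	if len(data) == 0 {
		return make([]byte, blockSize) // sparse block
	}
	return data
}
```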

commit 4968323

squashfs: use the blocksize of squashfs not that supplied by the user for decompression

I noticed that it was using the wrong blocksize when using an external file. It needs to use the blocksize from the squashfs header, not the one the user passes in, otherwise everything goes wrong!
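In other words, something along these lines (field and method names are hypothetical, purely to illustrate the point):

```go
// Block arithmetic and decompression always use the block size recorded in
// the squashfs superblock, never a caller-supplied value.
func (fs *FileSystem) blockSizeForReads() uint32 {
	return fs.superblock.blockSize
}
```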

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

Yes it still works exactly like that.

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).
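One possible shape for that follow-up (hypothetical, not part of this PR):

```go
// Remember the last decompressed block in the open file handle so repeated
// sub-block reads don't decompress the same block over and over.
type blockCache struct {
	index int64  // block index currently cached, -1 when empty
	data  []byte // decompressed contents of that block
}
```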

@deitch deitch (Collaborator) commented Dec 19, 2023

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

@deitch deitch (Collaborator) left a review comment


Took me quite a bit to understand it (but a lot less than understanding the innards of squashfs 😆 ), but I see it now.

@ncw ncw (Contributor, Author) commented Dec 19, 2023

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Yes, I'll do that as a follow-up. I think it will double the performance of file reading for relatively little cost (a small buffer per open file).

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

Yes, each commit does stand alone. I tried to make the commits individually reviewable as I know how hard reviewing code is. They could be individual PRs, but merging this one without squashing has the same effect. (I developed this patch series in a completely different order, and when I was done fixing things I separated them out into logically separate fixes!)

Took me quite a bit to understand it (but a lot less than understanding the innards of squashfs 😆 ), but I see it now.

:-) I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

@deitch deitch (Collaborator) commented Dec 19, 2023

I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

Well, if you feel like learning more about ext4 and inode trees, the help would be appreciated. The ext4 implementation has been slow going for a long time, largely blocked on some of the inode tree rebalancing, and I am sure there is more to do there. One of these lifetimes I will get someone who wants it enough to pay for a few months of consulting and finish it. I have been tempted to ask GPT-4 as well.

@deitch deitch merged commit 21cc751 into diskfs:master Dec 19, 2023
19 checks passed
@ncw ncw deleted the fix-read-files branch December 19, 2023 18:05