
squashfs: fix file reading #197

Merged: deitch merged 8 commits into diskfs:master from fix-read-files on Dec 19, 2023
Conversation

@ncw ncw (Contributor) commented Dec 15, 2023

This is a series of patches to fix up the reading in squashfs.

The first commit introduces some new tests, and then the rest attempt to make those tests pass!

Update: all tests are working now!

  • squashfs: add tests to check corner cases of reading files
  • squashfs: use the blocksize of squashfs not that supplied by the user for decompression
  • squashfs: a zero sized chunk is to be interpreted as a full block of zeros
  • squashfs: fix block size list parsing
  • squashfs: fix file reading
  • squashfs: close the gz compressor to make sure we read CRC errors
  • squashfs: add check for unexpected block size

@ncw ncw marked this pull request as draft December 15, 2023 17:02
@ncw ncw changed the title from "fix read files" to "squashfs: fix file reading" Dec 15, 2023
ncw added 8 commits December 16, 2023 16:34
@ncw ncw (Contributor, Author) commented Dec 16, 2023

Update: all the tests are fixed now; it turned out to be a very simple fix.

This could be merged now so I'll unset the draft.

@ncw ncw marked this pull request as ready for review December 16, 2023 16:37
@deitch deitch (Collaborator) commented Dec 17, 2023

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

@deitch deitch (Collaborator) commented Dec 17, 2023

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

@ncw ncw (Contributor, Author) commented Dec 17, 2023

Do you mind describing in prose here what the issue was? I am reading it in the code, but a description here would be very helpful.

I noticed when testing the code that it had trouble reading most files. In fact, the only files it could read reliably were very short ones stored entirely in the fragments, or files whose size was an exact multiple of the block size. Those happen to be the types of files in file.sqs!

I decided to try to fix this. The first thing I did was make a new read_test.sqs with files of different sizes which I knew would exercise the different parts of the code (I updated it as I carried on with the debugging process).

This let me track down the problems with the code.

I made one commit per fix to try and keep the code reviewable.

commit e608eb1

squashfs: fix parsing of extended file inodes

Before this change it was accidentally using the size of the basic
file inode as the offset of the blocklist.

This was just a copy-paste bug, I think: the 16 got copy-pasted from the basic inode and needed to be changed to 40.
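For reference, a hedged sketch of the layout behind those two numbers (offsets taken from the squashfs on-disk format; the constant names are mine, not the repository's):

```go
const (
	// basic file inode body before the block-size list:
	// blocks_start, frag_index, block_offset, file_size (4 x u32) = 16 bytes
	basicFileInodeFixedSize = 16
	// extended file inode body before the block-size list:
	// blocks_start, file_size, sparse (3 x u64) plus
	// link_count, frag_index, block_offset, xattr_idx (4 x u32) = 40 bytes
	extendedFileInodeFixedSize = 40
)
```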

commit 13ce837

squashfs: add check for unexpected block size

This commit isn't strictly necessary; it doesn't fix anything, but it makes the code return a sensible error when things aren't as expected, which was the case quite a lot of the time while debugging.

commit d105557

squashfs: close the gz compressor to make sure we read CRC errors

This is as recommended by the Go docs. It doesn't fix anything; it was just something I noticed while reviewing the code.
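For the reading side this amounts to the usual compress/gzip pattern; a minimal sketch, not the repository's actual decompressor code:

```go
func gunzip(r io.Reader) ([]byte, error) { // uses "compress/gzip" and "io"
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	data, err := io.ReadAll(zr) // checksum errors surface once the reader hits EOF
	if err != nil {
		return nil, err
	}
	if err := zr.Close(); err != nil { // don't silently drop errors reported by Close
		return nil, err
	}
	return data, nil
}
```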

commit 11f9a4c

squashfs: fix file reading

- Fix array bounds errors by taking more care reading the correct data out of the buffers
- Fix potential 32/64 bit problems
- Fix off by one termination condition in loop
- Error if a fragment is expected but we don't have one (shouldn't happen)
- Error if we ever read no data (shouldn't happen)

This was the big fix. The main problem here was that this code gave array index errors if you tried to read something whose size wasn't an exact multiple of the block size. It went through quite a few iterations trying to get all the test cases to pass, but I think it is correct now 🤞

The main fix was to be very careful with exactly how much data was read from the input block. The fragment reading needed the same logic so I factored it into a closure.

This introduces the pos variable, which keeps track of the file position while finding the correct blocks to read data from. This is what enables the blocks to be clipped correctly and fixes the index errors.

I also added some defensive programming, which was useful when trying to make it work before I realised that the blocklist was actually wrong (see above) and this code was being passed corrupted data.

I fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.
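To make the clipping idea concrete, here is a simplified sketch of it (names and structure are illustrative, not the actual file.go code; it assumes the decompressed blocks are already in hand, with sparse blocks expanded as sketched further below):

```go
// Read length bytes starting at file offset off from a run of decompressed
// data blocks, each of which covers blockSize bytes of the file.
func readRange(blocks [][]byte, blockSize, off, length int64) []byte {
	out := make([]byte, 0, length)
	pos := int64(0) // file offset at which the current block starts
	for _, block := range blocks {
		if int64(len(out)) >= length {
			break // we already have everything that was asked for
		}
		cur := off + int64(len(out)) // next file offset we still need
		start := cur - pos           // where that offset falls inside this block
		if start >= 0 && start < int64(len(block)) {
			end := int64(len(block))
			if want := length - int64(len(out)); end-start > want {
				end = start + want // clip to the requested length
			}
			out = append(out, block[start:end]...)
		}
		pos += blockSize
	}
	return out
}
```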

commit 5bb1d38

squashfs: fix block size list parsing

- make sure we don't overflow int when using large files
- fix reading 4x too few blocks
- correct inverted logic on whether a fragment exists or not
- pass the calculated block sizes into the fetching routine

This was the other very important fix.

There were two main things wrong here. The number of blocks was being calculated incorrectly (the fragment-exists logic was back to front), and the code was using byte indexing when it should have been using uint32 indexing. The fix does the block-count calculation once and passes it down to parseFileBlockSizes rather than having it calculate the count again (which it was also doing wrong, but in a different way!).

I also fixed up some 32/64 bit problems to make sure that the code would work on architectures where int is 32 bits.
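A hedged sketch of those two rules (helper names are mine, not the repository's; uses "encoding/binary"):

```go
// With a tail-end fragment the file holds only complete blocks; without one,
// the tail is rounded up into a final partial block.
func blockCount(fileSize, blockSize uint64, hasFragment bool) uint64 {
	if hasFragment {
		return fileSize / blockSize
	}
	return (fileSize + blockSize - 1) / blockSize
}

// Each entry in the block-size list is a 4-byte little-endian uint32, so
// entry i lives at byte offset i*4 (uint32 indexing, not byte indexing).
func blockSizeEntry(list []byte, i int) uint32 {
	return binary.LittleEndian.Uint32(list[i*4 : i*4+4])
}
```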

commit 96796b1

squashfs: a zero sized chunk is to be interpreted as a full block of zeros

This is as specified in the specification.

This is mentioned in the spec: a zero-sized block should be treated as block_size worth of zeroes.

Sparse files are handled at block_size granularity. If an entire block is found to be full of zero bytes, the block isn't written to disk. Instead a size of zero is stored in the inode.
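On the reading side that rule boils down to something like this (a sketch with illustrative names, not the repository's code):

```go
// A stored block size of zero means the whole block was zeros and was never
// written, so the reader substitutes block_size bytes of zeros.
func expandBlock(data []byte, blockSize uint32) []byte {
	if len(data) == 0 {
		return make([]byte, blockSize) // sparse block
	}
	return data
}
```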

commit 4968323

squashfs: use the blocksize of squashfs not that supplied by the user for decompression

I noticed that it was using the wrong blocksize when using an external file. It needs to use the blocksize from the squashfs header, not the one the user passes in, otherwise everything goes wrong!
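In other words, something along these lines (field and method names are hypothetical, purely to illustrate the point):

```go
// Block arithmetic and decompression always use the block size recorded in
// the squashfs superblock, never a caller-supplied value.
func (fs *FileSystem) blockSizeForReads() uint32 {
	return fs.superblock.blockSize
}
```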

Also, is the logic in the comments right before the changes in squashfs/file.go still correct? Or does it need updating?

Yes it still works exactly like that.

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).
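One possible shape for that follow-up (hypothetical, not part of this PR):

```go
// Remember the last decompressed block in the open file handle so repeated
// sub-block reads don't decompress the same block over and over.
type blockCache struct {
	index int64  // block index currently cached, -1 when empty
	data  []byte // decompressed contents of that block
}
```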

@deitch deitch (Collaborator) commented Dec 19, 2023

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

@deitch deitch (Collaborator) left a review comment


Took me quite a bit to understand it (but a lot less than understanding the innards of squashfs 😆 ), but I see it now.

@ncw ncw (Contributor, Author) commented Dec 19, 2023

For a future improvement it would be sensible to cache the last uncompressed block in the file handle as unless you are reading at exactly block_size it decompresses the same data twice (or indeed many times).

Yeah, that would be good... for a future one. I think this one covers enough.

Yes, I'll do that as a follow-up. I think it will double the performance of file reading for relatively little cost (a small buffer per open file).

Thank you for the detailed explanation.

Based on how you described it, it looks like each commit actually could stand alone? If each were a separate PR, then the mainline branch would be valid at each given commit? No, I don't want separate PRs at this point, but if they really do stand alone, no point in doing a squash-and-merge, as I usually do.

Yes, each commit does stand alone. I tried to make the commits individually reviewable as I know how hard reviewing code is. They could be individual PRs, but merging this one without squashing has the same effect. (I developed this patch series in a completely different order, and when I was done fixing things I separated them out into logically separate fixes!)

Took me quite a bit to understand it (but a lot less than understanding the innards of squashfs 😆 ), but I see it now.

:-) I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

@deitch deitch (Collaborator) commented Dec 19, 2023

I can't say I understand the innards of squashfs completely, but I know more about it now than I did!

Well, if you feel like learning more about ext4 and inode trees, the help would be appreciated. The ext4 implementation has been slow going for a long time, largely blocked on some of the inode tree rebalancing, and I am sure there is more to do there. One of these lifetimes I will get someone who wants it enough to pay for a few months of consulting and finish it. I have been tempted to ask GPT-4 as well.

@deitch deitch merged commit 21cc751 into diskfs:master Dec 19, 2023
19 checks passed
@ncw ncw deleted the fix-read-files branch December 19, 2023 18:05