-
-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Debug mode performance improvements #77
base: master
Are you sure you want to change the base?
Debug mode performance improvements #77
Conversation
…nt8->Int16 conversions
…tor-based for loops with while loops (no release mode effect)
…ubscript with unsafeBitCast
…es.reverse with unsafeBitCasts
…an be used a stable dependency
@@ -131,16 +131,56 @@ extension PNG | |||
return (x ^ mask) + (mask & 1) | |||
} | |||
|
|||
let v:(Int16, Int16, Int16) = (.init(a), .init(b), .init(c)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let’s tackle this one improvement at a time :)
i am really surprised that integer conversions are a bottleneck here. have you raised this issue on the Swift Forums yet? to me this borders on a compiler bug, and i’d love to hear what some compiler engineers have to add here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah I haven't raised an issue. I have a feeling that this is kinda the intended behaviour of debug mode (although in my opinion it should be doing more basic optimisations like this which are so trivial). In release mode it compiles to one instruction but in debug mode it doesn't specialise the generic Int16
initialiser (which takes a T: BinaryInteger
), and ends up spending all its time reading generic metadata and stuff, which of course adds up to a lot of overhead when you're doing a lot of these generic operations per pixel (roughly).
The changes in this PR are kinda all just working around things that the compiler could pretty easily compile better in debug mode. The reason that I did it this way is that waiting for compiler changes is quite a long game. I'd be happy to start a discussion about debug mode performance on the Swift forums at some point but I'd also like to be able to have reasonable debug mode performance earlier than the compiler will evolve.
That reminds me, another toolchain improvement that I thought might help was if we had a way of specifying separate optimisation levels for specific dependencies of a target when building. That way I could just get users to always compile SwiftCrossUI with optimisations and then they can compile their app with whatever optimisation level they want and still have reasonable image loading times.
As some additional context. I've got a test app written with SwiftCrossUI that has 12 500x500 images (among other things). Originally they were JPGs, but the swift I'm thinking that I might update my |
Context
While developing swift-image-formats as part of the SwiftCrossUI project, I found the need to improve the performance of Swift PNG under debug mode. During my original testing I had only been using 16x16 test images. I discovered the issue when loading a 1544x596 png file took 4 seconds in debug mode. The reason that I need debug mode performance specifically is that SwiftCrossUI is distributed as a source code library so it gets compiled at the same optimisation level as the rest of an app's code; and it'd be unreasonable to expect all users of SwiftCrossUI to build their apps in release mode at all times. Ideally SwiftPM would allow users to override the default build configuration for specific dependencies, but until then I need to focus on debug mode performance.
For the large test image I used (see below) I managed to get loading the image down from 28.1 seconds to 4.16 seconds on average in debug mode. None of my code changes have affected release mode performance. In many cases I only applied my changes for debug mode, because in release mode the original code compiled better (because it had more semantic information for the compiler I guess).
During optimisation I took notes which I've included below with the hope that they can be of use to a future contributor to
swift-png
or a similar library which needs to improve its debug mode performance.Related to #26
Test image
I decided to up the ante and use quite a big image as my test image to get more samples in the profiler. This image is 3440x1440 pixels and its file is 10.5mb.
Measurement technique
I'm testing performance from a test within
swift-image-formats
. I'm measuring two things; Swift PNG's decoding code, andswift-image-formats
' pixel unpacking code which converts Swift PNG's pixel format toswift-image-formats
' RGBA format.Approach
First I'll attempt to completely eliminate the unpacking step by implementing Swift PNG's pixel format protocol for
swift-image-formats
'RGBA
type. Then I'll let the profiler guide me to slow code paths.Initial measurements for my original smaller image
Initial measurements
With unpacking fast-path
I implemented
PNG.Color
forswift-image-formats
'RGBA
type and added a short circuit which can be taken when the PNG file was already using the.rgba8
pixel format, in which case nothing has to be done.From now on I'll stop including the unpacking measurements since that's now essentially a no-op.
With optimised
PNG.paeth(_:_:_:)
In the previous profile, the relatively simple
PNG.paeth(_:_:_:)
function took up around 3.1 seconds total in debug mode (almost 10% of the total decoding time). Most of this time was spent convertingUInt8
values toInt16
due to parsing generic metadata and doing stack checks for something which should really just be one operation even in debug mode. I made a manual conversion implementation that seems to compile pretty well even in debug mode.PNG.paeth(_:_:_:)
now only takes about 440 milliseconds total.Although the numbers make it seem as if this change made the implementation slower in release mode, I double checked and the previous version is also running slower today so I think it's just caused by variations in conditions. However,
Int16(_:)
always compiles down to a single instruction in release mode so we may as well use that in release mode just in case.Manually implemented for loop for
LZ77.InflatorIn.rebase(_:pointer:)
About 70% of the 1.38 seconds total spent in this function was spent calling
IndexingIterator.next
on the range getting iterated over in a for loop. By using a while loop with a manual counter variable instead of a for loop, this overhead got completely eliminated in debug mode. The function now only takes up 309 milliseconds of the decoding process.Manually implemented for loops for
PNG.Decoder.defilter(_:last:delay:)
Exact same idea as previous improvement.
Manually implemented for loops for
LZ77.MRC32.update(from:count:)
Manually implemented for loop for
LZ77.InflatorOut.expand(offset:count:)
Now we're 2x faster than when I started!
Manually implemented for loop for
PNG.Image.assign(scanline:at:stride:)
Manually implemented
UInt16, UInt16
toUInt32
conversion forLZ77.InflatorIn.subscript
implementationsTODO: Is this code portable to big endian? Do we need it to be portable to big endian?
Manually implemented integer conversions for
LZ77.InflatorTables.reverse(_:)
We're now 3x faster in debug mode than when I started!
Manually implemented integer conversions for
LZ77.InflatorTables.composite(_:)
Manually implemented integer conversions for
LZ77.MRC32.update(from:count:)
Manually implemented integer conversions for
LZ77.InflatorBuffers.readBlock(with:)
Manually implemented integer conversions for
LZ77.RunLiteral.length
We're 4x faster in debug mode than when I started now!
Manually implemented integer conversions for
LZ77.InflatorIn.rebase(_:pointer:)
Manually implemented integer conversions for
LZ77.Distance
computed propertiesManually implemented integer conversions for
LZ77.RunLiteral.decade
Manually implemented integer conversions for
CRC32.update(with:)
(inswift-hash
)Manually specialised
CRC32.update(with:)
for[UInt8]
withoutreduce
Manually implemented
FixedWidthInteger.init(truncatingIfNeeded:)
forPNG.paeth