Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug mode performance improvements #77

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

stackotter
Copy link
Contributor

Context

While developing swift-image-formats as part of the SwiftCrossUI project, I found the need to improve the performance of Swift PNG under debug mode. During my original testing I had only been using 16x16 test images. I discovered the issue when loading a 1544x596 png file took 4 seconds in debug mode. The reason that I need debug mode performance specifically is that SwiftCrossUI is distributed as a source code library so it gets compiled at the same optimisation level as the rest of an app's code; and it'd be unreasonable to expect all users of SwiftCrossUI to build their apps in release mode at all times. Ideally SwiftPM would allow users to override the default build configuration for specific dependencies, but until then I need to focus on debug mode performance.

For the large test image I used (see below) I managed to get loading the image down from 28.1 seconds to 4.16 seconds on average in debug mode. None of my code changes have affected release mode performance. In many cases I only applied my changes for debug mode, because in release mode the original code compiled better (because it had more semantic information for the compiler I guess).

During optimisation I took notes which I've included below with the hope that they can be of use to a future contributor to swift-png or a similar library which needs to improve its debug mode performance.

Related to #26


Test image

I decided to up the ante and use quite a big image as my test image to get more samples in the profiler. This image is 3440x1440 pixels and its file is 10.5mb.

wide_screenshot

Measurement technique

I'm testing performance from a test within swift-image-formats. I'm measuring two things; Swift PNG's decoding code, and swift-image-formats' pixel unpacking code which converts Swift PNG's pixel format to swift-image-formats' RGBA format.

Approach

First I'll attempt to completely eliminate the unpacking step by implementing Swift PNG's pixel format protocol for swift-image-formats' RGBA type. Then I'll let the profiler guide me to slow code paths.

Initial measurements for my original smaller image

  • Debug: 2.88 seconds to load, 0.676 seconds to unpack
  • Release: 0.00934 seconds to load, 0.00520 seconds to unpack

Initial measurements

  • Debug: 28.1 seconds to load, 3.80 seconds to unpack
  • Release: 0.168 seconds to load, 0.0280 seconds to unpack

With unpacking fast-path

I implemented PNG.Color for swift-image-formats' RGBA type and added a short circuit which can be taken when the PNG file was already using the .rgba8 pixel format, in which case nothing has to be done.

  • Debug: 28.0 seconds to load, 0.00000095 seconds to unpack
  • Release: 0.165 seconds to load, 0.00000191 seconds to unpack
    From now on I'll stop including the unpacking measurements since that's now essentially a no-op.

With optimised PNG.paeth(_:_:_:)

In the previous profile, the relatively simple PNG.paeth(_:_:_:) function took up around 3.1 seconds total in debug mode (almost 10% of the total decoding time). Most of this time was spent converting UInt8 values to Int16 due to parsing generic metadata and doing stack checks for something which should really just be one operation even in debug mode. I made a manual conversion implementation that seems to compile pretty well even in debug mode. PNG.paeth(_:_:_:) now only takes about 440 milliseconds total.

  • Debug: 25.9 seconds
  • Release: 0.175 seconds
    Although the numbers make it seem as if this change made the implementation slower in release mode, I double checked and the previous version is also running slower today so I think it's just caused by variations in conditions. However, Int16(_:) always compiles down to a single instruction in release mode so we may as well use that in release mode just in case.

Manually implemented for loop for LZ77.InflatorIn.rebase(_:pointer:)

About 70% of the 1.38 seconds total spent in this function was spent calling IndexingIterator.next on the range getting iterated over in a for loop. By using a while loop with a manual counter variable instead of a for loop, this overhead got completely eliminated in debug mode. The function now only takes up 309 milliseconds of the decoding process.

  • Debug: 24.3 seconds
  • Release: 0.170 seconds (probably just variation in testing conditions)

Manually implemented for loops for PNG.Decoder.defilter(_:last:delay:)

Exact same idea as previous improvement.

  • Debug: 20.4 seconds
  • Release: 0.171 seconds

Manually implemented for loops for LZ77.MRC32.update(from:count:)

  • Debug: 16.3 seconds
  • Release: 0.174 seconds

Manually implemented for loop for LZ77.InflatorOut.expand(offset:count:)

Now we're 2x faster than when I started!

  • Debug: 12.13 seconds
  • Release: 0.171 seconds

Manually implemented for loop for PNG.Image.assign(scanline:at:stride:)

  • Debug: 11.44 seconds
  • Release: 0.171 seconds

Manually implemented UInt16, UInt16 to UInt32 conversion for LZ77.InflatorIn.subscript implementations

TODO: Is this code portable to big endian? Do we need it to be portable to big endian?

  • Debug: 10.45 seconds
  • Release: 0.172 seconds

Manually implemented integer conversions for LZ77.InflatorTables.reverse(_:)

We're now 3x faster in debug mode than when I started!

  • Debug: 9.15 seconds
  • Release: 0.173 seconds

Manually implemented integer conversions for LZ77.InflatorTables.composite(_:)

  • Debug: 8.37 seconds
  • Release: 0.170 seconds

Manually implemented integer conversions for LZ77.MRC32.update(from:count:)

  • Debug: 7.85 seconds
  • Release: 0.173 seconds

Manually implemented integer conversions for LZ77.InflatorBuffers.readBlock(with:)

  • Debug: 7.36 seconds
  • Release: 0.172 seconds

Manually implemented integer conversions for LZ77.RunLiteral.length

We're 4x faster in debug mode than when I started now!

  • Debug: 6.716 seconds
  • Release: 0.171 seconds

Manually implemented integer conversions for LZ77.InflatorIn.rebase(_:pointer:)

  • Debug: 6.32 seconds
  • Release: 0.171 seconds

Manually implemented integer conversions for LZ77.Distance computed properties

  • Debug: 5.72 seconds
  • Release: 0.171 seconds

Manually implemented integer conversions for LZ77.RunLiteral.decade

  • Debug: 5.45 seconds
  • Release: 0.171 seconds

Manually implemented integer conversions for CRC32.update(with:) (in swift-hash)

  • Debug: 4.96 seconds
  • Release: 0.171 seconds

Manually specialised CRC32.update(with:) for [UInt8] without reduce

  • Debug: 4.45 seconds
  • Release: 0.171 seconds

Manually implemented FixedWidthInteger.init(truncatingIfNeeded:) for PNG.paeth

  • Debug: 4.16 seconds
  • Release: 0.171 seconds

@@ -131,16 +131,56 @@ extension PNG
return (x ^ mask) + (mask & 1)
}

let v:(Int16, Int16, Int16) = (.init(a), .init(b), .init(c))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let’s tackle this one improvement at a time :)

i am really surprised that integer conversions are a bottleneck here. have you raised this issue on the Swift Forums yet? to me this borders on a compiler bug, and i’d love to hear what some compiler engineers have to add here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah I haven't raised an issue. I have a feeling that this is kinda the intended behaviour of debug mode (although in my opinion it should be doing more basic optimisations like this which are so trivial). In release mode it compiles to one instruction but in debug mode it doesn't specialise the generic Int16 initialiser (which takes a T: BinaryInteger), and ends up spending all its time reading generic metadata and stuff, which of course adds up to a lot of overhead when you're doing a lot of these generic operations per pixel (roughly).

The changes in this PR are kinda all just working around things that the compiler could pretty easily compile better in debug mode. The reason that I did it this way is that waiting for compiler changes is quite a long game. I'd be happy to start a discussion about debug mode performance on the Swift forums at some point but I'd also like to be able to have reasonable debug mode performance earlier than the compiler will evolve.

That reminds me, another toolchain improvement that I thought might help was if we had a way of specifying separate optimisation levels for specific dependencies of a target when building. That way I could just get users to always compile SwiftCrossUI with optimisations and then they can compile their app with whatever optimisation level they want and still have reasonable image loading times.

@stackotter
Copy link
Contributor Author

As some additional context. I've got a test app written with SwiftCrossUI that has 12 500x500 images (among other things). Originally they were JPGs, but the swift jpeg library takes over 20 seconds to load all of those images in debug mode (and still about 3 seconds in release mode). I converted them to PNGs, and swift-png took around 4 seconds to load them all in debug mode. And then I converted them to WEBPs, and they loaded basically instantly in both debug and release mode (I'm using libwebp written in C).

I'm thinking that I might update my swift-image-formats library to use C libraries for jpg and png by default with an option to use swift-png and jpeg for ultimate portability by setting an environment variable (or a feature flag if/when SwiftPM gets those). And then there won't be as much pressure on improving the debug mode performance of swift-png and jpeg in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants