Optimize sanitizing www-form-urlencoded input #58

avit · 2020-11-14T07:40:35Z

Using gsub with block syntax is slow. This implementation uses lookup tables to
cache the mapping between encoded and decoded characters, since these are known
subsets / ranges, and handles replacement without the overhead of yielding to a
block.

Benchmarks for random payloads containing 20% urlencoded characters show a
considerable improvement:

| Payload size | 10 B | 100 B | 1 KB | 10 KB | 100 KB | 1 MB |
| --- | --- | --- | --- | --- | --- | --- |
| Before | 0.000900 | 0.000565 | 0.003071 | 0.033079 | 0.305175 | 3.015223 |
| After | 0.000836 | 0.000488 | 0.001445 | 0.011248 | 0.110888 | 1.107596 |

lib/rack/utf8_sanitizer.rb

whitequark · 2020-11-14T07:58:36Z

I don't understand the correctness or the performance properties of this PR well enough to be comfortable signing off on it.

avit · 2020-11-14T08:41:32Z

Hi @whitequark,

On correctness: I just reverted one change back to the sprintf implementation used in stdlib URI so that the tests pass. (My previous attempt was faster than sprintf, but its output was lowercase hex.)
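
To illustrate the casing difference, here is a minimal sketch. The helper names are hypothetical and are not taken from the PR. It compares an uppercase sprintf-style escape with a lowercase variant and with `URI.encode_www_form_component` from the standard library.

```ruby
require "uri"

# Hypothetical helpers for illustration only; not the PR's code.

# Uppercase hex, e.g. 0xC3 becomes "%C3".
def escape_byte_upper(byte)
  format("%%%02X", byte)
end

# Lowercase hex, e.g. 0xC3 becomes "%c3".
def escape_byte_lower(byte)
  "%" + byte.to_s(16).rjust(2, "0")
end

sample = "\u00E9" # "é", two bytes in UTF-8: 0xC3 0xA9
upper = sample.bytes.map { |b| escape_byte_upper(b) }.join
lower = sample.bytes.map { |b| escape_byte_lower(b) }.join

puts upper
puts lower
puts URI.encode_www_form_component(sample)
```

Both forms are valid percent-encoding, but a test that compares against the stdlib output byte for byte will only accept the uppercase one.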

This is mainly changing gsub(PATTERN, &block) into gsub(PATTERN, lazy_cached_lookup_table), so that every matched byte doesn't need to be re-encoded with new string objects in the yielded block.
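
A rough sketch of the two call styles, assuming a simple percent-decoding step. `ESCAPED`, `DECODE_TABLE`, and both method names are made up for this example and are not the identifiers used in the PR. The hash has a default block, so each distinct token is decoded once and then served from the table.

```ruby
# Hypothetical names; a sketch of the technique, not the PR's implementation.
ESCAPED = /%[0-9A-Fa-f]{2}/

# Lazily filled lookup table: the default block runs only the first time a
# given "%XX" token is seen, then the frozen result is reused.
DECODE_TABLE = Hash.new do |table, token|
  table[token] = token[1, 2].hex.chr.freeze
end

# Block form: the block is yielded to for every match.
def decode_with_block(input)
  input.b.gsub(ESCAPED) { |token| token[1, 2].hex.chr }
end

# Hash form: gsub looks each match up in the table instead of yielding.
def decode_with_table(input)
  input.b.gsub(ESCAPED, DECODE_TABLE)
end

# Both methods return binary strings; the caller picks the final encoding.
puts decode_with_table("name=J%C3%BCrgen%20M").force_encoding("UTF-8")
```

The table can hold at most 256 byte values times the hex casing variants, so it stays small regardless of payload size.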

I added a test for benchmarking: ruby test/bench_utf8_sanitizer.rb. Running it with RubyProf, with sample data that has a moderate proportion of encoded characters (20%), it shows the code spends over half of its time in reencode_string with many calls to sprintf.
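
For readers who want to reproduce a comparable measurement, a minimal sketch using the stdlib Benchmark module follows. It is not the content of test/bench_utf8_sanitizer.rb; `build_payload` and `time_decode` are hypothetical helpers, and treating "20%" as the share of generated tokens is an assumption.

```ruby
require "benchmark"

# Hypothetical helper: builds a payload of at least `size` bytes in which
# roughly `encoded_ratio` of the generated tokens are "%XX" escapes.
def build_payload(size, encoded_ratio: 0.2, rng: Random.new(42))
  out = +""
  while out.bytesize < size
    if rng.rand < encoded_ratio
      out << format("%%%02X", rng.rand(0x20..0x7E)) # escaped printable ASCII
    else
      out << rng.rand(97..122).chr                  # plain letter a-z
    end
  end
  out
end

# Hypothetical helper: average wall-clock seconds per block-style gsub pass.
def time_decode(payload, iterations: 10)
  total = Benchmark.realtime do
    iterations.times do
      payload.gsub(/%[0-9A-F]{2}/) { |token| token[1, 2].hex.chr }
    end
  end
  total / iterations
end

[10, 100, 1_000, 10_000].each do |size|
  payload = build_payload(size)
  puts format("%6d bytes: %.6f s", size, time_decode(payload))
end
```

Swapping the body of `time_decode` for the code under test gives a before/after comparison on identical payloads, since the seeded generator is deterministic.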

For a 100kb www-form-urlencoded payload, I'm measuring master branch at about 15ms per request, and these revisions at about 5ms.

I'm not yet convinced of this change since I just wrote it. I only want to share it to get more eyes on it & decide if it's worth pursuing further. I won't be sad if you decide to close it. 😄

sdhull · 2021-05-04T00:01:24Z

FWIW we have an endpoint that processes a massive form-encoded payload and just this middleware can take anywhere from 1.5s to 11s(!!) to process the request body. So we'd love this (or any performance-focused PR) to land. I can take a closer look later this week, but I'd be curious what it will take to get this merged & released @whitequark

whitequark · 2021-05-04T13:12:27Z

> I can take a closer look later this week

Please try running this PR in your environment, and if it works well, I'll merge it.

avit added 2 commits November 14, 2020 00:06

Revert to sprintf

0eb472e

This matches the implementation from ruby stdlib, and returns uppercase characters for hex.

Limit codepoint caching to max 3-byte sequences

7781530

This keeps the benefit of avoiding sprintf on common codepoints without ballooning the lookup table too much.
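
One way such a cap could look, as a sketch under assumptions: the constant, pattern, and method below are hypothetical and do not come from the commit. Characters of up to three UTF-8 bytes are cached after their first escape, while four-byte characters are escaped on every use and never stored.

```ruby
# Hypothetical names; illustrates a size-capped lazy cache, not the commit.
MAX_CACHED_BYTES = 3

ESCAPE_CACHE = Hash.new do |cache, char|
  escaped = char.bytes.map { |byte| format("%%%02X", byte) }.join
  # Store only characters that fit the cap; longer ones are recomputed.
  cache[char] = escaped if char.bytesize <= MAX_CACHED_BYTES
  escaped
end

# Assumed set of characters left unescaped, chosen for this example only.
UNSAFE = /[^A-Za-z0-9_.~-]/

def percent_escape(input)
  input.gsub(UNSAFE, ESCAPE_CACHE)
end

puts percent_escape("caf\u00E9")
puts ESCAPE_CACHE.size
```

With this cap the table can never grow beyond the characters of the Basic Multilingual Plane, which is the bound the commit message appears to be after.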
