Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize sanitizing www-form-urlencoded input #58

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

avit
Copy link

@avit avit commented Nov 14, 2020

Using gsub with block syntax is slow. This implementation uses lookup tables to
cache the mapping of encoded/decoded characters since these are known subsets /
ranges, and handle replacement in a way that doesn't have any overhead for
yielding.

Benchmarks for random payloads containing 20% urlencoded characters show a
considerable improvement:

Before:

0.000900	 0.000565	 0.003071	 0.033079	 0.305175	 3.015223

After:

0.000836	 0.000488	 0.001445	 0.011248	 0.110888	 1.107596

Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.

Using gsub with block syntax is slow. This implementation uses lookup tables to
cache the mapping of encoded/decoded characters since these are known subsets /
ranges, and handle replacement in a way that doesn't have any overhead for
yielding.

Benchmarks for random payloads containing 20% urlencoded characters show a
considerable improvement:

Before:
0.000900	 0.000565	 0.003071	 0.033079	 0.305175	 3.015223

After:
0.000836	 0.000488	 0.001445	 0.011248	 0.110888	 1.107596

Tested data payload sizes are 10b, 100b, 1kb, 10kb, 100kb, 1mb.
@whitequark
Copy link
Owner

I don't understand the correctness or the performance properties of this PR to be comfortable signing off on it.

This matches the implementation from ruby stdlib, and returns uppercase
characters for hex.
This keeps the benefit of avoiding sprintf on common codepoints without
ballooning the lookup table too much.
@avit
Copy link
Author

avit commented Nov 14, 2020

Hi @whitequark,

On correctness—I just reverted one change back to the sprintf implementation used in stdlib URI for the tests to pass. (My previous attempt was faster than sprintf, but the output was lowercase hex.)

This is mainly changing gsub(PATTERN, &block) into gsub(PATTERN, lazy_cached_lookup_table), so that every matched byte doesn't need to be re-encoded with new string objects in the yielded block.

I added a test for benchmarking: ruby test/bench_utf8_sanitizer.rb. Running it with RubyProf, with sample data that has a moderate proportion of encoded characters (20%), it shows the code spends over half of its time in reencode_string with many calls to sprintf.

For a 100kb www-form-urlencoded payload, I'm measuring master branch at about 15ms per request, and these revisions at about 5ms.

I'm not yet convinced of this change since I just wrote it. I only want to share it to get more eyes on it & decide if it's pursuing further. I won't be sad if you decide to close it. 😄

@sdhull
Copy link

sdhull commented May 4, 2021

FWIW we have an endpoint that processes a massive form-encoded payload and just this middleware can take anywhere from 1.5s to 11s(!!) to process the request body. So we'd love this (or any performance-focused PR) to land. I can take a closer look later this week, but I'd be curious what it will take to get this merged & released @whitequark

@whitequark
Copy link
Owner

I can take a closer look later this week

Please try running this PR in your environment, and if it works well, I'll merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants