Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic Scrubber is mangling certain symbols and encodings #13

Open
Gordonei opened this issue Sep 14, 2020 · 2 comments
Open

Basic Scrubber is mangling certain symbols and encodings #13

Gordonei opened this issue Sep 14, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@Gordonei
Copy link
Contributor

As reported by @ColinAnthony :

# bad character (&)
scrub id = '55737b7a-5d56-4473-9ae9-94b797a5fad2'
encoded_string = 'Oude Molen Road & Ndabeni'
scrubbed_result = "Oude Molen Road, Western Cape, South Africa"

# bad encoding (byte string)
scrub id = '5c2cae1d-d3b8-410e-8a04-e26836004b70'
encoded_string = b'Oude Molen Road and Ndabeni'
scrubbed_result = "b\'Oude Molen Road and Ndabeni\', Western Cape, South Africa"

# good result
scrub id = '3e0d4bff-91c5-46ef-937c-48e03beb2db1'
encoded_string = 'Oude Molen Road and Ndabeni'
scrubbed_result = "Oude Molen Road and Ndabeni, Western Cape, South Africa"
@Gordonei Gordonei added the bug Something isn't working label Sep 14, 2020
@Gordonei
Copy link
Contributor Author

Gordonei commented Sep 19, 2020

Looking at the first problem - substring being stripped off:

The first case is a bit of mystery, as it's not clear from the scrubber code why it would strip off the & Nabeni bit of the string.

Looking at the server logs (looks like the scrub request ID is actually 9eb6825e-c71a-4f3d-be46-00b5e74084fc):

[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] authorization_controller.check_basicAuth [DEBUG]: Checking 'city-user'...
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] authorization_controller.check_basicAuth [DEBUG]: user_auth_check='True'
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] scrub_controller.scrub [INFO]: Scrubb[ing]...
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] scrub_controller.scrub [DEBUG]: address='Oude Molen Road '
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] BasicScrubber.scrub [DEBUG]: Received address 'Oude Molen Road '
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] BasicScrubber.scrub [DEBUG]: address after strip: 'Oude Molen Road'
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] BasicScrubber.scrub [DEBUG]: address after value injection: Oude Molen Road, Western Cape, South Africa
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] scrub_controller.scrub [DEBUG]: scrubbed_values=
['Oude Molen Road, Western Cape, South Africa']
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:9eb6825e-c71a-4f3d-be46-00b5e74084fc] scrub_controller.scrub [INFO]: ...Scrubb[ed]!
[2020-09-08T09:39:13+0000]-[PID:1]-[RID:NA] _internal._log [INFO]: 172.31.30.156 - - [08/Sep/2020 09:39:13] "GET /v1/scrub?address=Oude%20Molen%20Road%20&%20Ndabeni HTTP/1.1" 200 -

So the scrubber never saw the & Ndabeni substring at all. Looking at the server log line (the last one), you can see that the & is plainly in the URL.

This unfortunately looks like this has to be handled on the client side - the & is in the URL string, so it's assumed to be a new URL parameter.

One fix would be to be properly URL-ify the string. The other is to use the official python client (TM).

These fixes should be tested before this issue should be closed.

@Gordonei
Copy link
Contributor Author

Second problem - byte strings resulting in strange escaped quotes being injected.

Request ID for scrub is actually a517617d-133b-4fc0-b990-1265e54afad7:

[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] authorization_controller.check_basicAuth [DEBUG]: Checking 'city-user'...
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] authorization_controller.check_basicAuth [DEBUG]: user_auth_check='True'
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] scrub_controller.scrub [INFO]: Scrubb[ing]...
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] scrub_controller.scrub [DEBUG]: address='b'Oude Molen Road and Ndabeni''
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] BasicScrubber.scrub [DEBUG]: Received address 'b'Oude Molen Road and Ndabeni''
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] BasicScrubber.scrub [DEBUG]: address after strip: 'b'Oude Molen Road and Ndabeni''
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] BasicScrubber.scrub [DEBUG]: address after value injection: b'Oude Molen Road and Ndabeni', Western Cape, South Africa
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] scrub_controller.scrub [DEBUG]: scrubbed_values=
["b'Oude Molen Road and Ndabeni', Western Cape, South Africa"]
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:a517617d-133b-4fc0-b990-1265e54afad7] scrub_controller.scrub [INFO]: ...Scrubb[ed]!
[2020-09-08T09:42:07+0000]-[PID:1]-[RID:NA] _internal._log [INFO]: 172.31.30.156 - - [08/Sep/2020 09:42:07] "GET /v1/scrub?address=b'Oude%20Molen%20Road%20and%20Ndabeni' HTTP/1.1" 200 -

Mmm, again, looks like the issue is coming from the client side - as far as the server is concerned, it's getting the string "b'Oude Molen Road and Ndabeni'". I presume that the byte string is getting cast to a unicode string without encoding (repr(b"test string") vs b"test_string".decode() on the client side.

One pretty dangerous option to handle this on the server would be to evaluate the string (using exec) - this would allow for arbitrary code injection. A safer one would be to just add a regex into the server's deserialisation, to automatically strip out what looks like a byte string - something along the lines of ^b['"](.*)['"]$ addresses the failing case above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant