Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow customization of decode error handling in MultiPartParser #3009

Open
glumia opened this issue Nov 16, 2024 · 1 comment
Open

Allow customization of decode error handling in MultiPartParser #3009

glumia opened this issue Nov 16, 2024 · 1 comment

Comments

@glumia
Copy link

glumia commented Nov 16, 2024

Hello werkzeug team!

I would like to get some feedback on the possibility to rework a bitFormDataParser and MultiPartParser so that one can easily configure the decode error handling strategy that's currently hard-coded as "replace" here.

The reason I come with this proposal is that at work we integrate with a partner that sends us multipart/form-data requests that do not carry any encoding information for the fields of the payload. Instead, they add another field in the payload itself that contains the charset information for the others (see here for an example).1

This forces us to do a first pass where we parse all the content of the request and then, when we got the charset information, we re-encode all the fields that were not encoded in utf-8 and decode them with the proper charset.2

The problem is that if the error handling strategy is set to "replace", we get some data loss for any characters that utf-8 isn't able to decode.

What we need in this case is to use "surrogateescape", and to achieve that right now we're forced to reimplement the whole FormDataParser._parse_multipart and MultiPartParser.parse for what's essentially a one-line change.


Eventually, these are roughly the changes I had in my mind (the diff is with 7868bef, note that we would need to move the declaration of FormParser after MultiPartParser but I didn't do it here to not bloat the diff):

diff --git a/src/werkzeug/formparser.py b/src/werkzeug/formparser.py
index 01034149..591a9317 100644
--- a/src/werkzeug/formparser.py
+++ b/src/werkzeug/formparser.py
@@ -167,6 +167,8 @@ class FormDataParser:
     .. versionadded:: 0.8
     """
 
+    multipart_parser_cls = MultiPartParser
+
     def __init__(
         self,
         stream_factory: TStreamFactory | None = None,
@@ -253,7 +255,7 @@ class FormDataParser:
         content_length: int | None,
         options: dict[str, str],
     ) -> t_parse_result:
-        parser = MultiPartParser(
+        parser = self.multipart_parser_cls(
             stream_factory=self.stream_factory,
             max_form_memory_size=self.max_form_memory_size,
             max_form_parts=self.max_form_parts,
@@ -290,6 +292,8 @@ class FormDataParser:
 
 
 class MultiPartParser:
+    decode_error_handling = "replace"
+
     def __init__(
         self,
         stream_factory: TStreamFactory | None = None,
@@ -392,7 +396,8 @@ class MultiPartParser:
                     if not event.more_data:
                         if isinstance(current_part, Field):
                             value = b"".join(container).decode(
-                                self.get_part_charset(current_part.headers), "replace"
+                                self.get_part_charset(current_part.headers),
+                                self.decode_errors_handling,
                             )
                             fields.append((current_part.name, value))
                         else:

That would allow one to easily define a custom form data parser that's identic in behavior to FormDataParser besides the decode error handling for multipart/form-data requests.

For example:

from werkzeug.formparser import FormDataParser, MultiPartParser


class SurrogateEscapeMultiPartParser(MultiPartParser):
    decode_errors_handling = "surrogateescape"


class SurrogateEscapeFormDataParser(FormDataParser):
    multipart_parser_cls = SurrogateEscapeMultiPartParser

What do you think?

Footnotes

  1. That's obviously nonsense; unfortunately we have no choice but to deal with it as they'll not be fixing this anytime soon.

  2. We could probably write a custom parser that looks only for that charset field, decodes it and then does another pass on the payload using that information, but that would be a lot of added complexity to maintain for little benefit (the perf hit of re-encoding/decoding isn't an issue in this case).

@glumia
Copy link
Author

glumia commented Nov 16, 2024

I've seen now that this was possible before v3 with the encoding_errors parameter on the Request class, but that it was then removed as part of #2602.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant