-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How do I skip over invalid UTF-16 string? #99
Comments
You current cannot tell libuna to skip over nonconvertible sequences. The reason for this is that from a data format analysis perspective you don't want to silently skip such errors. Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16 |
Certainly; however, after the “investigative part”, when truncation is ok, it comes as unusual having to spin up iconv/icu and do a UTF-16 -> UTF-8//IGNORE conversion; I'd prefer just reusing the conversion from pff/una, since it's already a dependency.
All I know is that this PST was generated with Outlook 2010 (SP3?). \xD8\x3D is not a good codepoint even in UCS-2. It is possible that the data store already has had those bytes and Outlook just passed it on when it exported to PST. |
I also have some nasty .pst that are triggering a similar issue:
Will try to split the (probably corrupt) .pst and attach a sample, but as spammers & hackers are increasingly trying to exploit anything they can, this sort of thing will only manifest more |
are you sure the string is UTF-16? or could it be Windows UCS-2? |
I too have that error when trying to read some PST files.
I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings. @joachimmetz What do I need to do to get you the encoding ? |
there is https://github.com/libyal/libpff/wiki/Troubleshooting#format-or-behavioral-errors, don't need to use the Python bindings, given that the trace back hints the issue surfaces in libuna_utf8_string_size_from_utf16_stream
Please also read up on what a PST is (a MAPI database) and how information is stored. Libpff is intended to provide you low-level access to the data format, but you'll need to understand how the information on top of that is organized. |
As far as I understood the error, there's only one message that affects the extraction. If I skip this message, everything works fine so it doesn't seem like a PST data format problem, but rather an encoding issue in the affected mail. |
I have a somewhat broken PST file here where a 0x3001001F property has a byte sequence \x3D\xD8 in it, which gets rightfully rejected by libuna (>4932) just as it is by /usr/bin/iconv. But unlike iconv where I can pass
-t UTF-8//IGNORE
, how can I tell libuna to skip over unconvertible sequences rather than aborting?The text was updated successfully, but these errors were encountered: