How do I skip over invalid UTF-16 string? #99

jengelh · 2021-05-28T11:37:32Z

I have a somewhat broken PST file here where a 0x3001001F property has a byte sequence \x3D\xD8 in it, which gets rightfully rejected by libuna (>4932) just as it is by /usr/bin/iconv. But unlike iconv where I can pass -t UTF-8//IGNORE, how can I tell libuna to skip over unconvertible sequences rather than aborting?

(gdb) bt
#0  libuna_unicode_character_copy_from_utf16_stream (unicode_character=0x7fffffffd8ac, utf16_stream=<optimized out>, 
    utf16_stream_size=<optimized out>, utf16_stream_index=0x7fffffffd8b0, byte_order=<optimized out>, error=0x0)
    at libuna_unicode_character.c:4932
#1  0x00007ffff734bfe1 in libuna_utf8_string_size_from_utf16_stream (utf16_stream=utf16_stream@entry=0x485c20 "=\330P", 
    utf16_stream_size=utf16_stream_size@entry=40, byte_order=byte_order@entry=108, 
    utf8_string_size=utf8_string_size@entry=0x7fffffffd988, error=error@entry=0x0) at libuna_utf8_string.c:1871
#2  0x00007ffff7f2fbab in libpff_mapi_value_get_data_as_utf8_string_size (error=0x0, utf8_string_size=0x7fffffffd988, 
    ascii_codepage=<optimized out>, value_data_size=40, value_data=0x485c20 "=\330P", value_type=<optimized out>)
    at libpff_mapi_value.c:155
#3  libpff_mapi_value_get_data_as_utf8_string_size (value_type=<optimized out>, value_data=0x485c20 "=\330P.........", value_data_size=40, 
    ascii_codepage=<optimized out>, utf8_string_size=0x7fffffffd988, error=0x0) at libpff_mapi_value.c:90
#4  0x00007ffff7f414b2 in libpff_record_entry_get_data_as_utf8_string_size (record_entry=<optimized out>, 
    utf8_string_size=<optimized out>, error=0x0) at libpff_record_entry.c:1868


4927                    /* Determine if the UTF-16 character is within the low surrogate range
4928                     */
4929                    if( ( utf16_surrogate < LIBUNA_UNICODE_SURROGATE_LOW_RANGE_START )
4930                     || ( utf16_surrogate > LIBUNA_UNICODE_SURROGATE_LOW_RANGE_END ) )
4931                    {
4932>                           libcerror_error_set(
4933                             error,
4934                             LIBCERROR_ERROR_DOMAIN_RUNTIME,
4935                             LIBCERROR_RUNTIME_ERROR_UNSUPPORTED_VALUE,
4936                             "%s: unsupported low surrogate UTF-16 character.",


joachimmetz · 2021-05-30T06:40:19Z

You currently cannot tell libuna to skip over nonconvertible sequences. The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Also, can you tell me more about this PST file, to rule out some older version of the format maybe using UCS-2 instead of UTF-16?

jengelh · 2021-07-05T19:29:13Z

> The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Certainly; however, after the “investigative part”, when truncation is ok, it comes as unusual having to spin up iconv/icu and do a UTF-16 -> UTF-8//IGNORE conversion; I'd prefer just reusing the conversion from pff/una, since it's already a dependency.

> Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16

All I know is that this PST was generated with Outlook 2010 (SP3?). \xD8\x3D is not a good codepoint even in UCS-2. It is possible that the data store already has had those bytes and Outlook just passed it on when it exported to PST.
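For readers who can get at the raw property bytes by other means, the iconv `-t UTF-8//IGNORE` behaviour mentioned above can be approximated with Python's built-in codec error handlers. This is only a minimal sketch and not something libuna or libpff provides; the sample buffer and the helper name `decode_utf16le_lossy` are made up for illustration.

```python
# Hypothetical sample: the reported bytes 0x3D 0xD8 (code unit 0xD83D, a high
# surrogate when read little-endian) followed by the ordinary text "PST".
raw = b"\x3d\xd8" + "PST".encode("utf-16-le")


def decode_utf16le_lossy(data: bytes, drop: bool = True) -> str:
    """Decode UTF-16LE, dropping (or replacing) undecodable code units."""
    return data.decode("utf-16-le", errors="ignore" if drop else "replace")


try:
    raw.decode("utf-16-le")  # strict mode rejects the unpaired surrogate
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

print(decode_utf16le_lossy(raw))              # unpaired surrogate dropped
print(decode_utf16le_lossy(raw, drop=False))  # replaced with U+FFFD
```

Using `errors="replace"` instead of `"ignore"` keeps a visible marker where data was lost, which may be preferable once the investigative part is over.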

NigelPearson · 2023-10-04T05:24:26Z

I also have some nasty .pst that are triggering a similar issue:

$ anaconda3/bin/python ./extract.py
Writing messages to /Users/markmail/Desktop/Outlook Files/Backup
Number of messages: 4647
Number of messages: 3296
Traceback (most recent call last):
  File "/Users/markmail/./extract.py", line 25, in <module>
    Subj = message.subject
OSError: pypff_message_get_subject: unable to retrieve subject size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_internal_item_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size. libpff_message_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size.

Will try to split the (probably corrupt) .pst and attach a sample, but as spammers & hackers are increasingly trying to exploit anything they can, this sort of thing will only become more common.

joachimmetz · 2023-10-04T05:29:12Z

Are you sure the string is UTF-16, or could it be Windows UCS-2?
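One way to gather evidence for that question, assuming the raw little-endian string bytes can be dumped, is to scan the 16-bit code units for surrogates. Correctly paired surrogates point to UTF-16; an unpaired one (such as 0xD83D followed by an ordinary character) is invalid in UTF-16, and no UCS-2 character is assigned to that range either. The helper name `unpaired_surrogates` below is hypothetical, a sketch rather than part of any library.

```python
import struct


def unpaired_surrogates(data: bytes) -> list:
    """Return (byte_offset, code_unit) for each unpaired surrogate in UTF-16LE data."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    units = struct.unpack(f"<{usable // 2}H", data[:usable])
    found = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            found.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate without a preceding high surrogate.
            found.append((i * 2, unit))
        i += 1
    return found
```

An empty result on data that does contain surrogate pairs suggests well-formed UTF-16; any reported offsets show where the damaged code units sit.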

deajan · 2023-10-06T11:29:23Z

I too have that error when trying to read some PST files.

Traceback (most recent call last):
  File "/stor/user/ext/extractor.py", line 38, in <module>
    extract_eml(file)
  File "/stor/user/ext/extractor.py", line 26, in extract_eml
    print(message.plain_text_body)
OSError: pypff_message_get_plain_text_body: unable to retrieve plain text body size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_message_get_plain_text_body_size: unable to determine message body size.

I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with the libpff Python bindings.
I've loaded the PST file in Outlook and extracted the offending message as both .msg and .txt files, but file -bi under Linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

@joachimmetz What do I need to do to get you the encoding?

joachimmetz · 2023-10-06T17:43:23Z

> I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings.

There is https://github.com/libyal/libpff/wiki/Troubleshooting#format-or-behavioral-errors; you don't need to use the Python bindings, given that the traceback hints that the issue surfaces in libuna_utf8_string_size_from_utf16_stream.

> I've loaded the PST file in outlook and extracted the offending message as both .msg and .txt files, but file -bi under linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

Please also read up on what a PST is (a MAPI database) and how information is stored. Libpff is intended to provide you low-level access to the data format, but you'll need to understand how the information on top of that is organized.

deajan · 2023-10-06T20:23:05Z

As far as I understood the error, there's only one message that affects the extraction. If I skip this message, everything works fine so it doesn't seem like a PST data format problem, but rather an encoding issue in the affected mail.
I'd love to help debug, but I can only do this via Python; I'm not a C guy.
Is there perhaps a --debug parameter in the Python bindings?
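Skipping only the field that fails, rather than the whole message, can be sketched on the Python side by catching the `OSError` seen in the tracebacks above. This does not repair or convert the string; it only keeps an extraction loop running. The helper name `safe_attr` and its default values are hypothetical.

```python
def safe_attr(item, name, default=None):
    """Return getattr(item, name), or default if the binding raises OSError."""
    try:
        return getattr(item, name)
    except OSError as exc:
        # Report only the first sentence of the chained libyal error message.
        print(f"skipping {name}: {str(exc).split('.')[0]}")
        return default


# Usage with pypff (attribute names taken from the tracebacks above):
#     subject = safe_attr(message, "subject", default="<undecodable subject>")
#     body = safe_attr(message, "plain_text_body", default="")
```

Logging the skipped attribute keeps a record of which messages were affected, so they can still be examined later.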

joachimmetz self-assigned this May 30, 2021

joachimmetz added the question label May 30, 2021

joachimmetz changed the title ~~Skipping over illegal data when converting property data to UTF-8~~ How do I skip over invalid UTF-16 string? May 30, 2021
