How do I skip over invalid UTF-16 string? #99

Open
jengelh opened this issue May 28, 2021 · 7 comments

@jengelh

jengelh commented May 28, 2021

I have a somewhat broken PST file here where a 0x3001001F property has a byte sequence \x3D\xD8 in it, which gets rightfully rejected by libuna (>4932) just as it is by /usr/bin/iconv. But unlike iconv where I can pass -t UTF-8//IGNORE, how can I tell libuna to skip over unconvertible sequences rather than aborting?

(gdb) bt
#0  libuna_unicode_character_copy_from_utf16_stream (unicode_character=0x7fffffffd8ac, utf16_stream=<optimized out>, 
    utf16_stream_size=<optimized out>, utf16_stream_index=0x7fffffffd8b0, byte_order=<optimized out>, error=0x0)
    at libuna_unicode_character.c:4932
#1  0x00007ffff734bfe1 in libuna_utf8_string_size_from_utf16_stream (utf16_stream=utf16_stream@entry=0x485c20 "=\330P", 
    utf16_stream_size=utf16_stream_size@entry=40, byte_order=byte_order@entry=108, 
    utf8_string_size=utf8_string_size@entry=0x7fffffffd988, error=error@entry=0x0) at libuna_utf8_string.c:1871
#2  0x00007ffff7f2fbab in libpff_mapi_value_get_data_as_utf8_string_size (error=0x0, utf8_string_size=0x7fffffffd988, 
    ascii_codepage=<optimized out>, value_data_size=40, value_data=0x485c20 "=\330P", value_type=<optimized out>)
    at libpff_mapi_value.c:155
#3  libpff_mapi_value_get_data_as_utf8_string_size (value_type=<optimized out>, value_data=0x485c20 "=\330P.........", value_data_size=40, 
    ascii_codepage=<optimized out>, utf8_string_size=0x7fffffffd988, error=0x0) at libpff_mapi_value.c:90
#4  0x00007ffff7f414b2 in libpff_record_entry_get_data_as_utf8_string_size (record_entry=<optimized out>, 
    utf8_string_size=<optimized out>, error=0x0) at libpff_record_entry.c:1868


4927                    /* Determine if the UTF-16 character is within the low surrogate range
4928                     */
4929                    if( ( utf16_surrogate < LIBUNA_UNICODE_SURROGATE_LOW_RANGE_START )
4930                     || ( utf16_surrogate > LIBUNA_UNICODE_SURROGATE_LOW_RANGE_END ) )
4931                    {
4932>                           libcerror_error_set(
4933                             error,
4934                             LIBCERROR_ERROR_DOMAIN_RUNTIME,
4935                             LIBCERROR_RUNTIME_ERROR_UNSUPPORTED_VALUE,
4936                             "%s: unsupported low surrogate UTF-16 character.",
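
For context, the check at line 4932 fires when a high surrogate is not followed by a code unit in the low surrogate range. The Python sketch below mirrors that rule for illustration only; it is not libuna's code, `combine_surrogates` is a hypothetical name, and the 0xD800-0xDBFF / 0xDC00-0xDFFF ranges are the standard UTF-16 surrogate ranges rather than values copied from the libuna headers.

```python
import struct

HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point, or raise ValueError."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError(f"0x{high:04X} is not a high surrogate")
    if not LOW_START <= low <= LOW_END:
        raise ValueError(f"0x{low:04X} is not a low surrogate")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The bytes \x3D\xD8, read as one little-endian 16-bit unit, give 0xD83D,
# which lies in the high surrogate range and therefore needs a partner.
(first,) = struct.unpack("<H", b"\x3d\xd8")
```

Calling `combine_surrogates(first, 0x0050)` raises `ValueError`, because an ordinary character such as 0x0050 cannot complete the pair.
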
@joachimmetz
Member

joachimmetz commented May 30, 2021

You currently cannot tell libuna to skip over nonconvertible sequences. The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Also, can you tell me more about this PST file, to rule out an older version of the format that may use UCS-2 instead of UTF-16?

@joachimmetz joachimmetz self-assigned this May 30, 2021
@joachimmetz joachimmetz changed the title from "Skipping over illegal data when converting property data to UTF-8" to "How do I skip over invalid UTF-16 string?" May 30, 2021
@jengelh
Author

jengelh commented Jul 5, 2021

> The reason for this is that from a data format analysis perspective you don't want to silently skip such errors.

Certainly; however, after the “investigative part”, when truncation is OK, it feels awkward to have to spin up iconv/ICU just for a UTF-16 -> UTF-8//IGNORE conversion. I'd prefer to reuse the conversion from pff/una, since it is already a dependency.
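
As a stopgap outside libuna, Python's built-in codecs can perform this kind of lossy conversion. The following is a minimal sketch, assuming the raw property bytes are already in hand and are little-endian UTF-16; the helper name `utf16le_to_utf8_lossy` and the sample bytes are hypothetical.

```python
def utf16le_to_utf8_lossy(raw, errors="ignore"):
    """Convert UTF-16LE bytes to UTF-8, dropping or replacing invalid sequences.

    errors="ignore" drops undecodable code units, similar in spirit to
    iconv's //IGNORE; errors="replace" substitutes U+FFFD instead.
    """
    return raw.decode("utf-16-le", errors=errors).encode("utf-8")


# Hypothetical sample: a lone high surrogate (0xD83D) followed by valid text.
sample = b"\x3d\xd8" + "Paris".encode("utf-16-le")
cleaned = utf16le_to_utf8_lossy(sample)                   # surrogate dropped
marked = utf16le_to_utf8_lossy(sample, errors="replace")  # surrogate -> U+FFFD
```

Using `errors="replace"` keeps a visible marker where data was lost, which may be preferable when the output is later inspected.
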

> Also can you tell me more about this PST file to rule out some older version of the format maybe using UCS-2 instead of UTF-16

All I know is that this PST was generated with Outlook 2010 (SP3?). \xD8\x3D is not a valid code point even in UCS-2. It is possible that the data store already contained those bytes and Outlook just passed them on when it exported to PST.

@NigelPearson

NigelPearson commented Oct 4, 2023

I also have some nasty .pst files that are triggering a similar issue:

$ anaconda3/bin/python ./extract.py
Writing messages to /Users/markmail/Desktop/Outlook Files/Backup
Number of messages: 4647
Number of messages: 3296
Traceback (most recent call last):
  File "/Users/markmail/./extract.py", line 25, in <module>
    Subj = message.subject
OSError: pypff_message_get_subject: unable to retrieve subject size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_internal_item_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size. libpff_message_get_entry_value_utf8_string_size: unable to retrieve UTF-8 string size.

I will try to split the (probably corrupt) .pst and attach a sample, but as spammers and hackers increasingly try to exploit anything they can, this sort of thing will only become more common.

@joachimmetz
Member

joachimmetz commented Oct 4, 2023

Are you sure the string is UTF-16? Or could it be Windows UCS-2?
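
One way to approach this question from Python is to scan the raw 16-bit code units for surrogates that are not part of a valid pair. This is a sketch, assuming little-endian data; `find_unpaired_surrogates` is a hypothetical helper and not part of libuna or pypff.

```python
import struct


def find_unpaired_surrogates(raw):
    """Return (byte_offset, code_unit) for each surrogate lacking a valid partner."""
    count = len(raw) // 2  # a trailing odd byte is ignored
    units = struct.unpack(f"<{count}H", raw[: count * 2])
    bad = []
    i = 0
    while i < count:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: valid only if a low surrogate follows directly.
            if i + 1 < count and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append((i * 2, unit))
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append((i * 2, unit))
        i += 1
    return bad
```

If the data contains well-formed pairs and no stray surrogates, it is most likely genuine UTF-16. Stray surrogates do not encode a character under either interpretation, which points to damaged data rather than a different encoding.
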

@deajan

deajan commented Oct 6, 2023

I too have that error when trying to read some PST files.

Traceback (most recent call last):
  File "/stor/user/ext/extractor.py", line 38, in <module>
    extract_eml(file)
  File "/stor/user/ext/extractor.py", line 26, in extract_eml
    print(message.plain_text_body)
OSError: pypff_message_get_plain_text_body: unable to retrieve plain text body size. libuna_unicode_character_copy_from_utf16_stream: unsupported UTF-16 character. libuna_utf8_string_size_from_utf16_stream: unable to copy Unicode character from UTF-16 stream. libpff_mapi_value_get_data_as_utf8_string_size: unable to determine size of value data as UTF-8 string. libpff_record_entry_get_data_as_utf8_string_size_with_codepage: unable to determine size of value data as UTF-8 string. libpff_message_get_plain_text_body_size: unable to determine message body size.

I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with the libpff Python bindings.
I've loaded the PST file in Outlook and extracted the offending message as both .msg and .txt files, but file -bi under Linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

@joachimmetz What do I need to do to get you the encoding?

@joachimmetz
Member

> I am willing to extract that message in raw format for analysis, but I have no idea how to achieve this with libpff python bindings.

There is https://github.com/libyal/libpff/wiki/Troubleshooting#format-or-behavioral-errors. You don't need to use the Python bindings, given that the traceback hints that the issue surfaces in libuna_utf8_string_size_from_utf16_stream.

> I've loaded the PST file in outlook and extracted the offending message as both .msg and .txt files, but file -bi under linux gives me application/vnd.ms-outlook; charset=binary and text/plain; charset=unknown-8bit.

Please also read up on what a PST is (a MAPI database) and how information is stored. Libpff is intended to provide you low-level access to the data format, but you'll need to understand how the information on top of that is organized.

@deajan

deajan commented Oct 6, 2023

As far as I understand the error, only one message affects the extraction. If I skip this message, everything works fine, so it doesn't seem like a PST data format problem, but rather an encoding issue in the affected mail.
I'd love to help debug, but I can only do this via Python; I'm not a C guy.
Is there perhaps a --debug parameter in the Python bindings?
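
Until the underlying data is diagnosed, a script can skip just the unreadable fields rather than aborting. This is a sketch, assuming (as in the tracebacks above) that pypff raises OSError when a string property cannot be converted; `read_field` is a hypothetical helper, and `FakeMessage` merely stands in for a pypff message so the example runs without a PST file.

```python
def read_field(item, name, default=None):
    """Return getattr(item, name), or default if the binding raises OSError."""
    try:
        return getattr(item, name)
    except OSError as exc:
        # Report the failure so the skipped property can be examined later.
        print(f"skipping {name}: {exc}")
        return default


class FakeMessage:
    """Stand-in for a pypff message whose body cannot be converted."""

    subject = "hello"

    @property
    def plain_text_body(self):
        raise OSError("pypff_message_get_plain_text_body: unable to retrieve plain text body size.")


message = FakeMessage()
subject = read_field(message, "subject")
body = read_field(message, "plain_text_body", default="")
```

This hides the conversion error from the rest of the script, so it suits bulk extraction better than format analysis.
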
