Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting FormatException: Failed to parse message headers due to a buffering bug #991

Open
olivier2 opened this issue Jan 13, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@olivier2
Copy link

olivier2 commented Jan 13, 2024

Describe the bug
I get a "Failed to parse message headers." exception when parsing my ~11GB gmail mbox. This happens reliably, always on the same email. When the exception is raised, MimeParser.MboxMarker contains the string 'From : Subject : Date : ', from the middle of the email's DKIM-Signature header (see attachements). Mimekit has no problem parsing the offending email when it's extracted to its own file. So far, extracting 1MB of data around the email in an attempt to reproduce the problem has not been successful. A previous complete gmail mbox (now lost), that contained the same email, also had no parsing issue. I'm guessing it's related to the ReadAheadSize somehow?

Platform:

  • OS: Windows 10
  • .NET Runtime: CoreCLR
  • .NET Framework: .NET 8.0
  • MimeKit Version: 4.3.0

To Reproduce

  1. Run program with offending mbox file

Result

Failed to parse message headers.
   at MimeKit.MimeParser.ParseMessage(Byte* inbuf, CancellationToken cancellationToken) in C:\MimeKit\MimeKit\MimeParser.cs:line 1923
   at MimeKit.MimeParser.ParseMessage(CancellationToken cancellationToken) in C:\MimeKit\MimeParser.cs:line 2016

The location of the error state origin is already lost when the exception is raised, but in this particular case, it comes from line 965 in StepHeaders(...):

if (headers.Count == 0) {
	if (state == MimeParserState.MessageHeaders) {
		// ignore From-lines that might appear at the start of a message
		if (toplevel && (length < 5 || !IsMboxMarker (start, true))) {
			// not a From-line...
			inputIndex = (int) (start - inbuf);
  /* here -> */		state = MimeParserState.Error;
			headerIndex = 0;
			return false;
		}

Expected behavior
No exception

Code Snippets

static void Main(string[] args)
{
    using(var fileStream = File.OpenRead(@"All mail Including Spam and Trash-002.mbox"))
    {
        var parser = new MimeParser(stream, MimeFormat.Mbox);
        while (!parser.IsEndOfStream)
        {
            parser.ParseMessage();
        }  
    }
}

Additional context

Workaround
If I modify IsMboxMarker() to allow 'From ', but ignore 'From :', the exception is not raised. However, I'm not familiar with RFC standards or the performance impact to suggest this as a permanent fix (also, my test case to prove the fix includes a very private ~11GB file):

... *inptr++ == (byte) 'm' && *inptr++ == (byte) ' ' && *inptr != (byte) ':'
@jstedfast
Copy link
Owner

The change to IsMboxMarker is wallpapering over the real problem. Somehow buffering state is getting messed up such that it is skipping over the real mbox marker and a ton of headers and getting confused, thinking that the mbox marker is in the middle of the DKIM header value.

This could be caused by a few things:

  1. The MimeParser is set to RespectContentLength and the previous message has a Content-Length header value that specifies a length that is longer than the real contents
  2. The MimeParser is set to treat the stream as persistent and your code is consuming the message inside the loop that is parsing the messages, thereby throwing off the stream's seek position out from under the MimeParser.
  3. There's a buffer boundary issue bug in the parser.

Have you tried the ExperimentalMimeParser?

At the very least it should be faster.

@olivier2
Copy link
Author

Thanks for the quick feedback!

  1. The MimeParser is set to RespectContentLength and the previous message has a Content-Length header value that specifies a length that is longer than the real contents

I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.

  1. The MimeParser is set to treat the stream as persistent and your code is consuming the message inside the loop that is parsing the messages, thereby throwing off the stream's seek position out from under the MimeParser.

Persistent was the default (false) in the test (see posted code snippet) and I'm not messing with the stream.

  1. There's a buffer boundary issue bug in the parser.

That's my assumption as well, but I'm not sure how to tweak the code to force a reproduction of the issue.

Have you tried the ExperimentalMimeParser?

I just did and it runs flawlessly!

@jstedfast
Copy link
Owner

jstedfast commented Jan 13, 2024

I'm using defaults options, so RespectContentLength is set to false. Also, I tried recreating the email context with several messages before and after, but that didn't reproduce the problem so far.

Yea, if it's a buffering issue (and based on the fact that you are using the defaults and not doing anything with the stream, it sounds like it is), it'll be difficult to construct an mbox that will reproduce the issue because it'll likely be a corner case where some token or another needs to cross a read boundary at some stage in parsing, causing things to get out of whack.

IIRC, the ExperimentalMimeParser is a little more robust in that sense (if my memory is correct, it's been a while) because it tends to consume everything as it goes whereas the old MimeParser implementation has a few areas where it will abort an attempt to consume something if it can't get the whole token, call ReadAhead to refill the buffer (which moves any remaining buffer to the "prebuffer") and then tries again. This is most likely the cause of the issue, but I obviously can't be 100% certain (and there are several places that do that, so narrowing that down would be a pain).

I think the most likely culprit would be the StepHeaders() (or related) logic in MimeParser.cs. That code has gotten a bit scary, even for me, which is one of the reasons it got rewritten in ExperimentalMimeParser.

@jstedfast jstedfast added the bug Something isn't working label Jan 13, 2024
@jstedfast
Copy link
Owner

FWIW, my plan is to make ExperimentalMimeParser the default in v5.0 and rename the current MimeParser to LegacyMimeParser and more-or-less [Obsolete] it.

I meant to do that for v4.0 but forgot, haha.

@jstedfast jstedfast changed the title Parser appears confused by DKIM-Signature header containing string "From :", but not when the offending email is singled out Getting FormatException: Failed to parse message headers due to a buffering bug Jul 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants