v0.6.0

gwiedeman released this 22 Jun 18:30
· 40 commits to main since this release
cc6803d

PST encoding bugfix

Fixes a bug where PST decoding did not use the first-priority encoding, which could cause encoding errors in PDF, HTML, and WARC derivatives.

Improve PST HTML body extraction

PST files often contain messages that have no HTML body yet still render as formatted text in Outlook. Outlook and other clients use the RTF body instead. Mailbagit previously ignored RTF bodies in PST sources; it now extracts HTML from the RTF body when no HTML body is present. The extracted HTML is also used for the PDF and WARC derivatives. Previously this was done only for MSG sources.

WARC URI improvement

Previously, WARC derivatives made a custom URI for the important WARC-Target-URI header, using http://mailbag, such as:

http://mailbag/39/body.html
http://mailbag/39/headers.json
http://mailbag/39/attachmentFilename.pdf

This wasn't great: these URIs were likely to create conflicts outside of a mailbag, and they didn't denote a real location, as the WARC-Target-URI is supposed to.

A better approach is to use the Message-ID header, as specified by RFC 2392. We didn't do this originally because Message-ID was thought to be unreliable: we had cases where the Message-ID headers had been stripped. Still, ignoring the field entirely wasn't a great approach, so this change uses Message-ID for the WARC-Target-URI when it is present, and falls back to http://mailbag only when it doesn't get a Message-ID that seems valid.

This approach uses the Message-ID header but strips the leading and trailing angle brackets (<>) that typically wrap it. To make it a valid URI according to RFC 3986, it prepends the mailto: URI scheme.

Thus, the Message-ID header <MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com> becomes the WARC-Target-URI mailto:MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com
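
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not mailbagit's actual code: the function name warc_target_uri, its parameters, and in particular the check for whether a Message-ID "seems valid" (an "@" with text on both sides and no whitespace) are assumptions made for the example.

```python
from typing import Optional


def warc_target_uri(message_id: Optional[str], fallback: str) -> str:
    """Build a mailto: URI from a Message-ID, or return the fallback URI.

    Hypothetical helper for illustration; not part of mailbagit's API.
    """
    if message_id:
        candidate = message_id.strip()
        # Drop the angle brackets that typically wrap a Message-ID.
        if candidate.startswith("<") and candidate.endswith(">"):
            candidate = candidate[1:-1]
        # Assumed validity check: an "@" with text on both sides
        # and no whitespace anywhere in the value.
        local, sep, domain = candidate.partition("@")
        has_space = any(ch.isspace() for ch in candidate)
        if sep and local and domain and not has_space:
            return "mailto:" + candidate
    # Missing or implausible Message-ID: keep the old-style URI.
    return fallback


if __name__ == "__main__":
    header = (
        "<MN2PR04MB579157FAB038D851277E3908F9129"
        "@MN2PR04MB5791.namprd04.prod.outlook.com>"
    )
    print(warc_target_uri(header, "http://mailbag/39/body.html"))
    print(warc_target_uri(None, "http://mailbag/39/body.html"))
```

With the example header from this release, the first call prints the mailto: URI shown above, and the second call, which has no Message-ID, prints the http://mailbag fallback.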

What's Changed

Full Changelog: v0.5.1...v0.6.0