v0.6.0
PST encoding bugfix
Fixes a bug where PST encoding did not use the first-priority encoding, which could cause encoding errors in PDF, HTML, and WARC derivatives.
Improve PST HTML body extraction
PST files often contain messages that have no HTML body yet still render properly in Outlook, because Outlook and other clients use the RTF body instead. Mailbagit previously ignored RTF bodies in PST files; it now extracts HTML from the RTF body when no HTML body is present. The extracted HTML is also used for PDF and WARC derivatives. Previously this was done only for MSG sources.
WARC URI improvement
Previously, WARC derivatives made a custom URI for the important WARC-Target-URI header, using http://mailbag, such as:
http://mailbag/39/body.html
http://mailbag/39/headers.json
http://mailbag/39/attachmentFilename.pdf
This wasn't great: these URIs were likely to create conflicts outside of a mailbag, and they didn't denote a real location, as a WARC-Target-URI is supposed to.
A better approach is to use the Message-ID header, as specified by RFC2392. We didn't do this originally because Message-ID was thought to be unreliable, as we had cases where the Message-ID headers were stripped. Ignoring the field entirely wasn't a great approach either, so this change uses Message-ID for WARC-Target-URI when it is present, and only falls back to http://mailbag if it doesn't get a Message-ID that seems valid.
This approach uses the Message-ID header, but strips the leading and trailing angle brackets (<>) that typically wrap it. To make it a valid URI according to RFC3986, it prepends the mailto: URI scheme.
Thus, the Message-ID header <MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com> becomes the WARC-Target-URI mailto:MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com
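The conversion and fallback can be sketched roughly as below. This is an illustration, not the code from the pull request: the function name warc_target_uri, its parameters, and the regular expression used to decide whether a Message-ID "seems valid" are all assumptions. The sketch also only covers the message body record; how records for headers or attachments are told apart is not shown here.

```python
import re

# Fallback prefix, taken from the older URIs shown in these notes.
FALLBACK_BASE = "http://mailbag"

# Assumption: a Message-ID "seems valid" if it has text on both sides of
# a single "@" and contains no whitespace or angle brackets.
# The real check in mailbagit may differ.
_MESSAGE_ID = re.compile(r"[^\s<>@]+@[^\s<>@]+")


def warc_target_uri(message_id, mailbag_id, filename="body.html"):
    """Hypothetical helper: build a WARC-Target-URI for one message."""
    if message_id:
        candidate = message_id.strip()
        # Strip the angle brackets that typically wrap a Message-ID.
        if candidate.startswith("<") and candidate.endswith(">"):
            candidate = candidate[1:-1]
        if _MESSAGE_ID.fullmatch(candidate):
            # Prepend the mailto: scheme to form a URI.
            return "mailto:" + candidate
    # No usable Message-ID: fall back to the older custom URI.
    return f"{FALLBACK_BASE}/{mailbag_id}/{filename}"


uri = warc_target_uri(
    "<MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com>",
    39,
)
print(uri)
```

A message whose Message-ID header is missing or malformed would get http://mailbag/39/body.html under this sketch, matching the older behavior.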
What's Changed
- Improve PST parsing and WARC URIs by @gwiedeman in #235
Full Changelog: v0.5.1...v0.6.0