support WET files #66
Comments
I'm pretty sure a WET is just a WARC file, but with 'conversion' records that contain text/plain. So parsing WET files is already supported, at least at the level of basic parsing. Looking at lintool/warcbase#250 and tracking down the RecordLoader, it looks like it automatically filters out anything other than 'response' records. If that were made configurable so you could change it to access 'conversion' records, I think you'd be all set.
+1 to Andy's comments. Any 'deeper' or more advanced processing of WET files probably belongs in a dedicated 'WET library'.
Cool! I just told the warcbase guys about this.
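To make the suggestion above concrete, here is a minimal sketch of a record-type filter that is configurable rather than hard-wired to 'response'. It is written in plain Python with the standard library only and is not the webarchive-commons or warcbase API. The names `iter_warc_records`, `iter_records_of_type`, and `build_record` are hypothetical helpers invented for this illustration, and the parser assumes an uncompressed stream with well-formed `WARC-Type` and `Content-Length` headers.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream.

    Hypothetical helper: header names are lower-cased, no validation beyond
    the version line and Content-Length is attempted.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def iter_records_of_type(stream, wanted_types=("response",)):
    """Yield only records whose WARC-Type is in wanted_types.

    The default mirrors the 'response'-only behaviour described above;
    passing ("conversion",) selects the plain-text records of a WET file.
    """
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") in wanted_types:
            yield headers, payload


def build_record(warc_type, body):
    """Build a tiny synthetic record for demonstration (hypothetical helper)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


if __name__ == "__main__":
    sample = (
        build_record("warcinfo", b"software: example")
        + build_record("conversion", b"Hello plain text")
        + build_record("response", b"HTTP/1.1 200 OK\r\n\r\n<html></html>")
    )
    for headers, payload in iter_records_of_type(io.BytesIO(sample), ("conversion",)):
        print(headers["warc-type"], payload.decode("utf-8"))
```

Real WET files are normally gzip-compressed, so in practice the stream would likely come from something like `gzip.open(path, "rb")` instead of `io.BytesIO`. That part is an assumption about usage and is not exercised in the sketch.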
CommonCrawl provides WET files, which are WARC files in which the HTML responses have been converted to plain text (and non-HTML pages have been removed).
Is it possible to support WET files with webarchive-commons?
Or shall I implement this feature (to handle WET archives) myself?
Is this a feature that you would include in the webarchive-commons library?
Do you see any shortcomings or problems, or have any comments on this?