support WET files #66

Closed
dportabella opened this issue Sep 27, 2016 · 3 comments

Comments

@dportabella
Contributor

CommonCrawl provides WET files, which are WARC files in which each HTML response has been converted to plain text (and non-HTML pages have been removed).

Is it possible to support WET files with webarchive-commons?

Or shall I implement this feature (to handle WET archives) myself?

Is this a feature that you would include in the webarchive-commons library?

Do you see any shortcomings or problems with this, or have any other comments?

@anjackson
Member

I'm pretty sure a WET is just a WARC file, but with 'conversion' records that contain text/plain. So, parsing WET files is already supported, at least in terms of basic parsing.

Looking at lintool/warcbase#250 and tracking down the RecordLoader it looks like that is automatically filtering out anything other than 'response' records. If that was made configurable so you could change it to access conversion records, I think you're all set.
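To illustrate the idea of a configurable record-type filter, here is a minimal sketch in plain Python. It is not the webarchive-commons or warcbase API: `make_record`, `iter_records`, and the `wanted_types` parameter are hypothetical names, the sample records are made up, and the parser only handles uncompressed input with a `Content-Length` header (Common Crawl distributes WET files gzip-compressed, so a real reader would have to decompress first).

```python
import io


def make_record(warc_type, payload, content_type="text/plain"):
    """Build one uncompressed WARC record as bytes (illustrative headers only)."""
    body = payload.encode("utf-8")
    headers = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    # A record ends with two CRLF pairs after the payload.
    return headers.encode("utf-8") + body + b"\r\n\r\n"


def iter_records(stream, wanted_types=frozenset({"response"})):
    """Yield (headers, payload) for records whose WARC-Type is in wanted_types.

    The default mimics a loader that only passes 'response' records through.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        if headers.get("WARC-Type") in wanted_types:
            yield headers, payload


# Made-up sample shaped like a WET file: a warcinfo record plus a conversion record.
sample = io.BytesIO(
    make_record("warcinfo", "software: example", "application/warc-fields")
    + make_record("conversion", "Plain text extracted from a page.")
)

default_hits = list(iter_records(sample))  # response-only filter
sample.seek(0)
wet_hits = list(iter_records(sample, wanted_types={"conversion"}))
```

With the response-only default the sample yields nothing, while passing `wanted_types={"conversion"}` returns the plain-text record. That is the behaviour change a configurable filter in the RecordLoader would provide.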

@kris-sigur
Member

+1 to Andy's comments. Any 'deeper' or more advanced processing of WET files probably belongs in a dedicated 'WET library'.

@dportabella
Contributor Author

Cool! I just told the warcbase guys about this:
lintool/warcbase#250 (comment)
