support WET files #66

dportabella · 2016-09-27T21:11:50Z

CommonCrawl provides WET files, which are WARC files in which the HTML responses have been converted to plain text (and non-HTML pages have been removed).

Is it possible to support WET files with webarchive-commons?

Or shall I implement this feature (to handle WET archives) myself?

Is this a feature that you would include in the webarchive-commons library?

Do you see any shortcomings or problems with this, or have any comments?

anjackson · 2016-09-29T12:21:20Z

I'm pretty sure a WET is just a WARC file, but with 'conversion' records that contain text/plain. So, parsing WET files is already supported, at least in terms of basic parsing.

Looking at lintool/warcbase#250 and tracking down the RecordLoader, it looks like it automatically filters out anything other than 'response' records. If that were made configurable so you could change it to access 'conversion' records, I think you'd be all set.
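To illustrate the point about a configurable record-type filter, here is a minimal sketch in plain Python. It is not the webarchive-commons or warcbase API: the names `iter_warc_records`, `records_of_type`, and `build_record` are hypothetical, and the sample data is built in memory rather than taken from a real CommonCrawl file. It assumes an uncompressed stream and only the basic WARC record layout (version line, headers, blank line, `Content-Length` bytes of payload).

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def records_of_type(stream, wanted_types=("response",)):
    """Keep only records whose WARC-Type is in wanted_types (the configurable part)."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") in wanted_types:
            yield headers, payload


def build_record(warc_type, content_type, body):
    """Hypothetical helper: serialize one tiny WARC record for the demo below."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"Content-Type: {content_type}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Made-up sample data: one record of each kind discussed in this thread.
SAMPLE = b"".join([
    build_record("warcinfo", "application/warc-fields", b"software: example\r\n"),
    build_record("conversion", "text/plain", b"Hello, plain text."),
    build_record("response", "application/http; msgtype=response",
                 b"HTTP/1.1 200 OK\r\n\r\n<html></html>"),
])

if __name__ == "__main__":
    # Default filter keeps only 'response'; a WET reader would ask for 'conversion'.
    for headers, payload in records_of_type(io.BytesIO(SAMPLE), ("conversion",)):
        print(headers["content-type"], payload.decode("utf-8"))
```

With the default `wanted_types` the loop above would skip the text record entirely, which matches the behaviour described for the RecordLoader. Real WET files are gzip-compressed, so in practice the stream would likely come from something like `gzip.open(path, "rb")` instead of `io.BytesIO`.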

kris-sigur · 2016-09-29T12:24:30Z

+1 to Andy's comments. Any 'deeper' or more advanced processing of WET files probably belongs in a dedicated 'WET library'.

dportabella · 2016-09-29T14:15:31Z

Cool! I just told the warcbase guys about this: lintool/warcbase#250 (comment)

dportabella mentioned this issue Sep 29, 2016

use WET files from CommonCrawl lintool/warcbase#250

Open

anjackson closed this as completed Sep 29, 2016
