use WET files from CommonCrawl #250

dportabella · 2016-09-27T16:06:21Z

CommonCrawl provides WET files (http://commoncrawl.org/the-data/get-started/), which are WARC files where the HTML response has been converted to plain text (and non-HTML pages have been removed).

Is it possible to use WET files with warcbase?

I tried as follows:

val archives = RecordLoader.loadArchives(in, sc)

val htmlPages =
  if (isWET)
    archives
      .map(r => r.getContentString)
  else
    archives
      .keepValidPages()
      .map(r => RemoveHTML(r.getContentString))

If in = "/data/sample.wet.gz", it fails with an "invalid" exception. Since the format is quite similar, I tried renaming the file to /data/sample.warc.gz. However, htmlPages.count is then zero when isWET is true.

Any clue?

jrwiebe · 2016-09-27T16:20:06Z

RecordLoader.loadArchives() only supports ARC and WARC files. We use the
webarchive-commons library for processing these, which does not have WET
support.

dportabella · 2016-09-27T16:24:55Z

Shall I implement this feature (to handle WET archives)?

Is this a feature that you would include in the Warcbase library?

Do you see any shortcomings or problems with this, or have any other comments?

jrwiebe · 2016-09-27T16:34:06Z

I'll leave it to @lintool and @ianmilligan1 to comment on this feature's
desirability.

I would suggest that if you do decide to implement a WET reader, which
seems pretty straightforward, do it as a fork of webarchive-commons and see
if they accept it.

ianmilligan1 · 2016-09-27T18:21:46Z

Thanks for this. My sense is that only CommonCrawl uses WET files, right?

I don't think we would frame it as a priority for our own development time, but if you were able to get WETs incorporated into webarchive-commons or warcbase, we would love to see it. With the combo of reading from S3 directly (as your other issue suggested) and WET functionality, I think that'd make CommonCrawl analysis very useful.

dportabella · 2016-09-29T14:14:53Z

So, it seems that we can use the webarchive-commons library as it is:
iipc/webarchive-commons#66 (comment)

We need only two changes in warcbase:
1- accept the .wet.gz extension
2- filter records by WARC-Type conversion instead of response

warcbase/warcbase-core/src/main/scala/org/warcbase/spark/matchbox/RecordLoader.scala, line 37 in 8ba16e8:

    .filter(r => r._2.getRecord.getHeader.getHeaderValue("WARC-Type").equals("response"))
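A minimal sketch of the two changes, written as plain Scala so it does not depend on warcbase or webarchive-commons. The object and function names (WetSupportSketch, isWet, expectedWarcType, keepRecord) are hypothetical and exist only for illustration. It assumes the file type can be decided from the path extension alone and that WET text records carry WARC-Type: conversion.

```scala
object WetSupportSketch {
  // Hypothetical helper: treat any path ending in .wet.gz as a WET archive.
  def isWet(path: String): Boolean =
    path.toLowerCase.endsWith(".wet.gz")

  // The WARC-Type value to keep: "conversion" for WET, "response" otherwise.
  def expectedWarcType(path: String): String =
    if (isWet(path)) "conversion" else "response"

  // Predicate that could stand in for the hard-coded "response" comparison.
  def keepRecord(path: String, warcType: String): Boolean =
    expectedWarcType(path) == warcType
}
```

With this, keepRecord("/data/sample.wet.gz", "conversion") is true, while the same record type in a file named /data/sample.warc.gz is dropped, which would be consistent with the zero count seen after renaming the file.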

What do you think?

ianmilligan1 · 2016-09-29T14:41:04Z

Interesting! Do you want to give it a try, maybe altering RecordLoader and IngestFiles.java, see if it works with your WETs?

dportabella · 2016-09-29T21:27:23Z

Yup, I'll try that.

anjackson mentioned this issue Sep 29, 2016

support WET files iipc/webarchive-commons#66

Closed
