use WET files from CommonCrawl #250
Comments
RecordLoader.loadArchives() only supports ARC and WARC files. We use the [...]
|
Shall I implement this feature (to handle WET archives)? Is this a feature that you would include in the Warcbase library? Do you see any shortcomings or problems with this? |
I'll leave it to @lintool and @ianmilligan1 to comment on this feature's [...] I would suggest that if you do decide to implement a WET reader, which [...]
|
Thanks for this. My sense is that only CommonCrawl uses WET files, right? I don't think we would frame it as a priority for our own development time, but if you were able to get WETs incorporated into webarchive-commons or warcbase, we would love to see it. With the combo of reading from S3 directly (as your other issue suggested) and WET functionality, I think that'd make CommonCrawl analysis very useful. |
So, it seems that we can use the webarchive-commons library as it is. We need only two changes in warcbase: warcbase/warcbase-core/src/main/scala/org/warcbase/spark/matchbox/RecordLoader.scala (line 37 in 8ba16e8)
What do you think? |
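The exact changes are not preserved in this thread. As a rough sketch of one plausible kind of change, assuming the loader currently keeps only records whose WARC-Type header is response: CommonCrawl WET files mark their plain-text records with WARC-Type: conversion, so a record-type filter would need to accept that value as well. The names ArchiveRec, WetFilter and keepRecord below are hypothetical and are not part of warcbase or webarchive-commons.

```scala
// Hypothetical stand-in for a parsed archive record; this is NOT a warcbase
// or webarchive-commons type, only an illustration.
final case class ArchiveRec(headers: Map[String, String], body: String) {
  // Value of the WARC-Type header, or an empty string if it is missing.
  def warcType: String = headers.getOrElse("WARC-Type", "")
}

object WetFilter {
  // WARC files store fetched pages as "response" records, while CommonCrawl
  // WET files store the extracted plain text as "conversion" records.
  // Assumption: the caller knows which kind of file it is reading.
  def keepRecord(rec: ArchiveRec, isWet: Boolean): Boolean =
    if (isWet) rec.warcType == "conversion"
    else rec.warcType == "response"
}
```

With a filter like this, a WET record is kept only when isWet is set, and a response-only filter drops it.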
Interesting! Do you want to give it a try, maybe altering |
Yup, I'll try that. |
CommonCrawl provides WET files, which are WARC files in which each HTML response has been converted to plain text (and non-HTML pages have been removed).
Is it possible to use WET files with warcbase?
I tried as follows:
If in = "/data/sample.wet.gz", it complains with an invalid exception. As the format looks quite similar, I tried renaming the file to /data/sample.warc.gz. However, htmlPages.count is zero when isWET is true. Any clue?
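One possible explanation for the zero count, not confirmed in this thread: a WET record carries WARC-Type: conversion and Content-Type: text/plain, and its payload is the extracted text with no HTTP response headers in front of it. A pipeline that keeps only response records, or only pages with an HTML MIME type, would then match nothing. The sketch below parses one WET-style record using only the standard library to make those header values visible. The sample record and the names WetRecordSketch and parse are made up for illustration.

```scala
object WetRecordSketch {
  // Illustrative WET-style record; the URI and text are invented for this sketch.
  val sample: String =
    Seq(
      "WARC/1.0",
      "WARC-Type: conversion",
      "WARC-Target-URI: http://example.com/",
      "Content-Type: text/plain",
      "Content-Length: 11",
      "",
      "Hello world"
    ).mkString("\r\n")

  // Splits one record into its header map and its payload.
  // The first line ("WARC/1.0") is the version line, not a header.
  def parse(record: String): (Map[String, String], String) = {
    val sep = record.indexOf("\r\n\r\n")
    val (head, body) =
      if (sep < 0) (record, "")
      else (record.substring(0, sep), record.substring(sep + 4))
    val headers = head
      .split("\r\n")
      .toList
      .drop(1)
      .filter(_.contains(":"))
      .map { line =>
        // Split on the first colon only, so URIs in the value stay intact.
        val i = line.indexOf(':')
        line.substring(0, i).trim -> line.substring(i + 1).trim
      }
      .toMap
    (headers, body)
  }
}
```

Calling WetRecordSketch.parse(WetRecordSketch.sample) yields a header map whose WARC-Type is conversion rather than response, which is the difference a WET-aware loader would have to handle.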