Getting started
Péter Bencze edited this page May 31, 2019 · 13 revisions
The Crawler abstract class provides a skeletal implementation of a crawler, which minimizes the effort required to create your own. The extending class implements the actual crawling logic by overriding the callbacks it needs.
Below is a simple example that is enough to get you started:
public class MyCrawler extends Crawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // A helper class that is intended to make it easier to find URLs on web pages
        urlFinder = UrlFinder.createDefault();
    }

    @Override
    protected void onResponseSuccess(final ResponseSuccessEvent event) {
        // Crawl every URL found on the page
        urlFinder.findUrlsInPage(event.getCompleteCrawlResponse())
                .stream()
                .map(CrawlRequest::createDefault)
                .forEach(this::crawl);

        // ...
    }
}
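To make the "skeletal implementation" idea concrete, here is a minimal, library-free sketch of the pattern: a base class owns the request queue and the processing loop, and the extending class only fills in a callback. This is not how the Crawler class is actually written; `SkeletalCrawlerSketch`, `RecordingCrawlerSketch`, and their members are hypothetical names used purely for illustration, and no real pages are fetched.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: a generic skeletal base class, NOT Serritor's Crawler.
// All names in this block are hypothetical.
abstract class SkeletalCrawlerSketch {

    private final Deque<String> queue = new ArrayDeque<>();

    // Queues a URL so that it is processed later by the loop in start()
    protected final void crawl(final String url) {
        queue.add(url);
    }

    // The base class owns the loop; subclasses only react to callbacks
    public final void start(final String seed) {
        crawl(seed);
        while (!queue.isEmpty()) {
            onResponseSuccess(queue.poll());
        }
    }

    // Hook that the extending class overrides to implement its own logic
    protected abstract void onResponseSuccess(String url);
}

final class RecordingCrawlerSketch extends SkeletalCrawlerSketch {

    final List<String> visited = new ArrayList<>();

    @Override
    protected void onResponseSuccess(final String url) {
        visited.add(url);

        // Follow one hard-coded "link" from the seed so the loop has something to do
        if (url.equals("seed")) {
            crawl("seed/child");
        }
    }
}
```

Calling `new RecordingCrawlerSketch().start("seed")` processes the seed first and then the URL queued from inside the callback, which mirrors how `MyCrawler` above feeds newly found URLs back into the crawler through `crawl`.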
By default, the crawler uses the HtmlUnit headless browser:
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with HtmlUnit
crawler.start();
You can also use other browsers. Currently, Chrome and Firefox are supported. For example, to crawl with Chrome:
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFilterEnabled(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(CrawlRequest.createDefault("http://example.com"))
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start crawling with Chrome
crawler.start(Browser.CHROME);
That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver, so you can use all the features that are provided by Selenium.
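To show what "filtering duplicate and offsite requests" means in practice, the following plain-Java sketch combines both checks. It does not reproduce the crawler's internal filters: `RequestFilterSketch` and `shouldCrawl` are hypothetical names, and treating subdomains of an allowed domain as on-site is an assumption of this sketch, not documented behaviour.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: NOT the crawler's actual filter implementation.
// All names in this block are hypothetical.
final class RequestFilterSketch {

    private final Set<String> allowedDomains = new HashSet<>();
    private final Set<String> seenUrls = new HashSet<>();

    RequestFilterSketch(final String... domains) {
        for (String domain : domains) {
            allowedDomains.add(domain.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true only for URLs on an allowed domain that have not been accepted before
    boolean shouldCrawl(final String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // No host component (e.g. a relative URL), so nothing to match
        }
        host = host.toLowerCase(Locale.ROOT);

        boolean onsite = false;
        for (String domain : allowedDomains) {
            // Assumption: accept the domain itself and any of its subdomains
            if (host.equals(domain) || host.endsWith("." + domain)) {
                onsite = true;
                break;
            }
        }

        // Set.add returns false when the URL is already present, i.e. a duplicate
        return onsite && seenUrls.add(url);
    }

    public static void main(final String[] args) {
        RequestFilterSketch filter = new RequestFilterSketch("example.com");
        System.out.println(filter.shouldCrawl("http://example.com/a"));     // true
        System.out.println(filter.shouldCrawl("http://example.com/a"));     // false (duplicate)
        System.out.println(filter.shouldCrawl("http://blog.example.com/")); // true (subdomain)
        System.out.println(filter.shouldCrawl("http://other.org/"));        // false (offsite)
    }
}
```

The sketch compares URLs as exact strings, so `http://example.com/a` and `http://example.com/a#top` count as different requests; a real crawler may normalize URLs before comparing them.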