Events and callbacks
When an event occurs, the framework calls the appropriate callbacks to handle it.
There are two types of callbacks:
- Default callbacks: invoked when no applicable custom callback is registered for the specific event. Each default implementation simply logs the event, but this behavior can be overridden.
- Custom (pattern matching) callbacks: registered manually for a specific event type, together with a regex pattern. When the pattern matches the URL of the current request, the associated callback is called. This is useful, for example, when the same event needs to be handled differently for different pages of a website.
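The selection rule described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the framework's implementation: the class `CallbackDispatchSketch` and its `register` and `dispatch` methods are hypothetical, the event is reduced to the request URL, and the pattern is assumed to be tested against the whole URL with `Matcher.matches()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

// Hypothetical stand-in for the dispatch logic of a single event type; not the framework's API.
class CallbackDispatchSketch {

    // A custom callback paired with the URL pattern that activates it.
    private static final class Registration {
        final Pattern urlPattern;
        final Consumer<String> callback;

        Registration(final Pattern urlPattern, final Consumer<String> callback) {
            this.urlPattern = urlPattern;
            this.callback = callback;
        }
    }

    private final List<Registration> customCallbacks = new ArrayList<>();
    private final Consumer<String> defaultCallback;

    CallbackDispatchSketch(final Consumer<String> defaultCallback) {
        this.defaultCallback = defaultCallback;
    }

    void register(final Pattern urlPattern, final Consumer<String> callback) {
        customCallbacks.add(new Registration(urlPattern, callback));
    }

    // Invokes every custom callback whose pattern matches the URL.
    // Falls back to the default callback when none of them applies.
    // Returns the number of custom callbacks that were invoked.
    int dispatch(final String requestUrl) {
        int invoked = 0;
        for (final Registration registration : customCallbacks) {
            // Assumption: the pattern must match the entire URL.
            if (registration.urlPattern.matcher(requestUrl).matches()) {
                registration.callback.accept(requestUrl);
                invoked++;
            }
        }
        if (invoked == 0) {
            defaultCallback.accept(requestUrl);
        }
        return invoked;
    }
}
```

With one callback registered for `.*foo.*`, dispatching a URL that contains "foo" runs only that callback, while any other URL reaches the default callback.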
The following default callbacks are available:
- onBrowserInit
- onStart
- onResponseSuccess
- onNonHtmlResponse
- onNetworkError
- onResponseError
- onRequestRedirect
- onPageLoadTimeout
- onStop
onBrowserInit
Callback which is used to configure the browser before the crawling begins.
By default, it sets the page load timeout to the default value of 3 minutes and maximizes the browser.
onStart
Callback which gets called when the crawler starts.
Typically used to perform initialization (create handles/connections) before the crawling begins.
onResponseSuccess
Callback which gets called when the browser loads the page and the HTTP status code of the response indicates success (2xx).
Typically used to find URLs to follow and to extract data from the HTML source.
onNonHtmlResponse
Callback which gets called when the content type of the response is not text/html.
Provides the opportunity to download the non-HTML resource.
onNetworkError
Callback which gets called when a network error occurs.
onResponseError
Callback which gets called when the browser loads the page and the HTTP status code of the response indicates error (4xx or 5xx).
onRequestRedirect
Callback which gets called when a request is redirected.
onPageLoadTimeout
Callback which gets called when the page does not load in the browser within the timeout period.
onStop
Callback which gets called when the crawler stops.
Typically used to perform resource cleanup (close handles/connections) after the run.
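As a rough illustration of how the response-related callbacks relate to each other, the sketch below maps a status code and content type to a callback name. The helper `ResponseCallbackSketch` is hypothetical. The order of the checks and the treatment of every 3xx status as a redirect are assumptions; the framework may decide differently.

```java
import java.util.Locale;

// Hypothetical helper; the order of the checks is an assumption, not the framework's logic.
class ResponseCallbackSketch {

    // Returns the name of the default callback that would handle the response.
    static String callbackFor(final int statusCode, final String contentType) {
        // Assumption: any 3xx status is treated as a redirect.
        if (statusCode >= 300 && statusCode <= 399) {
            return "onRequestRedirect";
        }
        // Assumption: non-HTML content is detected before the browser loads the page.
        if (!isHtml(contentType)) {
            return "onNonHtmlResponse";
        }
        if (statusCode >= 200 && statusCode <= 299) {
            return "onResponseSuccess";
        }
        if (statusCode >= 400 && statusCode <= 599) {
            return "onResponseError";
        }
        throw new IllegalArgumentException("Status code not covered by this sketch: " + statusCode);
    }

    // Compares only the MIME type, ignoring parameters such as "; charset=UTF-8".
    static boolean isHtml(final String contentType) {
        if (contentType == null) {
            return false;
        }
        final String mimeType = contentType.split(";", 2)[0].trim();
        return mimeType.toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```

Network errors and page load timeouts are not modeled here, because they occur when no usable response arrives at all.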
Custom callbacks are registered using the registerCustomCallback method.
It is possible to register multiple callbacks for the same event type. If more than one callback is applicable (because their associated patterns match the URL of the current request), the framework will call all of them.
Example:
import java.util.regex.Pattern;

public class MyCrawler extends Crawler {

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);

        // Handle successful responses whose URL matches the pattern with a custom callback
        registerCustomCallback(ResponseSuccessEvent.class,
                new PatternMatchingCallback<>(Pattern.compile(".*foo.*"),
                        this::myCustomOnResponseSuccess));
    }

    private void myCustomOnResponseSuccess(final ResponseSuccessEvent event) {
        // ...
    }
}
The specified callback will be invoked when the browser loads a URL containing the word "foo".
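To check which URLs a pattern such as `.*foo.*` selects, it can be tried out with the standard `java.util.regex` classes alone. The sketch below assumes the pattern is tested against the whole URL; the class name `UrlPatternSketch` and the example.com URLs are purely illustrative.

```java
import java.util.regex.Pattern;

// Illustrative only: tries out the pattern from the example without the framework.
class UrlPatternSketch {

    static final Pattern FOO = Pattern.compile(".*foo.*");

    // True when the entire URL matches the pattern, i.e. it contains "foo" somewhere.
    static boolean matchesWholeUrl(final String url) {
        return FOO.matcher(url).matches();
    }

    public static void main(final String[] args) {
        System.out.println(matchesWholeUrl("https://example.com/foo/page")); // true
        System.out.println(matchesWholeUrl("https://example.com/bar"));      // false
        System.out.println(matchesWholeUrl("https://example.com/FOO"));      // false, the match is case-sensitive
    }
}
```

A stricter pattern, for instance one that anchors "foo" to a particular path segment, narrows the set of pages handled by the custom callback.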