Serritor

Serritor is an open-source web crawler framework built on Selenium and written in Java. It can crawl dynamic web pages that rely on JavaScript.

Using Serritor in your build

Maven

Add the following dependency to your pom.xml:

<dependency>
    <groupId>com.github.peterbencze</groupId>
    <artifactId>serritor</artifactId>
    <version>1.4.0</version>
</dependency>

Gradle

Add the following dependency to your build.gradle:

implementation group: 'com.github.peterbencze', name: 'serritor', version: '1.4.0'

Manual dependencies

The standalone JAR files are available on the releases page.

Documentation

The Wiki contains usage information and examples.
The Javadoc is available online.

Quickstart

The abstract BaseCrawler class provides a skeletal implementation of a crawler, minimizing the effort needed to create your own. The extending class defines the crawling logic.

Below you can find a simple example that is enough to get you started:

public class MyCrawler extends BaseCrawler {

    private final UrlFinder urlFinder;

    public MyCrawler(final CrawlerConfiguration config) {
        super(config);
        
        // Extract URLs from links on the crawled page
        urlFinder = new UrlFinderBuilder(Pattern.compile(".+")).build();
    }

    @Override
    protected void onPageLoad(final PageLoadEvent event) {
        // Crawl every URL that matches the given pattern
        urlFinder.findUrlsInPage(event)
                .stream()
                .map(CrawlRequestBuilder::new)
                .map(CrawlRequestBuilder::build)
                .forEach(this::crawl);
        
        // ...
    }
}
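
The pattern passed to UrlFinderBuilder in the example above decides which URLs are followed, and ".+" accepts any non-empty URL. The sketch below uses only java.util.regex to illustrate how a narrower pattern would behave. The class name UrlPatternDemo, the matching helper and the sample URLs are hypothetical, and the sketch assumes a full match against the whole URL; it does not reproduce how Serritor applies the pattern internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class UrlPatternDemo {

    // Returns the URLs that the pattern matches in full (an assumption of this sketch)
    static List<String> matching(final Pattern pattern, final List<String> urls) {
        return urls.stream()
                .filter(url -> pattern.matcher(url).matches())
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // Hypothetical sample URLs
        List<String> urls = Arrays.asList(
                "http://example.com/",
                "http://example.com/docs/intro",
                "http://other.org/docs/intro");

        // ".+" accepts every non-empty URL
        System.out.println(matching(Pattern.compile(".+"), urls));

        // A narrower pattern keeps only URLs under example.com/docs/
        System.out.println(matching(Pattern.compile("https?://example\\.com/docs/.*"), urls));
    }
}
```

Trying a candidate pattern against a handful of known URLs like this is a cheap way to check it before starting a crawl.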

By default, the crawler uses the HtmlUnit headless browser:

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFiltering(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start();

Of course, you can also use any other browser by specifying a corresponding WebDriver instance:

// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
        .setOffsiteRequestFiltering(true)
        .addAllowedCrawlDomain("example.com")
        .addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
        .build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start(new ChromeDriver());

That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the WebDriver instance, so you can use all the features that are provided by Selenium.
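
To make the idea of filtering duplicate and offsite requests concrete, here is a minimal, framework-independent sketch using only the JDK. It is not Serritor's implementation: the class RequestFilterDemo and its accept method are hypothetical, and the sketch assumes an exact host comparison and exact URL string equality, which is simpler than what a real crawler would likely do.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RequestFilterDemo {

    private final Set<String> allowedDomains = new HashSet<>();
    private final Set<String> seenUrls = new HashSet<>();

    RequestFilterDemo(final String... domains) {
        allowedDomains.addAll(Arrays.asList(domains));
    }

    // Accepts a URL only if its host is allowed and the URL has not been seen before
    boolean accept(final String url) {
        String host = URI.create(url).getHost();
        if (host == null || !allowedDomains.contains(host)) {
            return false; // offsite request (exact host match is an assumption)
        }
        return seenUrls.add(url); // add() returns false for a duplicate
    }

    public static void main(final String[] args) {
        RequestFilterDemo filter = new RequestFilterDemo("example.com");

        // Hypothetical candidate URLs: one duplicate and one offsite link
        List<String> candidates = Arrays.asList(
                "http://example.com/a",
                "http://example.com/a",
                "http://other.org/b",
                "http://example.com/c");

        List<String> accepted = new ArrayList<>();
        for (String url : candidates) {
            if (filter.accept(url)) {
                accepted.add(url);
            }
        }
        System.out.println(accepted);
    }
}
```

In the crawler itself this bookkeeping is handled for you once setOffsiteRequestFiltering(true) and addAllowedCrawlDomain are set in the configuration, as in the examples above.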

License

The source code of Serritor is made available under the Apache License, Version 2.0.
