Skip to content

Commit

Permalink
Merge pull request #42 from peterbencze/development
Browse files Browse the repository at this point in the history
Serritor 1.4.0
  • Loading branch information
peterbencze authored Jun 23, 2018
2 parents a91dcf5 + 2264bdd commit 3061e63
Show file tree
Hide file tree
Showing 32 changed files with 1,405 additions and 1,159 deletions.
94 changes: 47 additions & 47 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,29 +1,39 @@
Serritor
========

Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/) and written in Java. Crawling dynamic web pages is no longer a problem!
Serritor is an open source web crawler framework built upon [Selenium](http://www.seleniumhq.org/) and written in Java. It can be used to crawl dynamic web pages that use JavaScript.

## Installation
### Using Maven
## Using Serritor in your build
### Maven

Add the following dependency to your pom.xml:
```xml
<dependency>
<groupId>com.github.peterbencze</groupId>
<artifactId>serritor</artifactId>
<version>1.3.1</version>
<version>1.4.0</version>
</dependency>
```

### Without Maven
### Gradle

Add the following dependency to your build.gradle:
```groovy
compile group: 'com.github.peterbencze', name: 'serritor', version: '1.4.0'
```

### Manual dependencies

The standalone JAR files are available on the [releases](https://github.com/peterbencze/serritor/releases) page.

## Documentation
See the [Wiki](https://github.com/peterbencze/serritor/wiki) page.
* The [Wiki](https://github.com/peterbencze/serritor/wiki) contains usage information and examples
* The Javadoc is available [here](https://peterbencze.github.io/serritor/)

## Quickstart
_BaseCrawler_ provides a skeletal implementation of a crawler to minimize the effort to create your own. First, create a class that extends _BaseCrawler_. In this class, you can implement the behavior of your crawler. There are callbacks available for every stage of crawling. Below you can find an example:
The `BaseCrawler` abstract class provides a skeletal implementation of a crawler to minimize the effort to create your own. The extending class should define the logic of the crawler.

Below you can find a simple example that is enough to get you started:
```java
public class MyCrawler extends BaseCrawler {

Expand All @@ -37,60 +47,50 @@ public class MyCrawler extends BaseCrawler {
}

@Override
protected void onResponseComplete(final HtmlResponse response) {
protected void onPageLoad(final PageLoadEvent event) {
// Crawl every URL that match the given pattern
urlFinder.findUrlsInResponse(response)
urlFinder.findUrlsInPage(event)
.stream()
.map(CrawlRequestBuilder::new)
.map(CrawlRequestBuilder::build)
.forEach(this::crawl);
}

@Override
protected void onNonHtmlResponse(final NonHtmlResponse response) {
System.out.println("Received a non-HTML response from: " + response.getCrawlRequest().getRequestUrl());
}

@Override
protected void onUnsuccessfulRequest(final UnsuccessfulRequest request) {
System.out.println("Could not get response from: " + request.getCrawlRequest().getRequestUrl());

// ...
}
}
```
By default, the crawler uses [HtmlUnit headless browser](http://htmlunit.sourceforge.net/):
```java
public static void main(String[] args) {
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder().setOffsiteRequestFiltering(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start();
}
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
.setOffsiteRequestFiltering(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start();
```
Of course, you can also use any other browsers by specifying a corresponding _WebDriver_ instance:
Of course, you can also use any other browsers by specifying a corresponding `WebDriver` instance:
```java
public static void main(String[] args) {
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder().setOffsiteRequestFiltering(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start(new ChromeDriver());
}
// Create the configuration
CrawlerConfiguration config = new CrawlerConfigurationBuilder()
.setOffsiteRequestFiltering(true)
.addAllowedCrawlDomain("example.com")
.addCrawlSeed(new CrawlRequestBuilder("http://example.com").build())
.build();

// Create the crawler using the configuration above
MyCrawler crawler = new MyCrawler(config);

// Start it
crawler.start(new ChromeDriver());
```

That's it! In just a few lines you can make a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the _WebDriver_ instance, so you can use all the features that are provided by Selenium.
That's it! In just a few lines you can create a crawler that crawls every link it finds, while filtering duplicate and offsite requests. You also get access to the `WebDriver` instance, so you can use all the features that are provided by Selenium.
## License
The source code of Serritor is made available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
255 changes: 255 additions & 0 deletions checkstyle.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,255 @@
<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
"-//Checkstyle//DTD Checkstyle Configuration 1.3//EN"
"http://checkstyle.sourceforge.net/dtds/configuration_1_3.dtd">

<!--
Checkstyle configuration that checks the Google coding conventions from Google Java Style
that can be found at https://google.github.io/styleguide/javaguide.html.
Checkstyle is very configurable. Be sure to read the documentation at
http://checkstyle.sf.net (or in your downloaded distribution).
To completely disable a check, just comment it out or delete it from the file.
Authors: Max Vetrenko, Ruslan Diachenko, Roman Ivanov.
-->

<module name = "Checker">
<property name="charset" value="UTF-8"/>

<property name="severity" value="warning"/>

<property name="fileExtensions" value="java, properties, xml"/>
<!-- Checks for whitespace -->
<!-- See http://checkstyle.sf.net/config_whitespace.html -->
<module name="FileTabCharacter">
<property name="eachLine" value="true"/>
</module>

<module name="SuppressWarningsFilter" />

<module name="TreeWalker">
<module name="SuppressWarningsHolder" />
<module name="OuterTypeFilename"/>
<module name="IllegalTokenText">
<property name="tokens" value="STRING_LITERAL, CHAR_LITERAL"/>
<property name="format"
value="\\u00(09|0(a|A)|0(c|C)|0(d|D)|22|27|5(C|c))|\\(0(10|11|12|14|15|42|47)|134)"/>
<property name="message"
value="Consider using special escape sequence instead of octal value or Unicode escaped value."/>
</module>
<module name="AvoidEscapedUnicodeCharacters">
<property name="allowEscapesForControlCharacters" value="true"/>
<property name="allowByTailComment" value="true"/>
<property name="allowNonPrintableEscapes" value="true"/>
</module>
<module name="LineLength">
<property name="max" value="100"/>
<property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
</module>
<module name="AvoidStarImport"/>
<module name="OneTopLevelClass"/>
<module name="NoLineWrap"/>
<module name="EmptyBlock">
<property name="option" value="TEXT"/>
<property name="tokens"
value="LITERAL_TRY, LITERAL_FINALLY, LITERAL_IF, LITERAL_ELSE, LITERAL_SWITCH"/>
</module>
<module name="NeedBraces"/>
<module name="LeftCurly"/>
<module name="RightCurly">
<property name="id" value="RightCurlySame"/>
<property name="tokens"
value="LITERAL_TRY, LITERAL_CATCH, LITERAL_FINALLY, LITERAL_IF, LITERAL_ELSE,
LITERAL_DO"/>
</module>
<module name="RightCurly">
<property name="id" value="RightCurlyAlone"/>
<property name="option" value="alone"/>
<property name="tokens"
value="CLASS_DEF, METHOD_DEF, CTOR_DEF, LITERAL_FOR, LITERAL_WHILE, STATIC_INIT,
INSTANCE_INIT"/>
</module>
<module name="WhitespaceAround">
<property name="allowEmptyConstructors" value="true"/>
<property name="allowEmptyMethods" value="true"/>
<property name="allowEmptyTypes" value="true"/>
<property name="allowEmptyLoops" value="true"/>
<message key="ws.notFollowed"
value="WhitespaceAround: ''{0}'' is not followed by whitespace. Empty blocks may only be represented as '{}' when not part of a multi-block statement (4.1.3)"/>
<message key="ws.notPreceded"
value="WhitespaceAround: ''{0}'' is not preceded with whitespace."/>
</module>
<module name="OneStatementPerLine"/>
<module name="MultipleVariableDeclarations"/>
<module name="ArrayTypeStyle"/>
<module name="MissingSwitchDefault"/>
<module name="FallThrough"/>
<module name="UpperEll"/>
<module name="ModifierOrder"/>
<module name="EmptyLineSeparator">
<property name="allowNoEmptyLineBetweenFields" value="true"/>
</module>
<module name="SeparatorWrap">
<property name="id" value="SeparatorWrapDot"/>
<property name="tokens" value="DOT"/>
<property name="option" value="nl"/>
</module>
<module name="SeparatorWrap">
<property name="id" value="SeparatorWrapComma"/>
<property name="tokens" value="COMMA"/>
<property name="option" value="EOL"/>
</module>
<module name="SeparatorWrap">
<!-- ELLIPSIS is EOL until https://github.com/google/styleguide/issues/258 -->
<property name="id" value="SeparatorWrapEllipsis"/>
<property name="tokens" value="ELLIPSIS"/>
<property name="option" value="EOL"/>
</module>
<module name="SeparatorWrap">
<!-- ARRAY_DECLARATOR is EOL until https://github.com/google/styleguide/issues/259 -->
<property name="id" value="SeparatorWrapArrayDeclarator"/>
<property name="tokens" value="ARRAY_DECLARATOR"/>
<property name="option" value="EOL"/>
</module>
<module name="SeparatorWrap">
<property name="id" value="SeparatorWrapMethodRef"/>
<property name="tokens" value="METHOD_REF"/>
<property name="option" value="nl"/>
</module>
<module name="PackageName">
<property name="format" value="^[a-z]+(\.[a-z][a-z0-9]*)*$"/>
<message key="name.invalidPattern"
value="Package name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="TypeName">
<message key="name.invalidPattern"
value="Type name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="MemberName">
<property name="format" value="^[a-z][a-z0-9][a-zA-Z0-9]*$"/>
<message key="name.invalidPattern"
value="Member name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="ParameterName">
<property name="format" value="^[a-z]([a-z0-9][a-zA-Z0-9]*)?$"/>
<message key="name.invalidPattern"
value="Parameter name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="CatchParameterName">
<property name="format" value="^[a-z]([a-z0-9][a-zA-Z0-9]*)?$"/>
<message key="name.invalidPattern"
value="Catch parameter name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="LocalVariableName">
<property name="tokens" value="VARIABLE_DEF"/>
<property name="format" value="^[a-z]([a-z0-9][a-zA-Z0-9]*)?$"/>
<message key="name.invalidPattern"
value="Local variable name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="ClassTypeParameterName">
<property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
<message key="name.invalidPattern"
value="Class type name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="MethodTypeParameterName">
<property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
<message key="name.invalidPattern"
value="Method type name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="InterfaceTypeParameterName">
<property name="format" value="(^[A-Z][0-9]?)$|([A-Z][a-zA-Z0-9]*[T]$)"/>
<message key="name.invalidPattern"
value="Interface type name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="NoFinalizer"/>
<module name="GenericWhitespace">
<message key="ws.followed"
value="GenericWhitespace ''{0}'' is followed by whitespace."/>
<message key="ws.preceded"
value="GenericWhitespace ''{0}'' is preceded with whitespace."/>
<message key="ws.illegalFollow"
value="GenericWhitespace ''{0}'' should followed by whitespace."/>
<message key="ws.notPreceded"
value="GenericWhitespace ''{0}'' is not preceded with whitespace."/>
</module>
<!--
<module name="Indentation">
<property name="basicOffset" value="2"/>
<property name="braceAdjustment" value="0"/>
<property name="caseIndent" value="2"/>
<property name="throwsIndent" value="4"/>
<property name="lineWrappingIndentation" value="4"/>
<property name="arrayInitIndent" value="2"/>
</module>
-->
<module name="AbbreviationAsWordInName">
<property name="ignoreFinal" value="false"/>
<property name="allowedAbbreviationLength" value="1"/>
</module>
<module name="OverloadMethodsDeclarationOrder"/>
<module name="VariableDeclarationUsageDistance"/>
<module name="CustomImportOrder">
<property name="sortImportsInGroupAlphabetically" value="true"/>
<property name="separateLineBetweenGroups" value="true"/>
<property name="customImportOrderRules" value="STATIC###THIRD_PARTY_PACKAGE"/>
</module>
<module name="MethodParamPad"/>
<module name="NoWhitespaceBefore">
<property name="tokens"
value="COMMA, SEMI, POST_INC, POST_DEC, DOT, ELLIPSIS, METHOD_REF"/>
<property name="allowLineBreaks" value="true"/>
</module>
<module name="ParenPad"/>
<module name="OperatorWrap">
<property name="option" value="NL"/>
<property name="tokens"
value="BAND, BOR, BSR, BXOR, DIV, EQUAL, GE, GT, LAND, LE, LITERAL_INSTANCEOF, LOR,
LT, MINUS, MOD, NOT_EQUAL, PLUS, QUESTION, SL, SR, STAR, METHOD_REF "/>
</module>
<module name="AnnotationLocation">
<property name="id" value="AnnotationLocationMostCases"/>
<property name="tokens"
value="CLASS_DEF, INTERFACE_DEF, ENUM_DEF, METHOD_DEF, CTOR_DEF"/>
</module>
<module name="AnnotationLocation">
<property name="id" value="AnnotationLocationVariables"/>
<property name="tokens" value="VARIABLE_DEF"/>
<property name="allowSamelineMultipleAnnotations" value="true"/>
</module>
<module name="NonEmptyAtclauseDescription"/>
<module name="JavadocTagContinuationIndentation"/>
<module name="SummaryJavadoc">
<property name="forbiddenSummaryFragments"
value="^@return the *|^This method returns |^A [{]@code [a-zA-Z0-9]+[}]( is a )"/>
</module>
<module name="JavadocParagraph"/>
<module name="AtclauseOrder">
<property name="tagOrder" value="@param, @return, @throws, @deprecated"/>
<property name="target"
value="CLASS_DEF, INTERFACE_DEF, ENUM_DEF, METHOD_DEF, CTOR_DEF, VARIABLE_DEF"/>
</module>
<module name="JavadocMethod">
<property name="scope" value="public"/>
<property name="allowMissingParamTags" value="true"/>
<property name="allowMissingThrowsTags" value="true"/>
<property name="allowMissingReturnTag" value="true"/>
<property name="minLineCount" value="2"/>
<property name="allowedAnnotations" value="Override, Test"/>
<property name="allowThrowsTagsForSubclasses" value="true"/>
</module>
<module name="MethodName">
<property name="format" value="^[a-z][a-z0-9][a-zA-Z0-9_]*$"/>
<message key="name.invalidPattern"
value="Method name ''{0}'' must match pattern ''{1}''."/>
</module>
<module name="SingleLineJavadoc">
<property name="ignoreInlineTags" value="false"/>
</module>
<module name="EmptyCatchBlock">
<property name="exceptionVariableName" value="expected"/>
</module>
<module name="CommentsIndentation"/>
</module>
</module>
Loading

0 comments on commit 3061e63

Please sign in to comment.