ParseFilters
ParseFilters are called from parsing bolts such as JSoupParserBolt and SiteMapParserBolt to extract data from web pages. The data extracted are stored in the Metadata object. ParseFilters can also modify the Outlinks and in that sense act as URLFilters.
ParseFilters need to implement the interface ParseFilter, which has three methods:
public void filter(String URL, byte[] content, DocumentFragment doc,
ParseResult parse);
public void configure(Map stormConf, JsonNode filterParams);
public boolean needsDOM();
The filter method is where the extraction takes place. ParseResult objects contain the outlinks extracted from the document as well as a Map of String to ParseData, where the String is the URL of a subdocument obtained from the main one (or of the main document itself). The ParseData objects contain the Metadata, binary content and text for the subdocuments. This is useful for indexing subdocuments independently of the main document.
The needsDOM method simply indicates whether a given ParseFilter instance needs the DOM structure. If no ParseFilter needs it, the parsing bolt will not generate the DOM, which can slightly improve performance.
The configure method receives the parameters of the filter as a JsonNode, taken from a JSON file which is loaded by the wrapper class ParseFilters. The config map from Storm can also be used to configure the filters, as explained in Configuration.
Here is the default JSON configuration file for ParseFilters.
The JSON configuration allows several instances of the same filtering class to be loaded with different parameters, and it allows complex configuration objects since it makes no assumptions about the content of the field 'param'. The ParseFilters are executed in the order in which they are defined in the JSON file.
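The behaviour described above (filters run in their declared order, the same class can be instantiated more than once with different parameters, and the DOM is only needed if at least one filter asks for it) can be illustrated with a small stand-alone sketch. It does not use the StormCrawler classes: SketchFilter, FilterChainSketch and the tagger helper are hypothetical, simplified stand-ins, and the real filter method also receives the content, the DOM and the ParseResult.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for the ParseFilter interface.
interface SketchFilter {
    void filter(String url, Map<String, List<String>> metadata);

    boolean needsDOM();
}

// Hypothetical stand-in for the wrapper class that holds the configured filters.
final class FilterChainSketch {
    private final List<SketchFilter> filters = new ArrayList<>();

    // Filters are kept in the order in which they are added (i.e. declared).
    FilterChainSketch add(SketchFilter f) {
        filters.add(f);
        return this;
    }

    // The chain needs the DOM as soon as one of its filters needs it.
    boolean needsDOM() {
        for (SketchFilter f : filters) {
            if (f.needsDOM()) {
                return true;
            }
        }
        return false;
    }

    // Run every filter, in declaration order, against the same metadata.
    void filter(String url, Map<String, List<String>> metadata) {
        for (SketchFilter f : filters) {
            f.filter(url, metadata);
        }
    }

    // Illustrative filter: one class, a different parameter set per instance.
    static SketchFilter tagger(String key, String value, boolean wantsDom) {
        return new SketchFilter() {
            @Override
            public void filter(String url, Map<String, List<String>> md) {
                md.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }

            @Override
            public boolean needsDOM() {
                return wantsDom;
            }
        };
    }

    public static void main(String[] args) {
        FilterChainSketch chain = new FilterChainSketch()
                .add(tagger("step", "first", false))
                .add(tagger("step", "second", false));
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        chain.filter("http://example.com/", metadata);
        System.out.println(metadata + " needsDOM=" + chain.needsDOM());
    }
}
```

Because the two tagger instances write to the same key, the order of the values in the metadata reflects the order in which the filters were declared.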
The following ParseFilters are provided as part of the core code:
CollectionTagger: assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file.
The resource file specifies regular expressions for inclusions and, optionally, for exclusions, e.g.
{
"collections": [{
"name": "stormcrawler",
"includePatterns": ["http://stormcrawler.net/.+"]
},
{
"name": "crawler",
"includePatterns": [".+crawler.+", ".+nutch.+"],
"excludePatterns": [".+baby.+", ".+spider.+"]
}
]
}
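The matching implied by the resource file above can be sketched in plain Java. This is only an illustration, not the CollectionTagger code: the class CollectionSketch and its methods are hypothetical, and it assumes that a pattern must match the whole URL and that an exclusion always overrides an inclusion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of one entry of the "collections" array.
final class CollectionSketch {
    final String name;
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    CollectionSketch(String name, List<String> includePatterns, List<String> excludePatterns) {
        this.name = name;
        for (String p : includePatterns) {
            includes.add(Pattern.compile(p));
        }
        for (String p : excludePatterns) {
            excludes.add(Pattern.compile(p));
        }
    }

    // Assumption: whole-URL match, and exclusions take precedence over inclusions.
    boolean matches(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : includes) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // The two collections from the JSON example, built by hand.
    static List<CollectionSketch> example() {
        return List.of(
                new CollectionSketch("stormcrawler",
                        List.of("http://stormcrawler.net/.+"), List.of()),
                new CollectionSketch("crawler",
                        List.of(".+crawler.+", ".+nutch.+"),
                        List.of(".+baby.+", ".+spider.+")));
    }

    // Returns the names of all collections whose patterns accept the URL.
    static List<String> tag(String url, List<CollectionSketch> collections) {
        List<String> tags = new ArrayList<>();
        for (CollectionSketch c : collections) {
            if (c.matches(url)) {
                tags.add(c.name);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(tag("http://stormcrawler.net/faq", example()));
    }
}
```

Under these assumptions a URL can receive several tags, which is consistent with the statement that the tagger assigns one or more tags to a document.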
CommaSeparatedToMultivaluedMetadata: rewrites a single metadata value containing comma-separated values into multiple values for the same key, which is useful for instance for keyword tags.
DebugParseFilter: dumps an XML representation of the DOM structure to a temporary file.
DomainParseFilter: stores the domain or host name in the metadata for indexing later on.
LDJsonParseFilter: extracts data from JSON-LD representations.
LinkParseFilter: can be used to extract outlinks from documents using XPath expressions defined in the config.
MD5SignatureParseFilter: generates an MD5 signature of a document based on the binary content, text or URL (as a last resort). It can be used in combination with the ContentFilter so that the text used for the signature excludes any boilerplate.
MimeTypeNormalization: converts the mime-type values (returned by the servers or guessed based on the content) into human-readable values such as pdf, html or image and stores them in the metadata. This can be used during indexing to generate a field used for filtering search results.
XPathFilter: lets you define XPath expressions to extract data from the page and store it in the Metadata object.
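To illustrate the idea behind XPath-based extraction, here is a minimal sketch that relies only on the XPath support of the JDK. It is not the XPathFilter implementation: the class XPathExtractionSketch, its extract method and the use of a plain Map in place of the Metadata object are assumptions made for the example, and the input is assumed to be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

final class XPathExtractionSketch {

    // Evaluates each expression against the document and stores the text of
    // every matching node under the expression's key (multi-valued).
    static Map<String, List<String>> extract(String xml, Map<String, String> expressions) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> e : expressions.entrySet()) {
                NodeList nodes = (NodeList) xpath.evaluate(
                        e.getValue(), doc, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    String value = nodes.item(i).getTextContent().trim();
                    if (!value.isEmpty()) {
                        metadata.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                                .add(value);
                    }
                }
            }
        } catch (Exception ex) {
            // Parsing or XPath errors are surfaced as an unchecked exception.
            throw new IllegalArgumentException("Could not evaluate XPath expressions", ex);
        }
        return metadata;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello</title></head>"
                + "<body><h1>One</h1><h1>Two</h1></body></html>";
        System.out.println(extract(page, Map.of("title", "//title", "headings", "//h1")));
    }
}
```

Each key maps to a list because an expression can match several nodes, which mirrors the multi-valued nature of the metadata.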