- Nutch is a production-ready web crawler that works in tandem with Apache Solr
Springshare's search API does search the full text, but it does not return the matching full-text snippet. The snippet provides more context for the matched results, and this is a feature requested by the Reference & Discovery Services team.
- Docker is a containerization tool used for developing, packaging and running applications
- Install Docker on the host machine with the instructions outlined here
- Docker Swarm is a container orchestration tool that manages services across one or more nodes
- A Docker host can be a manager node, a worker node, or both in some cases
- Initialize a swarm of one node (where the manager and the worker are the same node)
docker swarm init
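- Optionally, verify that the swarm is up and the node is a manager using
docker node ls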
- Build the custom nutch image using
docker build ./ -t nutch:msul --no-cache
- Build the custom solr image using
docker build ./solr -t solr:msul --no-cache
- Deploy the Apache Nutch Application using
docker stack deploy -c docker-compose.yml nutch-tutorial
- Stop the Apache Nutch Docker stack using
docker stack rm nutch-tutorial
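- Optionally, confirm the services in the stack are running using
docker stack services nutch-tutorial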
- Apache Nutch is a production-ready web crawler used to fetch, parse, and index website data into Apache Solr
- Lifecycle of Apache Nutch crawl
Source: https://medium.com/@mobomo/the-basics-working-with-nutch-e5a7d37af231
- Seed: List of URLs that are ready to be fetched and indexed
- Inject: Reads the list of URLs from the seed file and adds them to the list of pages to be crawled. This list is updated with additional metadata in the later steps of the lifecycle.
- Generate: Reads the list of injected URLs and creates segments based on the eligibility of a page/URL.
- Segment: A partition created during the crawl, which contains the fetch list, content, and parsed content. The content in a segment varies at different stages of the lifecycle.
- Fetch: Reads the fetch list from the segments and requests the content for each of the URLs in the list.
- Parse: Processes the fetched content and primes it for indexing into Solr. This step includes identifying the title, page URL, and page body from the fetched content.
- Index: Indexes data into a Lucene-based search index.
- Lucene: A Java-based full-text indexing and search library.
- Solr: A wrapper around Lucene that provides a GUI and adds the ability to configure indexing and searching.
- Hadoop Map Files: A directory containing two files, data and index. The data file consists of key/value pairs and the index file contains a fraction of the keys. The index file is loaded into memory for quick lookups.
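- As a rough illustration, a segment's content is stored as map files; the segment timestamp and the part-* directory name below are illustrative and vary by Nutch/Hadoop version
crawl/segments/20240101123456/content/part-00000/data
crawl/segments/20240101123456/content/part-00000/index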
- TODO - Move these steps into the Dockerfile
- Navigate to /root/nutch_source/runtime/local/conf and update the following files
- Replace the contents of the file nutch-site.xml with:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
- Navigate to /root/nutch_source/runtime/local/conf/index-writers.xml and update the url param:
<param name="url" value="http://solr:8983/solr/nutch"/>
- Enter the Nutch Docker container using
docker exec -it `docker ps -aqf name='nutch-tutorial_nutch'` bash
- A text file containing a list of URLs, with one URL per line.
- The URLs can contain optional metadata separated by tabs, represented in key=value format
https://libguides.lib.msu.edu/ISB202Taylor
https://libguides.lib.msu.edu/ISB202Taylor nutch.score=10 nutch.fetchInterval=2592000
- A seeds.txt file is required for this step. This is the starting point for Nutch to crawl the URLs.
- Inject URLs into the db using
nutch inject crawl/crawldb urls
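- To confirm the injection, the crawldb statistics (URL counts by status) can be printed using
nutch readdb crawl/crawldb -stats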
- Extracts the URLs from the seeds file and queues them for fetching.
- After the first crawl, Nutch will generate the URLs from hyperlinks on the parsed pages.
- Generate a fetch list from the database
nutch generate <crawldb> <segments_dir> [-force] [-adddays numDays]
# Example
nutch generate crawl/crawldb crawl/segments
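- Each generate run creates a new timestamped directory under crawl/segments, and the following steps need its path. A small shell sketch to capture the newest segment (the variable name s1 is illustrative):
s1=$(ls -d crawl/segments/2* | sort | tail -1)
echo $s1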
- Crawls the URLs from the generate step and fetches the content.
- Fetch URLs using
nutch fetch [-D...] <segment>
# Example
nutch fetch path/to/the/segment
- The number of threads used for fetching can be passed as an argument.
- Increasing the thread count could make the fetch faster but may overwhelm the server
nutch fetch path/to/the/segment -threads 50
- Organizes the data scraped by the fetcher.
- Parse the fetched entries using
nutch parse [-D...] <segment>
# Example
nutch parse path/to/the/segment
- Read the parsed segments. This gives the list of URLs identified during the initial crawl
nutch readseg -dump crawl/segments/{segment_file} outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext
# options for reading segments
-nocontent: Pass this to ignore the content directory.
-nofetch: To ignore the crawl_fetch directory.
-nogenerate: To ignore the crawl_generate directory.
-noparse: To ignore the crawl_parse directory.
-noparsedata: To ignore the parse_data directory.
-noparsetext: To ignore the parse_text directory.
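- For example, to dump only the parsed page text (the output directory name parsed_text is illustrative)
nutch readseg -dump crawl/segments/{segment_file} parsed_text -nocontent -nofetch -nogenerate -noparse -noparsedata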
- Marks the URLs for future generate steps.
- Update the db with additional metadata after the fetch
nutch updatedb crawl/crawldb path/to/the/segment
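- The index command below passes -linkdb crawl/linkdb, which none of the steps above create; it can be built from the segments using the standard invertlinks step (or the -linkdb argument can be dropped)
nutch invertlinks crawl/linkdb -dir crawl/segments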
- Index data into Solr
nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segment_file} -filter -deleteGone
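- To spot-check the indexed documents, query the nutch core from inside either container (assumes the core name and Solr URL configured above, and that curl is available)
curl "http://solr:8983/solr/nutch/select?q=*:*&rows=1"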
Solr is initialized as a part of this stack and a nutch core is created during container startup. Nutch provides us with a baseline schema for the core, which needs to be updated in Solr. Additional configurations can be added to the schema based on requirements.
- Copy the Nutch schema into the newly created nutch core inside the Docker container
- Exec into the container using
docker exec -it `docker ps -aqf name='nutch-tutorial_solr'` bash
Schema location: /solr/managed-schema
Container location: /var/solr/data/nutch/conf
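- A minimal sketch of the copy, assuming the two locations above; reloading the core afterwards makes Solr pick up the schema
cp /solr/managed-schema /var/solr/data/nutch/conf/managed-schema
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nutch"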
- Solr is available inside the container on port 8983 and can be accessed by other services in the stack using http://solr:8983
- Solr is available on the host machine on port 80 and can be accessed using http://localhost:80