Incomplete commit to Elasticsearch #3

Open
wolverline opened this issue Feb 7, 2018 · 1 comment

@wolverline

@essiembre Hi Pascal, I have run into a couple of issues with the Elasticsearch Committer. First, the commit count looks way off, as seen in the log segment below. The site was crawled previously and committed to the file system. Later I added the Elasticsearch Committer, removed the output folder, and ran the crawl again.

  • What is the relation between the reference count and the actual commit count? Shouldn't they match?
  • It seems the previous crawl was cached somewhere. How can I clean it up?

big-site: 2018-02-07 03:39:29 INFO - big-site: Crawler finishing: committing documents.
big-site: 2018-02-07 03:39:29 INFO - Committing 181 files
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 31 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Elasticsearch RestClient closed.
big-site: 2018-02-07 03:39:29 INFO - big-site: 10195 reference(s) processed.

The committer config:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>somewhere in the jungle</nodes>
  <indexName>big-site-index</indexName>
  <queueDir>$workdir/commit</queueDir>
  <connectionTimeout>5 minutes</connectionTimeout>
  <socketTimeout>5 minutes</socketTimeout>
  <typeName>Documents</typeName>
  <commitBatchSize>50</commitBatchSize>
  <maxRetries>1</maxRetries>
</committer>
@essiembre
Contributor

The number of "references processed" is the number of URLs/documents discovered and handled, whether they were good (sent to your committer) or bad (e.g. rejected). So the two counts usually do not match, and you normally have many more "processed" references than committed ones. Also, by default, unmodified documents do not trigger a call to your committer.
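
To make the distinction concrete, here is a minimal Python sketch that tallies the three counts from the log lines quoted in the issue. The `summarize` helper and its regular expressions are illustrative assumptions based only on the message wording shown above; they are not part of any Norconex tool, and other versions may word their log messages differently.

```python
import re

# Log lines copied from the issue above (only the lines that carry counts).
LOG = """\
big-site: 2018-02-07 03:39:29 INFO - Committing 181 files
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 31 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - big-site: 10195 reference(s) processed.
"""


def summarize(log_text):
    """Hypothetical helper: sum the counts found in a crawl log excerpt."""

    def total(pattern):
        return sum(int(n) for n in re.findall(pattern, log_text))

    return {
        # Documents queued for the committer.
        "queued": total(r"Committing (\d+) files"),
        # Operations actually sent, batch by batch.
        "sent": total(r"Sending (\d+) commit operations"),
        # Every reference the crawler handled, good or bad.
        "processed": total(r"(\d+) reference\(s\) processed"),
    }


if __name__ == "__main__":
    counts = summarize(LOG)
    print(counts)
    print("not committed:", counts["processed"] - counts["sent"])
```

In this excerpt the four batches (50 + 50 + 50 + 31) add up to the 181 queued files, so the commit itself looks complete. The remaining 10014 of the 10195 processed references were never handed to the committer, which fits the explanation above (rejected or unmodified), although the log excerpt alone does not say which reason applied to each one.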

To start clean and discard previous runs, delete your "workdir". More precisely, delete the "crawlstore" directory before you start your Collector. This is where information about previous crawls is cached.
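
The cleanup can be scripted. The sketch below is a hedged Python example, not a Norconex utility: `clear_crawl_store` is a hypothetical helper, and it assumes the crawl store is a directory literally named "crawlstore" somewhere under your working directory. The exact location depends on your configuration, so check the paths before deleting anything, and run it only while the Collector is stopped.

```python
import shutil
from pathlib import Path


def clear_crawl_store(workdir, store_name="crawlstore"):
    """Hypothetical helper: delete every directory named `store_name` under `workdir`.

    Returns the list of directories that were removed.
    """
    root = Path(workdir)
    removed = []
    # Materialize the matches first so we do not delete while still walking the tree.
    for candidate in sorted(root.rglob(store_name)):
        # A nested match may already be gone if its parent was just removed.
        if candidate.is_dir():
            shutil.rmtree(candidate)
            removed.append(candidate)
    return removed


# Example usage (the path is a placeholder for your own workdir):
# for path in clear_crawl_store("/path/to/workdir"):
#     print("removed", path)
```

Other folders under the working directory, such as the committer queue, are left untouched, so only the cached crawl state is discarded.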
