@essiembre Hi Pascal, I have run into a couple of issues with the Elasticsearch Committer. First off, the commit count looks way off, as seen in the following log segment. The site was crawled previously and committed to the file system. Later I added the Elasticsearch committer and ran the crawl again after removing the output folder.
What is the relation between the reference count and the actual commit count? Shouldn't they match?
It seems the previous crawl was cached somewhere. How can I clean it up?
big-site: 2018-02-07 03:39:29 INFO - big-site: Crawler finishing: committing documents.
big-site: 2018-02-07 03:39:29 INFO - Committing 181 files
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 50 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Sending 31 commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Done sending commit operations to Elasticsearch.
big-site: 2018-02-07 03:39:29 INFO - Elasticsearch RestClient closed.
big-site: 2018-02-07 03:39:29 INFO - big-site: 10195 reference(s) processed.
The committer config:
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
<nodes>somewhere in the jungle</nodes>
<indexName>big-site-index</indexName>
<queueDir>$workdir/commit</queueDir>
<connectionTimeout>5 minutes</connectionTimeout>
<socketTimeout>5 minutes</socketTimeout>
<typeName>Documents</typeName>
<commitBatchSize>50</commitBatchSize>
<maxRetries>1</maxRetries>
</committer>
The number of "references processed" is the number of URLs/documents discovered and handled, whether they were good (sent to your committer) or bad (e.g. rejected). The two counts therefore usually do not match, and you normally have many more "processed" references than committed ones. Also, by default, unmodified documents will not trigger a call to your committer.
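To make the numbers in the log above concrete, here is a small Python sketch of the arithmetic. It assumes the committer simply flushes its queue in chunks of `commitBatchSize`, which is consistent with the log but is not taken from the committer's source; `batch_sizes` is a hypothetical helper used only for illustration.

```python
def batch_sizes(total, batch_size):
    """Split `total` queued documents into consecutive batches of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full, remainder = divmod(total, batch_size)
    # Full batches first, then one smaller trailing batch if anything is left over.
    return [batch_size] * full + ([remainder] if remainder else [])


committed = 181      # "Committing 181 files"
processed = 10195    # "10195 reference(s) processed"

batches = batch_sizes(committed, 50)  # commitBatchSize is 50 in the config
print(batches)                # [50, 50, 50, 31]
print(processed - committed)  # 10014 references were processed without being committed
```

The four batches add up to the 181 committed files, so the committer log is internally consistent. The remaining references are those that were rejected or, on a repeat crawl, found unmodified.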
To start fresh and discard previous runs, delete your "workdir". More precisely, delete the "crawlstore" directory before you start your Collector. This is where information about previous crawls is cached.
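If you want to script that cleanup, a minimal Python sketch could look like the following. It assumes the "crawlstore" directory sits directly under your working directory, which may differ in your setup; `reset_crawl_store` is a hypothetical helper, not part of the Collector.

```python
import shutil
from pathlib import Path


def reset_crawl_store(workdir):
    """Delete the 'crawlstore' directory under `workdir`, if present.

    Returns True if a directory was removed, False if there was nothing to delete.
    Assumption: the crawl store lives at <workdir>/crawlstore.
    """
    crawl_store = Path(workdir) / "crawlstore"
    if crawl_store.is_dir():
        shutil.rmtree(crawl_store)
        return True
    return False
```

Call it with the same path your `$workdir` variable resolves to, and only while the Collector is not running.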