How do I stop the crawler from running? #16

RickyLau · 2017-10-29T04:52:19Z

How do I stop the crawler from running?

medcl · 2017-12-18T01:40:40Z

Working on it.

Jasmi77 · 2018-06-11T20:11:42Z

After setting up GOPA, it's in the Start stage and there is no way to stop it. I'm also not sure whether it's indexing anything.

Jasmi77 · 2018-06-11T22:42:31Z

Hi Medcl, I am so excited to find GOPA, as it seems promising for the internal site search that I am trying to build. However, I can't get it to work. Can you please help?

medcl · 2018-06-11T22:53:12Z

Are you building from source, or did you download the latest release? @Jasmi77

medcl · 2018-06-11T22:57:54Z

@Jasmi77 The master branch is under heavy development. I suggest you download the v0.10 release package (https://github.com/infinitbyte/gopa/releases/tag/v0.10.0) and read this README: https://github.com/infinitbyte/gopa/tree/v0.10.0. Note that this version only supports SQLite as the persistent database (for tasks).

daveX99 · 2018-07-04T22:55:20Z

Hi - I'm new to ElasticSearch and have been experimenting with Gopa. I'm also having trouble understanding how to 'stop' the crawler. I've pointed it at a dev version of our site, and it seems to find just over 100 documents, but continues to generate a lot of tasks. It seems to be continuously crawling the site over and over.

The site is fairly static, so what I would like to do is have Gopa crawl the site once, and then we can re-index as content is updated. Is it possible to configure Gopa to do that? Or to know when it has finished its initial crawl?

medcl · 2018-07-05T00:52:34Z

Hi @daveX99,
I have been a little busy recently (the best excuse I've got :) ). Regarding your questions:

About it generating a lot of tasks: can you check the task API at http://localhost:8001/tasks/ to see what's inside? The crawler automatically follows all the links on your site; you can filter them in this config section: https://github.com/infinitbyte/gopa/blob/master/gopa.yml#L52
About crawling the site only once: this is easy. By default, Gopa periodically checks the site for updates, and you can configure this as well:
https://github.com/infinitbyte/gopa/blob/master/gopa.yml#L183
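
For anyone who prefers a script to a browser, here is a minimal sketch of reading the task API mentioned above. It only assumes the endpoint answers a plain HTTP GET at http://localhost:8001/tasks/; the response format is not described in this thread, so the function returns the raw body without parsing it. The name `fetch_tasks_raw` and its `opener` parameter are made up for this example and are not part of Gopa.

```python
import urllib.request


def fetch_tasks_raw(base_url="http://localhost:8001",
                    opener=urllib.request.urlopen, timeout=5):
    """Return the raw response body of the task API as text.

    `opener` is injectable so the function can be exercised without
    a running Gopa instance (an assumption made for this sketch).
    """
    # Build the endpoint URL quoted in the thread: <base>/tasks/
    url = base_url.rstrip("/") + "/tasks/"
    with opener(url, timeout=timeout) as resp:
        # The payload format is unknown here, so it is not parsed.
        return resp.read().decode("utf-8", errors="replace")
```

With Gopa running locally, `print(fetch_tasks_raw()[:2000])` would show the first part of whatever the endpoint returns, which is usually enough to spot a URL pattern that keeps repeating.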

daveX99 · 2018-07-05T06:38:25Z

@medcl :

I'm sure you are busy, so I appreciate your quick response.

I played a bit with the parameters to limit the URLs and that fixed my problem.

There are some oddities in the links on the site I am indexing, and that was causing a weird recursion in gopa. Once I set the parameters under url_match_rule, must_not, contain to exclude this link, the indexing ran to completion, and all succeeded.
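
To illustrate why an exclusion rule makes the crawl finish, here is a toy sketch in Python. It is not Gopa's code: the `is_allowed` and `crawl` functions, the `limit` safety cap, and the fake `/calendar` site are all assumptions invented for this example. It only models the idea of a "must not contain" substring check applied to every discovered link.

```python
from collections import deque


def is_allowed(url, must_not_contain):
    """Return True when the URL contains none of the excluded substrings."""
    return not any(fragment in url for fragment in must_not_contain)


def crawl(start, links_of, must_not_contain=(), limit=1000):
    """Breadth-first crawl that skips excluded URLs.

    `limit` is a safety cap so the unfiltered case still terminates.
    """
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < limit:
        url = queue.popleft()
        for link in links_of(url):
            if link not in seen and is_allowed(link, must_not_contain):
                seen.add(link)
                queue.append(link)
    return seen


def links_of(url):
    """Toy site: two normal pages plus a calendar that always links onward."""
    if url == "/":
        return ["/about", "/calendar?page=1"]
    if url.startswith("/calendar?page="):
        page = int(url.split("=")[1])
        # Every calendar page links to a brand-new URL, so the crawl never ends.
        return ["/", "/calendar?page=%d" % (page + 1)]
    return ["/"]
```

Calling `crawl("/", links_of, limit=50)` only stops because of the cap, while `crawl("/", links_of, ["/calendar"])` visits just `/` and `/about` and finishes on its own.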

I will probably need to play with the configuration in gopa.yml some more to fine tune the indexing. Is there any documentation on how those keys/values work?

Thanks again,
-dave.

medcl · 2018-07-05T08:58:34Z

Documentation is an open issue. Here are a few tips on the configuration:

There are two runners in the pipeline module: one, called checker, cleans up the URLs; the other fetches resources, parses the content, and saves it to Elasticsearch.
These pipelines are built dynamically from the configuration; a joint is one processing step in the pipeline flow.
Each joint has its own parameters, most of which can be found in the example config.
Regarding indexing speed, you may consider increasing the max_go_routine parameter, which sets how many concurrent tasks run.
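
The idea in the tips above can be sketched in Python as follows. This is a rough model under stated assumptions, not Gopa's implementation (Gopa is written in Go): the joint names (`trim`, `drop_fragment`, `fetch`, `measure`), the `JOINTS` registry, and the `build_pipeline` and `run_tasks` helpers are all hypothetical. The `max_workers` argument plays the role that max_go_routine is described as playing, namely capping how many tasks run at once.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical joints: each takes a context dict and returns it.
def trim(ctx):
    ctx["url"] = ctx["url"].strip()
    return ctx


def drop_fragment(ctx):
    # Remove "#..." so the same page is not queued under several URLs.
    ctx["url"] = ctx["url"].split("#", 1)[0]
    return ctx


def fetch(ctx):
    # Stand-in for a real HTTP fetch; no network access in this sketch.
    ctx["body"] = "<html>content of %s</html>" % ctx["url"]
    return ctx


def measure(ctx):
    ctx["length"] = len(ctx["body"])
    return ctx


JOINTS = {"trim": trim, "drop_fragment": drop_fragment,
          "fetch": fetch, "measure": measure}


def build_pipeline(joint_names):
    """Build a pipeline from a list of joint names, as a config file might."""
    steps = [JOINTS[name] for name in joint_names]

    def run(ctx):
        for step in steps:
            ctx = step(ctx)
        return ctx

    return run


def run_tasks(urls, pipeline, max_workers=4):
    """Run the pipeline over all URLs with at most `max_workers` in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pipeline, [{"url": u} for u in urls]))
```

For example, `run_tasks(urls, build_pipeline(["trim", "drop_fragment", "fetch", "measure"]), max_workers=2)` processes two URLs at a time; raising `max_workers` increases throughput until the target site or the network becomes the bottleneck.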

daveX99 · 2018-07-05T15:29:06Z

@medcl :

I will keep that in mind. I am still learning the basics of how all this fits together. At this point, I am able to index the site with gopa and get the data into elasticsearch. Indexing does not take more than a few minutes now.

If I have further questions, I will open a new issue so that this one is not filled with off-topic questions.

Thanks again for your quick replies!
-dave.
