Distributed Setup #14

Open
divkakwani opened this issue Dec 23, 2019 · 3 comments
Comments

divkakwani commented Dec 23, 2019

Regarding the distributed setup, this is what I propose. For this setup, we will need scrapyd, RabbitMQ, and a distributed file system (HDFS or SeaweedFS).

(1) Adding nodes: on whichever node we want to add, we have to start scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.

(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.

(3) RabbitMQ will be our event messenger. The running crawlers will push their events here (see the sketch after this list).

(4) Then we can run the dashboard on any machine. The dashboard will show the crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls using the scrapyd HTTP API.
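
For (3), here is a minimal sketch of a running crawler pushing an event to RabbitMQ using pika; the broker host, queue name, and event fields below are placeholders we would still need to agree on:

# Minimal sketch: a crawler publishing a stats event to RabbitMQ via pika.
# Assumptions (not decided yet): broker host "rabbitmq.local", queue "crawl_events",
# and the event schema.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.local"))
channel = connection.channel()
channel.queue_declare(queue="crawl_events", durable=True)

event = {"node": "crawler-node-1", "spider": "the_hindu", "type": "stats", "items_scraped": 1200}
channel.basic_publish(
    exchange="",                                        # default exchange routes by queue name
    routing_key="crawl_events",
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),   # make the message persistent
)
connection.close()

The dashboard would consume from the same queue to build the statistics and live-node views mentioned in (4).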

More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir and then initiate the crawl, as sketched below.
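
Here is a rough sketch of that flow from the dashboard's side, assuming scrapyd listens on its default port 6800 on the chosen node. The project name, spider names, and jobdir layout are placeholders, and the jobdir would have to be a path where the DFS is mounted locally on that node:

# Rough sketch: the dashboard starting crawls on a chosen node via scrapyd's HTTP API.
# Placeholders/assumptions: project "newscrawler", one spider per news source,
# and jobdirs living under a locally mounted DFS path.
import requests

node = "10.0.0.12"                                 # <choose node>
sources = ["the_hindu", "times_of_india"]          # <list of news sources>

for source in sources:
    jobdir = f"/mnt/dfs/jobdirs/{source}"          # latest jobdir retrieved from the DFS
    resp = requests.post(
        f"http://{node}:6800/schedule.json",
        data={
            "project": "newscrawler",              # placeholder project name
            "spider": source,
            "setting": f"JOBDIR={jobdir}",         # resume from the persisted crawl state
        },
    )
    print(resp.json())                             # scrapyd returns the job id on success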

Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.

divkakwani commented Dec 26, 2019

In the discussion I had with Gokul, we concluded that we need to first assess the capabilities of GCP.

The scrapy bench command can be used to benchmark the crawler. I ran it on two different machines; here are the results.

Personal Machine:

{'downloader/request_bytes': 112822,
 'downloader/request_count': 293,
 'downloader/request_method_count/GET': 293,
 'downloader/response_bytes': 723461,
 'downloader/response_count': 293,
 'downloader/response_status_count/200': 293,
 'elapsed_time_seconds': 10.937102,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 23, 50, 682711),
 'log_count/INFO': 20,
 'memusage/max': 54001664,
 'memusage/startup': 54001664,
 'request_depth_max': 12,
 'response_received_count': 293,
 'scheduler/dequeued': 293,
 'scheduler/dequeued/memory': 293,
 'scheduler/enqueued': 5861,
 'scheduler/enqueued/memory': 5861,
 'start_time': datetime.datetime(2019, 12, 26, 13, 23, 39, 745609)}

My Lab Machine:

{'downloader/request_bytes': 274549,
 'downloader/request_count': 599,
 'downloader/request_method_count/GET': 599,
 'downloader/response_bytes': 1915853,
 'downloader/response_count': 599,
 'downloader/response_status_count/200': 599,
 'elapsed_time_seconds': 10.557146,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 28, 32, 395575),
 'log_count/INFO': 20,
 'memusage/max': 53260288,
 'memusage/startup': 53260288,
 'request_depth_max': 22,
 'response_received_count': 599,
 'scheduler/dequeued': 599,
 'scheduler/dequeued/memory': 599,
 'scheduler/enqueued': 11981,
 'scheduler/enqueued/memory': 11981,
 'start_time': datetime.datetime(2019, 12, 26, 13, 28, 21, 838429)}
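
For a rough comparison, that is about 293 / 10.9 ≈ 27 requests/s on my personal machine versus 599 / 10.6 ≈ 57 requests/s on the lab machine.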

@GokulNC I see that you already have a GCP instance running. Can you please post the result of the command for that instance?

I also looked up GCP's pricing: egress is $0.12/GB and ingress is free, so network pricing won't be much of an issue with GCP. However, there is still the cost of running the VMs themselves.
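(To put the egress rate in perspective, pulling 100 GB of crawled data out of GCP would cost roughly 100 × $0.12 = $12.)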

GokulNC commented Dec 26, 2019

Here's my output for scrapy bench on my GCP VM:

{'downloader/request_bytes': 207643,
 'downloader/request_count': 493,
 'downloader/request_method_count/GET': 493,
 'downloader/response_bytes': 1394394,
 'downloader/response_count': 493,
 'downloader/response_status_count/200': 493,
 'elapsed_time_seconds': 10.666416,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 14, 30, 2, 366636),
 'log_count/INFO': 20,
 'memusage/max': 51679232,
 'memusage/startup': 51679232,
 'request_depth_max': 18,
 'response_received_count': 493,
 'scheduler/dequeued': 493,
 'scheduler/dequeued/memory': 493,
 'scheduler/enqueued': 9861,
 'scheduler/enqueued/memory': 9861,
 'start_time': datetime.datetime(2019, 12, 26, 14, 29, 51, 700220)}

And sure, the GCP costs sound fine to me; we'll discuss them during our next call.

GokulNC commented Dec 28, 2019

BTW, the VM above used 4 CPU cores. In GCP it's possible to use a much larger number of cores per VM, so extensively testing those different configurations might be helpful.
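
To make that sweep concrete, here is a tiny sketch; the CONCURRENT_REQUESTS values are arbitrary examples, and in practice we would repeat it on VMs with different core counts:

# Sketch: re-running scrapy bench under a few concurrency settings.
# scrapy accepts -s NAME=VALUE overrides, so each run just changes one setting;
# the values below are examples only.
import subprocess

for concurrency in (16, 32, 64, 128):
    print(f"--- CONCURRENT_REQUESTS = {concurrency} ---")
    subprocess.run(["scrapy", "bench", "-s", f"CONCURRENT_REQUESTS={concurrency}"])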
