Distributed Setup #14

Open
divkakwani opened this issue Dec 23, 2019 · 3 comments
Comments

divkakwani commented Dec 23, 2019

Regarding the distributed setup, this is what I propose. For this setup, we will need scrapyd, RabbitMQ, and a distributed file system (HDFS or SeaweedFS).

(1) Adding nodes: on whichever node we want to add, we have to start scrapyd manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API.

(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.

(3) RabbitMQ will be our event messenger. The running crawlers will push their events here (see the sketch after this list).

(4) Then we can run the dashboard on any machine. The dashboard will show the crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls using the scrapyd HTTP API.
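
For (3), here is a minimal sketch of a running crawler pushing an event to RabbitMQ using pika; the broker host, queue name, and event fields below are placeholders we would still need to agree on:

# Minimal sketch: a crawler publishing a stats event to RabbitMQ via pika.
# Assumptions (not decided yet): broker host "rabbitmq.local", queue "crawl_events",
# and the event schema.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.local"))
channel = connection.channel()
channel.queue_declare(queue="crawl_events", durable=True)

event = {"node": "crawler-node-1", "spider": "the_hindu", "type": "stats", "items_scraped": 1200}
channel.basic_publish(
    exchange="",                                        # default exchange routes by queue name
    routing_key="crawl_events",
    body=json.dumps(event),
    properties=pika.BasicProperties(delivery_mode=2),   # make the message persistent
)
connection.close()

The dashboard would consume from the same queue to build the statistics and live-node views mentioned in (4).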

More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir and then initiate the crawl, as sketched below.
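
Here is a rough sketch of that flow from the dashboard's side, assuming scrapyd listens on its default port 6800 on the chosen node. The project name, spider names, and jobdir layout are placeholders, and the jobdir would have to be a path where the DFS is mounted locally on that node:

# Rough sketch: the dashboard starting crawls on a chosen node via scrapyd's HTTP API.
# Placeholders/assumptions: project "newscrawler", one spider per news source,
# and jobdirs living under a locally mounted DFS path.
import requests

node = "10.0.0.12"                                 # <choose node>
sources = ["the_hindu", "times_of_india"]          # <list of news sources>

for source in sources:
    jobdir = f"/mnt/dfs/jobdirs/{source}"          # latest jobdir retrieved from the DFS
    resp = requests.post(
        f"http://{node}:6800/schedule.json",
        data={
            "project": "newscrawler",              # placeholder project name
            "spider": source,
            "setting": f"JOBDIR={jobdir}",         # resume from the persisted crawl state
        },
    )
    print(resp.json())                             # scrapyd returns the job id on success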

Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.

divkakwani commented Dec 26, 2019

In the discussion I had with Gokul, we concluded that we need to first assess the capabilities of GCP.

The scrapy bench command can be used to benchmark the crawler. I ran it on two different machines; here are the results.

Personal Machine:

{'downloader/request_bytes': 112822,
 'downloader/request_count': 293,
 'downloader/request_method_count/GET': 293,
 'downloader/response_bytes': 723461,
 'downloader/response_count': 293,
 'downloader/response_status_count/200': 293,
 'elapsed_time_seconds': 10.937102,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 23, 50, 682711),
 'log_count/INFO': 20,
 'memusage/max': 54001664,
 'memusage/startup': 54001664,
 'request_depth_max': 12,
 'response_received_count': 293,
 'scheduler/dequeued': 293,
 'scheduler/dequeued/memory': 293,
 'scheduler/enqueued': 5861,
 'scheduler/enqueued/memory': 5861,
 'start_time': datetime.datetime(2019, 12, 26, 13, 23, 39, 745609)}

My Lab Machine:

{'downloader/request_bytes': 274549,
 'downloader/request_count': 599,
 'downloader/request_method_count/GET': 599,
 'downloader/response_bytes': 1915853,
 'downloader/response_count': 599,
 'downloader/response_status_count/200': 599,
 'elapsed_time_seconds': 10.557146,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 13, 28, 32, 395575),
 'log_count/INFO': 20,
 'memusage/max': 53260288,
 'memusage/startup': 53260288,
 'request_depth_max': 22,
 'response_received_count': 599,
 'scheduler/dequeued': 599,
 'scheduler/dequeued/memory': 599,
 'scheduler/enqueued': 11981,
 'scheduler/enqueued/memory': 11981,
 'start_time': datetime.datetime(2019, 12, 26, 13, 28, 21, 838429)}
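
For a rough comparison, that is about 293 / 10.9 ≈ 27 requests/s on my personal machine versus 599 / 10.6 ≈ 57 requests/s on the lab machine.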

@GokulNC I see that you already have a GCP instance running. Can you please post the result of the command for that instance?

I also looked up GCP's pricing: egress is $0.12/GB and ingress is free, so network pricing won't be much of an issue with GCP. However, there is still the cost of running the VMs themselves.
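(To put the egress rate in perspective, pulling 100 GB of crawled data out of GCP would cost roughly 100 × $0.12 = $12.)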

GokulNC commented Dec 26, 2019

Here's my output for scrapy bench on my GCP VM:

{'downloader/request_bytes': 207643,
 'downloader/request_count': 493,
 'downloader/request_method_count/GET': 493,
 'downloader/response_bytes': 1394394,
 'downloader/response_count': 493,
 'downloader/response_status_count/200': 493,
 'elapsed_time_seconds': 10.666416,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 12, 26, 14, 30, 2, 366636),
 'log_count/INFO': 20,
 'memusage/max': 51679232,
 'memusage/startup': 51679232,
 'request_depth_max': 18,
 'response_received_count': 493,
 'scheduler/dequeued': 493,
 'scheduler/dequeued/memory': 493,
 'scheduler/enqueued': 9861,
 'scheduler/enqueued/memory': 9861,
 'start_time': datetime.datetime(2019, 12, 26, 14, 29, 51, 700220)}

And sure, the GCP costs sound fine to me; we'll discuss them during our next call.

GokulNC commented Dec 28, 2019

BTW, the VM above used 4 CPU cores. In GCP it's possible to use a much larger number of cores per VM, so extensively testing those different configurations might be helpful.
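
To make that sweep concrete, here is a tiny sketch; the CONCURRENT_REQUESTS values are arbitrary examples, and in practice we would repeat it on VMs with different core counts:

# Sketch: re-running scrapy bench under a few concurrency settings.
# scrapy accepts -s NAME=VALUE overrides, so each run just changes one setting;
# the values below are examples only.
import subprocess

for concurrency in (16, 32, 64, 128):
    print(f"--- CONCURRENT_REQUESTS = {concurrency} ---")
    subprocess.run(["scrapy", "bench", "-s", f"CONCURRENT_REQUESTS={concurrency}"])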
