Distributed Setup #14
In the discussion I had with Gokul, we concluded that we need to first assess the capabilities of GCP. Running this command on both of my machines gives the following stats.

Personal Machine:
{'downloader/request_bytes': 112822,
'downloader/request_count': 293,
'downloader/request_method_count/GET': 293,
'downloader/response_bytes': 723461,
'downloader/response_count': 293,
'downloader/response_status_count/200': 293,
'elapsed_time_seconds': 10.937102,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2019, 12, 26, 13, 23, 50, 682711),
'log_count/INFO': 20,
'memusage/max': 54001664,
'memusage/startup': 54001664,
'request_depth_max': 12,
'response_received_count': 293,
'scheduler/dequeued': 293,
'scheduler/dequeued/memory': 293,
'scheduler/enqueued': 5861,
'scheduler/enqueued/memory': 5861,
'start_time': datetime.datetime(2019, 12, 26, 13, 23, 39, 745609)}

My Lab Machine:
{'downloader/request_bytes': 274549,
'downloader/request_count': 599,
'downloader/request_method_count/GET': 599,
'downloader/response_bytes': 1915853,
'downloader/response_count': 599,
'downloader/response_status_count/200': 599,
'elapsed_time_seconds': 10.557146,
'finish_reason': 'closespider_timeout',
'finish_time': datetime.datetime(2019, 12, 26, 13, 28, 32, 395575),
'log_count/INFO': 20,
'memusage/max': 53260288,
'memusage/startup': 53260288,
'request_depth_max': 22,
'response_received_count': 599,
'scheduler/dequeued': 599,
'scheduler/dequeued/memory': 599,
'scheduler/enqueued': 11981,
'scheduler/enqueued/memory': 11981,
'start_time': datetime.datetime(2019, 12, 26, 13, 28, 21, 838429)}

@GokulNC I see that you already have a GCP instance running. Can you please post the result of the same command for that instance? I also looked up GCP's pricing: egress is $0.12/GB and ingress is free, so network pricing won't be much of an issue with GCP. However, there is still the cost of running the VM.
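For a quick comparison of the two runs, here's the throughput implied by the dumps above (the counts and elapsed times are copied verbatim from the stats):

```python
# Pages-per-second implied by the two benchmark dumps above.
runs = {
    "Personal Machine": (293, 10.937102),  # response_received_count, elapsed_time_seconds
    "Lab Machine": (599, 10.557146),
}
for name, (responses, elapsed) in runs.items():
    print(f"{name}: {responses / elapsed:.1f} pages/sec")
# Personal Machine: 26.8 pages/sec
# Lab Machine: 56.7 pages/sec  -> roughly 2x the personal machine
```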
Here's my output for the same command on my GCP instance:
And sure, the GCP costs are fine with me; we'll discuss them during our next call.
BTW, the number of CPU cores that I used for the above VM is 4. In GCP, it's possible to use a large number of CPU cores per VM, so extensively testing all those different configurations might be helpful.
Regarding the distributed setup, here is what I propose. For this setup, we will need scrapyd, RabbitMQ, and a distributed file system (HDFS/SeaweedFS):
(1) Adding nodes: for any node we want to add, we will have to run scrapyd on it manually. Once scrapyd is up and running, we can control it through scrapyd's HTTP API (see the scrapyd sketch after this list).
(2) The DFS will hold the jobdirs and the crawled data. The jobdirs will be regularly updated by the nodes.
(3) RabbitMQ will be our event messenger. The running crawlers will push their events to it (see the RabbitMQ sketch after this list).
(4) Then we can run the dashboard on any machine. The dashboard will show crawl statistics obtained through events; it will show a list of live nodes, also obtained through events; and we can start/stop crawls using the scrapyd HTTP API.
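As a rough sketch of (1) and (4), controlling a node through scrapyd's HTTP API could look like this. The node address, project name, and spider name below are placeholders, not actual values from our setup:

```python
import requests

NODE = "http://crawler-node-1:6800"  # hypothetical node; 6800 is scrapyd's default port

# Liveness check -- this could back the dashboard's list of live nodes.
print(requests.get(f"{NODE}/daemonstatus.json").json())
# e.g. {'status': 'ok', 'running': 1, 'pending': 0, 'finished': 3, ...}

# Start a crawl for one news source.
job = requests.post(
    f"{NODE}/schedule.json",
    data={"project": "news_crawler", "spider": "source1"},  # placeholder names
).json()

# Stop it again later via its job id.
requests.post(
    f"{NODE}/cancel.json",
    data={"project": "news_crawler", "job": job["jobid"]},
)
```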
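And for (3), a minimal sketch of a crawler pushing an event to RabbitMQ, assuming the standard pika client; the queue name and event shape are placeholders for illustration:

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq-host"))
channel = connection.channel()
channel.queue_declare(queue="crawl_events", durable=True)  # placeholder queue name

# A crawler could publish something like this whenever its stats change.
event = {"node": "crawler-node-1", "spider": "source1", "responses": 293}
channel.basic_publish(exchange="", routing_key="crawl_events", body=json.dumps(event))
connection.close()
```

The dashboard would then consume from the same queue to update its statistics and live-node list.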
More specifically, the starting-a-crawl operation will look like this:
<choose node> <list of news sources>
The crawler will query the DFS to retrieve the latest jobdir and then initiate the crawl.
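One way to wire the jobdir in, assuming the DFS is mounted locally on each node and using Scrapy's JOBDIR setting for resumable crawl state, is to pass the setting through scrapyd's schedule.json (all names and paths below are placeholders):

```python
import requests

requests.post(
    "http://crawler-node-1:6800/schedule.json",
    data={
        "project": "news_crawler",                     # placeholder project
        "spider": "source1",                           # placeholder spider
        "setting": "JOBDIR=/mnt/dfs/jobdirs/source1",  # resume state from the DFS mount
    },
)
```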
Let's brainstorm over this in the current week and then go ahead with the implementation starting next week.