This repository has been archived by the owner on Sep 28, 2022. It is now read-only.

Stuck on Downloading for a long time #3

Open
samos123 opened this issue May 14, 2013 · 9 comments

Comments

@samos123
Contributor

The crawl has been stuck on downloading for a long time. Could the request have timed out, so that it never continues? Are requests currently not concurrent because of the queue, i.e. is only one request taken out of the queue at a time?

2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)  

Feature description:
Add the ability to spawn multiple webdrivers so that Scrapy requests can be processed concurrently.

This needs an extra option, a maximum number of webdrivers, because the number of instances shouldn't grow indefinitely.
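
To make the requested option concrete, here is a minimal sketch of a bounded pool. It is not scrapy-webdriver code: the class name `WebdriverPool`, the `factory` callable and the `max_instances` parameter are all made up for illustration, and the sketch assumes it is only called from the single-threaded Twisted reactor.

```python
from collections import deque


class WebdriverPool:
    """Hand out at most `max_instances` webdrivers, creating them lazily.

    Hypothetical helper: `factory` is any callable returning a new driver.
    Not thread-safe; assumes calls come from the reactor thread only.
    """

    def __init__(self, factory, max_instances):
        if max_instances < 1:
            raise ValueError("max_instances must be at least 1")
        self._factory = factory
        self._max = max_instances
        self._created = 0
        self._idle = deque()

    def acquire(self):
        """Return an idle driver, spawn one if under the cap, else None."""
        if self._idle:
            return self._idle.popleft()
        if self._created < self._max:
            self._created += 1
            return self._factory()
        return None  # caller should queue the request and retry later

    def release(self, driver):
        """Give a driver back once the spider has finished with its page."""
        self._idle.append(driver)
```

Returning `None` instead of blocking keeps the event loop free; the caller would have to hold the request until a driver is released.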

The reason that it got stuck on downloading is probably because PhantomJS crashed:

[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp

So we may also need a way to check whether PhantomJS is still responding and, if it is not, automatically restart the webdriver/PhantomJS.
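
A rough sketch of such a check, assuming that a crashed PhantomJS makes the next webdriver command raise rather than hang. `RestartingWebdriver` and `factory` are hypothetical names, not part of scrapy-webdriver; the only Selenium members used are the `current_url` property and `quit()`.

```python
class RestartingWebdriver:
    """Wrap a webdriver and recreate it when it stops responding.

    Hypothetical helper: `factory` is any callable returning a new driver,
    for example one that starts PhantomJS through Selenium.
    """

    def __init__(self, factory):
        self._factory = factory
        self.driver = factory()
        self.restarts = 0

    def is_alive(self):
        """Probe the driver with a cheap command; any exception means dead."""
        try:
            self.driver.current_url  # issues a command to the browser
            return True
        except Exception:
            return False

    def ensure_alive(self):
        """Return a working driver, replacing the current one if needed."""
        if self.is_alive():
            return self.driver
        try:
            self.driver.quit()  # best effort; the process may already be gone
        except Exception:
            pass
        self.driver = self._factory()
        self.restarts += 1
        return self.driver
```

Calling `ensure_alive()` before each webdriver request would cover the crash shown above, but not a browser that accepts commands and never finishes loading; that case would need a timeout as well.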

@ncadou
Collaborator

ncadou commented May 14, 2013

Requests are running concurrently in scrapy in the sense that they won't block the main twisted event loop. Stock scrapy requests will therefore go through concurrently even if an unfinished webdriver request is downloading something. However, because all webdriver requests are attached to a specific webdriver instance (which itself needs to enforce sequential access for obvious reasons), and I haven't got around to implementing multiple webdriver instances support yet, in practice only one webdriver request may be performed at a time.

@samos123
Contributor Author

Ah I see, so we basically want multiple webdriver instances as a new feature? I'm probably being stupid, but what are the obvious reasons? Just wondering, I'm pretty new to the webdriver stuff.

Thanks again for your detailed reply. Helps me a lot!


@ncadou
Collaborator

ncadou commented May 14, 2013

You got that exactly right, support for multiple webdriver instances would be a new feature for scrapy-webdriver. And no worries about being stupid, you have no idea how much head-banging my desk had to suffer when I was trying to make sense of twisted and scrapy. :)

As for the obvious reasons, a webdriver instance is basically like a browser with just one tab. So trying to download two things at the same time would not work at all. And then, the state of that browser and its currently loaded page need to be left untouched until the parser method in the scrapy spider has finished working with it.
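
The one-tab constraint can be illustrated with a small gate that lets a single request use the webdriver and queues the rest until the spider callback is done. This is only a sketch of the idea, not the actual scrapy-webdriver implementation; `SingleWebdriverGate`, `acquire` and `release` are names chosen for this example.

```python
from collections import deque


class SingleWebdriverGate:
    """Let one request at a time use the (single-tab) webdriver."""

    def __init__(self):
        self._busy = False
        self._waiting = deque()

    def acquire(self, request):
        """Return the request if the webdriver is free, else queue it."""
        if self._busy:
            self._waiting.append(request)
            return None
        self._busy = True
        return request

    def release(self):
        """Call when the spider callback has finished with the page.

        Returns the next queued request, which now owns the webdriver,
        or None if nothing is waiting and the webdriver becomes free.
        """
        if self._waiting:
            return self._waiting.popleft()
        self._busy = False
        return None
```

Requests are served in arrival order, and the loaded page stays untouched between `acquire` and `release`, which is the property described above.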

@samos123
Contributor Author

OK, I may give this feature a try if you don't mind. It gives me a reason to learn more about Twisted, Scrapy and Selenium. It may take some time though, and I'm not sure I will finish at all, since I've got a lot of other stuff going on.

I'm amazed so few are using this btw.

@ncadou
Collaborator

ncadou commented May 14, 2013

I would certainly not mind contributions. As for the low usage, this project is still very young, so I'm not surprised.

samos123 added a commit to samos123/scrapy-webdriver that referenced this issue May 21, 2013
Fixes brandicted#3 stuck on downloading for a long time
@stringertheory

@ncadou Do you think it would be feasible to allow for parallel scrapy-webdriver requests using multiple tabs or windows in a single webdriver instance instead of extending to multiple webdriver instances (to avoid overhead)?

@ncadou
Collaborator

ncadou commented May 25, 2013

There are ways with webdriver to create tabs and windows, and switch between them, so it should be possible to implement that support in scrapy-webdriver.
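
As a rough illustration of the window-switching approach, the sketch below opens a second window, loads a page in it and switches back. `fetch_in_new_window` is a hypothetical helper; it relies on Selenium's `window_handles`, `current_window_handle`, `execute_script`, `get`, `page_source` and `switch_to.window` (older Selenium releases spell the last one `switch_to_window`). Whether `window.open` yields a usable new window under PhantomJS is an assumption that would need checking.

```python
def fetch_in_new_window(driver, url):
    """Load `url` in a fresh window and return (handle, page source).

    The original window and its loaded page are left untouched, and the
    driver is switched back to it before returning.
    """
    original = driver.current_window_handle
    before = set(driver.window_handles)
    driver.execute_script("window.open('about:blank');")
    new_handles = [h for h in driver.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("the browser did not open a new window")
    handle = new_handles[0]
    driver.switch_to.window(handle)
    try:
        driver.get(url)
        return handle, driver.page_source
    finally:
        # Always return to the window the caller was working in.
        driver.switch_to.window(original)
```

Commands sent to one webdriver session are still issued one after another, so this mainly helps keep several pages alive at once rather than downloading them in parallel.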

@IIIypuk09

@ncadou Could you add a feature to use multiple webdrivers, controlled by one of the following settings?
'CONCURRENT_REQUESTS'
'CONCURRENT_REQUESTS_PER_DOMAIN'
'CONCURRENT_REQUESTS_PER_IP'

How long will we have to wait for this feature?

@ncadou
Collaborator

ncadou commented Jun 12, 2013

@IIIypuk09 multiple webdriver instances are planned down the line, and your suggestion about using settings makes total sense, but unfortunately I don't know when I'll have the opportunity to implement that feature.
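
One way the settings idea could look, purely as a sketch: derive the number of webdriver instances from Scrapy's `CONCURRENT_REQUESTS` (default 16) and clamp it with a separate cap. `WEBDRIVER_MAX_INSTANCES` and `webdriver_instance_count` are hypothetical and do not exist in Scrapy or scrapy-webdriver; `settings` can be any object with a dict-style `get(name, default)`.

```python
def webdriver_instance_count(settings):
    """Pick how many webdriver instances to run from crawler settings."""
    # CONCURRENT_REQUESTS is a real Scrapy setting; 16 is its default.
    concurrent = int(settings.get("CONCURRENT_REQUESTS", 16))
    # Hypothetical setting: browsers are heavy, so cap them separately.
    cap = int(settings.get("WEBDRIVER_MAX_INSTANCES", 4))
    # Never go below one instance, even with odd setting values.
    return max(1, min(concurrent, cap))
```

The separate cap reflects the earlier point in this thread that the number of webdrivers should not grow indefinitely.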

tonal pushed a commit to tonal/scrapy-webdriver that referenced this issue Apr 14, 2017