Stuck on Downloading for a long time #3

samos123 · 2013-05-14T05:54:14Z

I'm currently seeing that it's stuck on downloading for a long time. Could it be that the request timed out, so it won't continue? Are requests currently not concurrent because of the queues, with only one request taken off the queue at a time?

2013-05-14 13:46:23+0800 [scrapy] DEBUG: Downloading http://xxxxl.com/item.html with webdriver
2013-05-14 13:46:32+0800 [xxx] INFO: Crawled 23 pages (at 23 pages/min), scraped 9 items (at 9 items/min)
2013-05-14 13:47:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:48:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:49:32+0800 [xxx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)
2013-05-14 13:50:32+0800 [xx] INFO: Crawled 23 pages (at 0 pages/min), scraped 9 items (at 0 items/min)

Feature description:
Add the ability to spawn multiple webdrivers so we can run Scrapy requests concurrently.

For this we need an extra option, a maximum number of webdrivers, as the number of instances shouldn't grow indefinitely.
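
A minimal sketch of what such a cap could look like, not scrapy-webdriver's actual code. `WebdriverPool`, `factory` and `max_instances` are hypothetical names; `factory` stands for whatever callable starts a new webdriver. The blocking `acquire` only illustrates the cap: inside the Twisted reactor a non-blocking variant would be needed.

```python
import queue
import threading


class WebdriverPool:
    """Hands out at most `max_instances` drivers, creating them lazily."""

    def __init__(self, factory, max_instances=2):
        self._factory = factory      # hypothetical: callable returning a new driver
        self._max = max_instances    # hypothetical option capping the pool size
        self._idle = queue.Queue()   # drivers that are currently free
        self._created = 0
        self._lock = threading.Lock()

    def acquire(self, timeout=None):
        # Reuse an idle driver when one is available.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Otherwise create a new one, as long as the cap is not reached.
        with self._lock:
            if self._created < self._max:
                driver = self._factory()
                self._created += 1
                return driver
        # Cap reached: wait until another caller releases a driver.
        return self._idle.get(timeout=timeout)

    def release(self, driver):
        # Make the driver available to the next caller.
        self._idle.put(driver)
```

With `max_instances=2`, a third `acquire` waits until one of the first two drivers is released, so the number of browser processes never exceeds the cap.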

The reason that it got stuck on downloading is probably because PhantomJS crashed:

[DEBUG - 2013-05-18T04:28:00.536Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
[DEBUG - 2013-05-18T04:28:00.637Z] Session [399aee20-bf06-11e2-a1b3-1ff9fbb8ef48] - _execFuncAndWaitForLoadDecorator - Page Loading in Session: true
ExceptionHandler::GenerateDump waitpid failed:No child processes
PhantomJS has crashed. Please read the crash reporting guide at https://github.com/ariya/phantomjs/wiki/Crash-Reporting and file a bug report at https://github.com/ariya/phantomjs/issues/new with the crash dump file attached: /tmp/75f0d88c-1f16-3dd6-4a2892d0-687e48d0.dmp

So we may also need a way to check whether PhantomJS is still responding and, if it is not, automatically restart the webdriver/PhantomJS.
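
A rough sketch of such a check, under stated assumptions. `RestartingDriver`, `factory` and `probe` are hypothetical names: `factory` starts a new webdriver, and `probe` is any cheap call that raises when the driver no longer answers (with Selenium, reading `driver.current_url` might serve). The probe would need its own timeout, since a hung process may never answer.

```python
class RestartingDriver:
    """Keeps one driver and replaces it when a probe call fails."""

    def __init__(self, factory, probe):
        self._factory = factory   # hypothetical: starts a new webdriver
        self._probe = probe       # hypothetical: raises if the driver is dead
        self.driver = factory()
        self.restarts = 0

    def ensure_alive(self):
        """Return a responsive driver, restarting it if the probe fails."""
        try:
            self._probe(self.driver)
        except Exception:
            # The old process is assumed to be gone; clean up if possible.
            try:
                self.driver.quit()
            except Exception:
                pass
            self.driver = self._factory()
            self.restarts += 1
        return self.driver
```

Calling `ensure_alive()` before each webdriver request would replace a crashed PhantomJS before the next download, instead of leaving the crawl waiting on it.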

ncadou · 2013-05-14T13:58:25Z

Requests are running concurrently in scrapy in the sense that they won't block the main twisted event loop. Stock scrapy requests will therefore go through concurrently even if an unfinished webdriver request is downloading something. However, because all webdriver requests are attached to a specific webdriver instance (which itself needs to enforce sequential access for obvious reasons), and I haven't got around to implementing multiple webdriver instances support yet, in practice only one webdriver request may be performed at a time.

samos123 · 2013-05-14T15:04:37Z

Ah, I see, so what we basically want is a new feature: multiple webdriver instances? I'm probably being stupid, but what are the obvious reasons? Just wondering, as I'm pretty new to the webdriver stuff.

Thanks again for your detailed reply. Helps me a lot!

ncadou · 2013-05-14T15:39:34Z

You got that exactly right, support for multiple webdriver instances would be a new feature for scrapy-webdriver. And no worries about being stupid, you have no idea how much head-banging my desk had to suffer when I was trying to make sense of twisted and scrapy. :)

As for the obvious reasons, a webdriver instance is basically like a browser with just one tab. So trying to download two things at the same time would not work at all. And then, the state of that browser and its currently loaded page need to be left untouched until the parser method in the scrapy spider has finished working with it.
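
To make the constraint concrete, here is a small sketch, not the project's real queue. `SingleDriverQueue`, `submit` and `done` are hypothetical names; only `driver.get(url)` mirrors the Selenium call. The next page is loaded only after the current callback reports that it has finished with the browser state.

```python
from collections import deque


class SingleDriverQueue:
    """Runs queued page loads one at a time against a single driver."""

    def __init__(self, driver):
        self._driver = driver
        self._pending = deque()   # (url, callback) pairs waiting their turn
        self._busy = False        # True while a callback still owns the page

    def submit(self, url, callback):
        self._pending.append((url, callback))
        self._run_next()

    def _run_next(self):
        if self._busy or not self._pending:
            return
        self._busy = True
        url, callback = self._pending.popleft()
        self._driver.get(url)     # load the page in the only "tab"
        # The callback must call done() once it no longer needs the page.
        callback(self._driver, self.done)

    def done(self):
        # The page state may now be replaced by the next request.
        self._busy = False
        self._run_next()
```

A second `submit` made while the first callback is still working stays in the queue until `done()` is called, which is the sequential behavior described above.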

samos123 · 2013-05-14T15:59:42Z

OK, I may give this feature a try if you don't mind. It gives me a reason to learn more about Twisted, Scrapy and Selenium. It may take some time though, and I'm not sure I will finish it at all, as I have a lot of other stuff going on.

I'm amazed so few are using this btw.

ncadou · 2013-05-14T16:04:23Z

I would certainly not mind contributions. As for the low usage, this project is still very young, so I'm not surprised.

stringertheory · 2013-05-24T22:02:50Z

@ncadou Do you think it would be feasible to allow for parallel scrapy-webdriver requests using multiple tabs or windows in a single webdriver instance instead of extending to multiple webdriver instances (to avoid overhead)?

ncadou · 2013-05-25T17:39:35Z

There are ways with webdriver to create tabs and windows, and switch between them, so it should be possible to implement that support in scrapy-webdriver.
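
A sketch of the window-switching idea, assuming the current Selenium Python bindings (`window_handles`, `switch_to.window`, `execute_script`, `get`). The helper names `open_in_new_window` and `with_window` are hypothetical. The driver still talks to one window at a time, so page loads issued through `get` would likely not overlap by themselves.

```python
def open_in_new_window(driver, url):
    """Open `url` in a new window and return that window's handle."""
    before = set(driver.window_handles)
    # Ask the browser for a new window; its handle appears in window_handles.
    driver.execute_script("window.open('about:blank');")
    new_handles = [h for h in driver.window_handles if h not in before]
    handle = new_handles[0]
    driver.switch_to.window(handle)
    driver.get(url)
    return handle


def with_window(driver, handle, func):
    """Switch to `handle`, run `func(driver)` there, and return its result."""
    driver.switch_to.window(handle)
    return func(driver)
```

Each response would then need to carry its window handle, so that the spider's parser method switches back to the right window before reading the page.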

IIIypuk09 · 2013-06-11T19:40:06Z

@ncadou Could you add a feature to use multiple webdrivers, controlled by one of the following settings?
'CONCURRENT_REQUESTS'
'CONCURRENT_REQUESTS_PER_DOMAIN'
'CONCURRENT_REQUESTS_PER_IP'

How long will we have to wait for this feature?

ncadou · 2013-06-12T14:14:09Z

@IIIypuk09 multiple webdriver instances are planned down the line, and your suggestion about using settings makes total sense, but unfortunately I don't know when I'll have the opportunity to implement that feature.
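
One way the settings suggestion could translate into code, as a sketch only. `WEBDRIVER_MAX_INSTANCES` is a hypothetical setting name that does not exist in scrapy-webdriver; `CONCURRENT_REQUESTS` is the stock Scrapy setting, whose documented default is 16. The function only needs an object with a dict-like `get`.

```python
def webdriver_pool_size(settings):
    """Pick how many webdriver instances to allow, given crawler settings."""
    # Stock Scrapy setting; 16 is Scrapy's documented default.
    concurrent = int(settings.get("CONCURRENT_REQUESTS", 16))
    # Hypothetical setting: upper bound on webdriver processes.
    max_instances = int(settings.get("WEBDRIVER_MAX_INSTANCES", 1))
    # Never start more drivers than there can be concurrent requests,
    # and always allow at least one.
    return max(1, min(concurrent, max_instances))
```

Defaulting the hypothetical setting to 1 keeps the current single-instance behavior unless a user opts in to more.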

samos123 added a commit to samos123/scrapy-webdriver that referenced this issue May 21, 2013

Added option to abort request on timeout

f8294e2

Fixes brandicted#3 stuck on downloading for a long time

tonal pushed a commit to tonal/scrapy-webdriver that referenced this issue Apr 14, 2017

Merge pull request brandicted#3 from Willet/updated-test-manager

71d1e7b

Fixed test.
