This repository has been archived by the owner on Sep 28, 2022. It is now read-only.

OffsiteMiddleware not working #6

Open
samos123 opened this issue May 25, 2013 · 2 comments

Comments

@samos123
Contributor

I saw that the request is re-issued with dont_filter=True; if I remove that, the spider just stops when it gets to the same URL.

I need to use the offsite middleware though, so any thoughts?

I will do some hacking on a total rewrite where there is no need for the spider middleware, using only a DownloaderMiddleware or a normal Downloader. Starting to understand this stuff a little, hehe.
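For context, Scrapy's stock OffsiteMiddleware skips the domain check entirely for any request carrying dont_filter=True, which would explain why offsite filtering appears to do nothing here. Below is a minimal pure-Python model of that filtering logic (not Scrapy's actual code; the Request stub and function names are illustrative):

```python
import re
from urllib.parse import urlparse

# Minimal stand-in for scrapy.Request, for illustration only.
class Request:
    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter

def host_regex(allowed_domains):
    # Match the allowed domains and any of their subdomains.
    domains = [re.escape(d) for d in allowed_domains]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))

def filter_offsite(requests, allowed_domains):
    regex = host_regex(allowed_domains)
    for req in requests:
        host = urlparse(req.url).hostname or ''
        # dont_filter=True short-circuits the domain check,
        # so offsite requests slip through unfiltered.
        if req.dont_filter or regex.search(host):
            yield req

reqs = [
    Request('http://example.com/a'),
    Request('http://evil.org/b'),                    # dropped
    Request('http://evil.org/c', dont_filter=True),  # kept anyway
]
kept = [r.url for r in filter_offsite(reqs, ['example.com'])]
```

So as long as WebdriverRequest instances are re-issued with dont_filter=True, the offsite middleware never gets a chance to drop them.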

@ncadou
Collaborator

ncadou commented May 25, 2013

If I remember correctly, dont_filter=True comes from an earlier experiment where requests were not queued up in the spider middleware. They would be rescheduled in the scrapy queue and then dropped by the offsite middleware. I'm not sure why it'd still be needed though. Do you have an idea where the spider stops exactly?

Another reason for needing WebdriverSpiderMiddleware is that we need to keep track of when a spider parse method finishes working with the webdriver instance it got assigned; until the parsing is finished, the webdriver instance should not be touched by any other spider activity. We could have the spider parse method explicitly release the webdriver instance, but that looks error-prone and in general not very clean to me. My concern here is ease of use, by making WebdriverRequest as much of a drop-in replacement for the stock Request as possible.

The spider middleware layer ended up being the best place to do the accounting and the future multiple instance management.
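The accounting described above can be sketched as a small lease manager: the driver stays reserved while a parse callback's output iterator is being consumed, and is released automatically once it is exhausted, so spider code never releases it explicitly. This is a hypothetical model, not the actual scrapy-webdriver implementation; WebdriverManager and its method names are my own:

```python
# Hypothetical sketch of the lifecycle accounting a spider middleware
# can perform: the webdriver stays leased while a parse callback's
# output is being consumed, and is released automatically afterwards.

class WebdriverManager:
    def __init__(self, driver):
        self.driver = driver
        self.in_use = False
        self.waiting = []  # requests queued until the driver is free

    def acquire(self, request):
        """Hand out the driver, or queue the request if it is busy."""
        if self.in_use:
            self.waiting.append(request)
            return None
        self.in_use = True
        return self.driver

    def release(self):
        """Free the driver and return the next queued request, if any."""
        self.in_use = False
        return self.waiting.pop(0) if self.waiting else None

def process_spider_output(manager, result):
    # The middleware wraps the callback's iterator; when it is
    # exhausted, the parse method is done with the driver and the
    # middleware can release it for the next queued request.
    for item in result:
        yield item
    manager.release()
```

The key point is that release happens in the middleware, after the generator is fully consumed, which is exactly the information only the spider middleware layer has.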

@samos123
Contributor Author

Yeah, I noticed the same thing. No idea why yet; I've been looking at the related code without much success so far.

I see, yeah, we need the webdriver instance if people still want to use it in the spider. Couldn't we just pass a deep copy? Guess not, because it would still be interacting with the same remote webdriver.

You're right, I think the spider middleware is a nice solution for using the webdriver in the spider. I am mostly using this for rendering pages with JavaScript, so I didn't get to that part yet.

I hacked something together for my own use case last night, which uses the downloader only. The offsite middleware works fine there. I took inspiration from https://github.com/scrapinghub/scrapyjs/; here is the result of yesterday's hacking: https://github.com/samos123/scrapy-webdriver/tree/downloader-only
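The downloader-only idea can be sketched like this: rendering happens inside a downloader middleware's process_request, which returns a response and short-circuits the normal download, so requests pass through the scheduler and the offsite spider middleware before the webdriver ever sees them. This is a pure-Python model, not the code in that branch; the Scrapy and Selenium classes are stubbed, the simplified process_request signature is an assumption, and any driver exposing .get() and .page_source would fit:

```python
# Hypothetical model of a downloader-only webdriver integration.
# Stubs stand in for scrapy.Request / scrapy.http.HtmlResponse and a
# selenium webdriver; real signatures differ.

class Request:
    def __init__(self, url):
        self.url = url

class HtmlResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body

class WebdriverDownloaderMiddleware:
    def __init__(self, driver):
        self.driver = driver

    def process_request(self, request):
        # Returning a response here short-circuits the real download,
        # so the JavaScript-rendered DOM is what the spider callbacks
        # see. Offsite filtering has already run by this point.
        self.driver.get(request.url)
        return HtmlResponse(request.url, self.driver.page_source)

class FakeDriver:
    # Stand-in for a selenium webdriver, for illustration only.
    page_source = '<html><p>rendered by js</p></html>'
    def get(self, url):
        self.last_url = url
```

Because the middleware sits on the downloader side, no special request class or dont_filter flag is needed, which is why the offsite middleware behaves normally in that branch.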
