-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
handle_urls decorator using a new PageObjectRegistry #16
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work on this @ivanprado ! Added some comments for my initial review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Codecov Report
@@ Coverage Diff @@
## master #16 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 4 6 +2
Lines 55 196 +141
==========================================
+ Hits 55 196 +141
|
rules = default_registry.get_overrides_from_module("my_page_obj_project.cool_gadget_site.us.product_listings") | ||
|
||
# Or simply all of Override rules ever declared. | ||
rules = default_registry.get_overrides() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for this to work user would need to call walk_modules
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this on: daa3ff9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've improved upon the previous commit linked above by creating a much more simpler interface for both get_overrides()
+ consuming/importing external packages/modules. Changes are in 0cbeb0b. It can also be viewed in this code lines: https://github.com/scrapinghub/web-poet/pull/16/files#diff-409cb5fe79e7d7d5ede41109573e31bd847a147a966f477de80c99279bcdf479R247-R251
from web_poet.pages import ItemWebPage | ||
|
||
@cool_gadget_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage) | ||
@cool_gadget_us_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having a different idea in mind when talking about multiple registries, and probably a different use case.
Let's say you have all of the following:
- A large package, with many POs for different websites.
- A few small packages, with POs for some specific websites.
These packages may be in separate git repos, packaged separately, etc.
There are a few use cases:
- We want to use all of the POs defined in all these packages. This probably should be handled by web_poet.get_overrides().
- We want to use all of the POs defined in all packages, but replace or exclude some of them (you created a better version and want to use it instead).
- We want to use POs only from some of the packages.
- We want to use some POs from some of the packages.
In all these cases modifying the source code of the packages may not be an option; we'd really want to select which POs to enable and which POs not to use in a flexible way.
It seems this is possible with the get_overrides_from_module
API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).
I was thinking about a different approach for using multiple registries, something like this:
registry = web_poet.PageObjectRegistry()
from some_package.pages import CoolGadgetSiteProductPage
registry.handle_urls(CoolGadgetSiteProductPage, "coolgadgetsite.com", overrides=GenericProductPage)
Of course, you can also have POs in your own project using the same registry, declared with a decorator.
What I don't like in the example above is that all handle_urls parameters are still duplicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I do see your point. Cheers for laying it out! I'll be updating this section in the docs to better showcase such scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this @kmike ! I created a new dedicated section for this to explore various ways of using external packages which contain Page Objects. I can see how this would be a common requirement/usecase for the users.
It seems this is possible with the get_overrides_from_module API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).
This section was not taking into account the use of external packages but rather exploring how to organize the Override Rules using multiple Registries in a single local project. I believe the new dedicated section named Using Overrides from External Packages might remove the confusion in the docs: bd3a88e UPDATE: The API has been updated and so is the docs. Ref: bf0b3e5
Let me know what you think of this new guide. Cheers!
web_poet/overrides.py
Outdated
return { | ||
cls: rule | ||
for cls, rule in self.data.items() | ||
if cls.__module__.startswith(module) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can string matching lead to issues here? E.g. cls.__module__
is foo.bar, and module is "fo"?
I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?
Caveat: I'm not confident I understood this part correctly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can string matching lead to issues here? E.g. cls.module is foo.bar, and module is "fo"?
Great catch! 🙏 It would indeed have some matching issues here. Fixed it on 10dff5b.
I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?
What we're filtering out here is the Registry's data
values which contains the mapping of override rules. There might be cases where the registry was applied with walk_modules
and could now contain data
values on some other package/module. This filtration would prevent that.
Hi @kmike , After working and polishing on the separate PR in scrapinghub/scrapy-poet#57, I realized we needed a quick and easy way to access the rules via CLI. Thus I've introduced a concept of a registry_pool where it contains all instances of Commit is at de5563a |
Overrides contains mapping rules to associate which URLs a particular | ||
Page Object would be used. The URL matching rules is handled by another library | ||
called `url-matcher <https://url-matcher.readthedocs.io>`_. | ||
|
||
Using such matching rules establishes the core concept of Overrides wherein | ||
its able to use specific Page Objects in lieu of the original one. | ||
|
||
This enables ``web-poet`` to be used effectively by other frameworks like | ||
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: think about how can we simplify the description here. There is nothing wrong with the current description, but it was a bit hard for me to understand what it means. It becomes much more clear in the next section.
called `url-matcher <https://url-matcher.readthedocs.io>`_. | ||
|
||
Using such matching rules establishes the core concept of Overrides wherein | ||
its able to use specific Page Objects in lieu of the original one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its able to use specific Page Objects in lieu of the original one. | |
it's able to use specific Page Objects in lieu of the original one. |
Closing it in favor of #27. |
Do not merge: It requiresurl-matcher
to be released in advance.The new decorator
handle_urls
can be used to annotate overriding rules directly in the Page Object classes. The functionfind_page_object_overrides
list all the declared overrides within a particular module/package recursively.The
PageObjectRegistry
class can be used to create different sets of rules.A CMD has been introduced to explore packages. This is what you get if you run
python -m web_poet tests.po_lib
or simplyweb_poet tests.po_lib
:A pair or new dependencies were introduced:
url-matcher
for thePatterns
definitiontabulate
to format the output in the CMDThe function/classes are documented, but no general documentation was introduced. I leave it as a follow-up task. I'm not sure this documentation should be part of
web-poet
though.Todo:
handle_urls
andfind_page_objects_overrides
, with links toscrapy-poet
documentation where really main part of the documentation should be. Also add documentation about thepython -m web_poet
commandSchema of possible documentation
A section is needed for "overrides" metadata or something like that.
handle_urls
for. We can say it is used by some engines, e.g.scrapy-poet
, to locate the extraction code for particular sites or sets of pages. Link toscrapy-poet
tutorial.url-matcher
documentation.find_page_objects_overrides
function. It can be used to explore the handle_urls metadata. Explanation and usage. Link toscrapy-poet
place where shows how to configure overrides using the function.python -m web_poet ...