Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

handle_urls decorator using a new PageObjectRegistry #16

Closed
wants to merge 36 commits into from

Conversation

ivanprado
Copy link
Contributor

@ivanprado ivanprado commented Nov 30, 2021

Do not merge: It requires url-matcher to be released in advance.

The new decorator handle_urls can be used to annotate overriding rules directly in the Page Object classes. The function find_page_object_overrides list all the declared overrides within a particular module/package recursively.

The PageObjectRegistry class can be used to create different sets of rules.

A CMD has been introduced to explore packages. This is what you get if you run python -m web_poet tests.po_lib or simply web_poet tests.po_lib:

Use this                                                    instead of                                                           for the URL patterns            except for the patterns      with priority  meta
----------------------------------------------------------  -------------------------------------------------------------------  ------------------------------  -------------------------  ---------------  --------------------
tests.po_lib.POTopLevel1                                    tests.po_lib.POTopLevelOverriden1                                    ['example.com']                 ['/*.jpg|']                            300  {}
tests.po_lib.POTopLevel2                                    tests.po_lib.POTopLevelOverriden2                                    ['example.com']                 []                                     500  {}
tests.po_lib.a_module.POModule                              tests.po_lib.a_module.POModuleOverriden                              ['example.com']                 []                                     500  {'extra_arg': 'foo'}
tests.po_lib.nested_package.PONestedPkg                     tests.po_lib.nested_package.PONestedPkgOverriden                     ['example.com', 'example.org']  ['/*.jpg|']                            500  {}
tests.po_lib.nested_package.a_nested_module.PONestedModule  tests.po_lib.nested_package.a_nested_module.PONestedModuleOverriden  ['example.com', 'example.org']  ['/*.jpg|']                            500  {}

A pair or new dependencies were introduced:

  • url-matcher for the Patterns definition
  • tabulate to format the output in the CMD

The function/classes are documented, but no general documentation was introduced. I leave it as a follow-up task. I'm not sure this documentation should be part of web-poet though.

Todo:

  • Add some documentation about handle_urls and find_page_objects_overrides, with links to scrapy-poet documentation where really main part of the documentation should be. Also add documentation about the python -m web_poet command

Schema of possible documentation

A section is needed for "overrides" metadata or something like that.

  1. What is handle_urls for. We can say it is used by some engines, e.g. scrapy-poet, to locate the extraction code for particular sites or sets of pages. Link to scrapy-poet tutorial.
  2. Explain the basic: handle_url just define the rule: for this URLs use this Page Object instead of this one.
  3. Patterns. What are they and link to url-matcher documentation.
  4. find_page_objects_overrides function. It can be used to explore the handle_urls metadata. Explanation and usage. Link to scrapy-poet place where shows how to configure overrides using the function.
  5. Command line tool to list PO python -m web_poet ...
  6. Namespaces

@BurnzZ BurnzZ self-requested a review December 1, 2021 08:22
Copy link
Contributor

@BurnzZ BurnzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on this @ivanprado ! Added some comments for my initial review.

CHANGELOG.rst Outdated Show resolved Hide resolved
docs/requirements.txt Outdated Show resolved Hide resolved
tests/po_lib/__init__.py Outdated Show resolved Hide resolved
web_poet/__main__.py Outdated Show resolved Hide resolved
web_poet/overrides.py Outdated Show resolved Hide resolved
Copy link

@sortafreel sortafreel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

setup.py Show resolved Hide resolved
web_poet/overrides.py Outdated Show resolved Hide resolved
tests/test_overrides.py Outdated Show resolved Hide resolved
web_poet/overrides.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Dec 21, 2021

Codecov Report

Merging #16 (d5a5d75) into master (bdce1ac) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #16    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            4         6     +2     
  Lines           55       196   +141     
==========================================
+ Hits            55       196   +141     
Impacted Files Coverage Δ
web_poet/__init__.py 100.00% <100.00%> (ø)
web_poet/__main__.py 100.00% <100.00%> (ø)
web_poet/overrides.py 100.00% <100.00%> (ø)

web_poet/overrides.py Outdated Show resolved Hide resolved
docs/intro/overrides.rst Outdated Show resolved Hide resolved
rules = default_registry.get_overrides_from_module("my_page_obj_project.cool_gadget_site.us.product_listings")

# Or simply all of Override rules ever declared.
rules = default_registry.get_overrides()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for this to work user would need to call walk_modules

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this on: daa3ff9

Copy link
Contributor

@BurnzZ BurnzZ Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've improved upon the previous commit linked above by creating a much more simpler interface for both get_overrides() + consuming/importing external packages/modules. Changes are in 0cbeb0b. It can also be viewed in this code lines: https://github.com/scrapinghub/web-poet/pull/16/files#diff-409cb5fe79e7d7d5ede41109573e31bd847a147a966f477de80c99279bcdf479R247-R251

from web_poet.pages import ItemWebPage

@cool_gadget_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage)
@cool_gadget_us_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having a different idea in mind when talking about multiple registries, and probably a different use case.

Let's say you have all of the following:

  1. A large package, with many POs for different websites.
  2. A few small packages, with POs for some specific websites.

These packages may be in separate git repos, packaged separately, etc.

There are a few use cases:

  1. We want to use all of the POs defined in all these packages. This probably should be handled by web_poet.get_overrides().
  2. We want to use all of the POs defined in all packages, but replace or exclude some of them (you created a better version and want to use it instead).
  3. We want to use POs only from some of the packages.
  4. We want to use some POs from some of the packages.

In all these cases modifying the source code of the packages may not be an option; we'd really want to select which POs to enable and which POs not to use in a flexible way.

It seems this is possible with the get_overrides_from_module API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).

I was thinking about a different approach for using multiple registries, something like this:

registry = web_poet.PageObjectRegistry()

from some_package.pages import CoolGadgetSiteProductPage
registry.handle_urls(CoolGadgetSiteProductPage, "coolgadgetsite.com", overrides=GenericProductPage)

Of course, you can also have POs in your own project using the same registry, declared with a decorator.

What I don't like in the example above is that all handle_urls parameters are still duplicated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I do see your point. Cheers for laying it out! I'll be updating this section in the docs to better showcase such scenario.

Copy link
Contributor

@BurnzZ BurnzZ Jan 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @kmike ! I created a new dedicated section for this to explore various ways of using external packages which contain Page Objects. I can see how this would be a common requirement/usecase for the users.

It seems this is possible with the get_overrides_from_module API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).

This section was not taking into account the use of external packages but rather exploring how to organize the Override Rules using multiple Registries in a single local project. I believe the new dedicated section named Using Overrides from External Packages might remove the confusion in the docs: bd3a88e UPDATE: The API has been updated and so is the docs. Ref: bf0b3e5

Let me know what you think of this new guide. Cheers!

return {
cls: rule
for cls, rule in self.data.items()
if cls.__module__.startswith(module)
Copy link
Member

@kmike kmike Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can string matching lead to issues here? E.g. cls.__module__ is foo.bar, and module is "fo"?

I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?

Caveat: I'm not confident I understood this part correctly :)

Copy link
Contributor

@BurnzZ BurnzZ Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can string matching lead to issues here? E.g. cls.module is foo.bar, and module is "fo"?

Great catch! 🙏 It would indeed have some matching issues here. Fixed it on 10dff5b.

I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?

What we're filtering out here is the Registry's data values which contains the mapping of override rules. There might be cases where the registry was applied with walk_modules and could now contain data values on some other package/module. This filtration would prevent that.

@BurnzZ
Copy link
Contributor

BurnzZ commented Jan 13, 2022

Hi @kmike ,

After working and polishing on the separate PR in scrapinghub/scrapy-poet#57, I realized we needed a quick and easy way to access the rules via CLI. Thus I've introduced a concept of a registry_pool where it contains all instances of PageObjectRegistry that were ever created. This allows us to easily pin point any of the registry instances to be able to access and modify them.

Commit is at de5563a

@BurnzZ BurnzZ changed the title handle_urls decorator and find_page_object_overrides method handle_urls decorator using a new PageObjectRegistry Jan 17, 2022
web_poet/overrides.py Outdated Show resolved Hide resolved
@BurnzZ
Copy link
Contributor

BurnzZ commented Jan 20, 2022

Hi @kmike , I've updated the Registry API so that users can now easily combine, modify, delete, search, etc different rules from different registries. Done in bf0b3e5.

Comment on lines +6 to +14
Overrides contains mapping rules to associate which URLs a particular
Page Object would be used. The URL matching rules is handled by another library
called `url-matcher <https://url-matcher.readthedocs.io>`_.

Using such matching rules establishes the core concept of Overrides wherein
its able to use specific Page Objects in lieu of the original one.

This enables ``web-poet`` to be used effectively by other frameworks like
`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: think about how can we simplify the description here. There is nothing wrong with the current description, but it was a bit hard for me to understand what it means. It becomes much more clear in the next section.

called `url-matcher <https://url-matcher.readthedocs.io>`_.

Using such matching rules establishes the core concept of Overrides wherein
its able to use specific Page Objects in lieu of the original one.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
its able to use specific Page Objects in lieu of the original one.
it's able to use specific Page Objects in lieu of the original one.

@kmike
Copy link
Member

kmike commented Mar 30, 2022

Closing it in favor of #27.

@kmike kmike closed this Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants