handle_urls decorator using a new PageObjectRegistry #16

ivanprado · 2021-11-30T09:38:44Z

Do not merge: It requires url-matcher to be released in advance.

The new decorator handle_urls can be used to annotate overriding rules directly in the Page Object classes. The function find_page_object_overrides lists all the overrides declared within a particular module/package, recursively.

The PageObjectRegistry class can be used to create different sets of rules.
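
As a rough illustration of the idea (a simplified sketch, not web-poet's actual implementation), a registry can hand out a class decorator that records one rule each time a Page Object class is defined. `ToyRule`, `ToyRegistry`, and the page classes are hypothetical stand-ins; the `exclude` and `priority` parameters and the default priority of 500 are assumptions that mirror the columns of the CLI table in this description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def _as_tuple(value) -> Tuple[str, ...]:
    """Accept a single pattern or an iterable of patterns."""
    return (value,) if isinstance(value, str) else tuple(value)


@dataclass(frozen=True)
class ToyRule:
    """One override rule: use `use` instead of `instead_of` for some URLs."""
    use: type
    instead_of: type
    include: Tuple[str, ...]
    exclude: Tuple[str, ...] = ()
    priority: int = 500


class ToyRegistry:
    """Hypothetical, simplified stand-in for a Page Object registry."""

    def __init__(self) -> None:
        self._rules: Dict[type, ToyRule] = {}

    def handle_urls(self, include, *, overrides, exclude=(), priority=500):
        """Return a class decorator that records a rule and returns the class unchanged."""
        def decorator(cls):
            self._rules[cls] = ToyRule(
                cls, overrides, _as_tuple(include), _as_tuple(exclude), priority
            )
            return cls
        return decorator

    def get_overrides(self) -> List[ToyRule]:
        return list(self._rules.values())


registry = ToyRegistry()


class GenericProductPage:
    """Stand-in for the Page Object being overridden."""


@registry.handle_urls("example.com", overrides=GenericProductPage,
                      exclude="/*.jpg|", priority=300)
class ExampleProductPage(GenericProductPage):
    """Stand-in for the site-specific Page Object."""


rules = registry.get_overrides()
```

Because the decorator returns the class unchanged, annotating a Page Object this way does not alter how the class itself behaves; the rule lives only in the registry.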

A CMD has been introduced to explore packages. This is what you get if you run python -m web_poet tests.po_lib or simply web_poet tests.po_lib:

Use this                                                    instead of                                                           for the URL patterns            except for the patterns      with priority  meta
----------------------------------------------------------  -------------------------------------------------------------------  ------------------------------  -------------------------  ---------------  --------------------
tests.po_lib.POTopLevel1                                    tests.po_lib.POTopLevelOverriden1                                    ['example.com']                 ['/*.jpg|']                            300  {}
tests.po_lib.POTopLevel2                                    tests.po_lib.POTopLevelOverriden2                                    ['example.com']                 []                                     500  {}
tests.po_lib.a_module.POModule                              tests.po_lib.a_module.POModuleOverriden                              ['example.com']                 []                                     500  {'extra_arg': 'foo'}
tests.po_lib.nested_package.PONestedPkg                     tests.po_lib.nested_package.PONestedPkgOverriden                     ['example.com', 'example.org']  ['/*.jpg|']                            500  {}
tests.po_lib.nested_package.a_nested_module.PONestedModule  tests.po_lib.nested_package.a_nested_module.PONestedModuleOverriden  ['example.com', 'example.org']  ['/*.jpg|']                            500  {}

A pair of new dependencies was introduced:

url-matcher for the Patterns definition
tabulate to format the output in the CMD

The function/classes are documented, but no general documentation was introduced. I leave it as a follow-up task. I'm not sure this documentation should be part of web-poet though.

Todo:

Add some documentation about handle_urls and find_page_object_overrides, with links to the scrapy-poet documentation, where the main part of the documentation should really be. Also add documentation about the python -m web_poet command.

Schema of possible documentation

A section is needed for "overrides" metadata or something like that.

What handle_urls is for. We can say it is used by some engines, e.g. scrapy-poet, to locate the extraction code for particular sites or sets of pages. Link to the scrapy-poet tutorial.
Explain the basics: handle_urls just defines the rule: for these URLs, use this Page Object instead of that one.
Patterns. What they are, with a link to the url-matcher documentation.
find_page_object_overrides function. It can be used to explore the handle_urls metadata. Explanation and usage. Link to the place in the scrapy-poet docs that shows how to configure overrides using the function.
Command line tool to list POs: python -m web_poet ...
Namespaces

BurnzZ

Great work on this @ivanprado ! Added some comments for my initial review.

CHANGELOG.rst

docs/requirements.txt

tests/po_lib/__init__.py

web_poet/__main__.py

web_poet/overrides.py

sortafreel

LGTM 👍

setup.py

web_poet/overrides.py

tests/test_overrides.py

web_poet/overrides.py

codecov · 2021-12-21T03:22:26Z

Codecov Report

Merging #16 (d5a5d75) into master (bdce1ac) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##            master       #16    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files            4         6     +2     
  Lines           55       196   +141     
==========================================
+ Hits            55       196   +141

Impacted Files	Coverage Δ
web_poet/__init__.py	`100.00% <100.00%> (ø)`
web_poet/__main__.py	`100.00% <100.00%> (ø)`
web_poet/overrides.py	`100.00% <100.00%> (ø)`

web_poet/overrides.py

docs/intro/overrides.rst

kmike · 2022-01-05T23:27:08Z

docs/intro/overrides.rst

+    rules = default_registry.get_overrides_from_module("my_page_obj_project.cool_gadget_site.us.product_listings")
+
+    # Or simply all of Override rules ever declared.
+    rules = default_registry.get_overrides()


I think for this to work user would need to call walk_modules

Updated this on: daa3ff9

I've improved upon the previous commit linked above by creating a much simpler interface for both get_overrides() and consuming/importing external packages/modules. Changes are in 0cbeb0b. They can also be viewed in these code lines: https://github.com/scrapinghub/web-poet/pull/16/files#diff-409cb5fe79e7d7d5ede41109573e31bd847a147a966f477de80c99279bcdf479R247-R251
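
The underlying issue is that a decorator only runs when its module is imported, so a registry stays empty until every module containing annotated Page Objects has been loaded. A minimal sketch of a recursive import helper, using only the standard library, could look like this. The name `import_submodules` is hypothetical and is not the `walk_modules` or `consume_modules` function from this PR; the `json` package is used only because it ships with Python.

```python
import importlib
import pkgutil
from types import ModuleType
from typing import List


def import_submodules(package_name: str) -> List[ModuleType]:
    """Import a package and, recursively, every module inside it."""
    root = importlib.import_module(package_name)
    modules = [root]
    # Plain modules have no __path__, so there is nothing to walk.
    search_path = getattr(root, "__path__", None)
    if search_path is None:
        return modules
    # walk_packages yields submodules and descends into subpackages.
    for info in pkgutil.walk_packages(search_path, prefix=root.__name__ + "."):
        modules.append(importlib.import_module(info.name))
    return modules


# Importing has the side effect of running any decorators in those modules.
loaded = import_submodules("json")
names = [module.__name__ for module in loaded]
```

Importing arbitrary packages executes their top-level code, so a helper like this is best pointed only at packages that are known to contain Page Objects.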

kmike · 2022-01-05T23:46:09Z

docs/intro/overrides.rst

+    from web_poet.pages import ItemWebPage
+
+    @cool_gadget_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage)
+    @cool_gadget_us_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage)


I had a different idea in mind when talking about multiple registries, and probably a different use case.

Let's say you have all of the following:

A large package, with many POs for different websites.

A few small packages, with POs for some specific websites.

These packages may be in separate git repos, packaged separately, etc.

There are a few use cases:

We want to use all of the POs defined in all these packages. This probably should be handled by web_poet.get_overrides().

We want to use all of the POs defined in all packages, but replace or exclude some of them (you created a better version and want to use it instead).

We want to use POs only from some of the packages.

We want to use some POs from some of the packages.

In all these cases modifying the source code of the packages may not be an option; we'd really want to select which POs to enable and which POs not to use in a flexible way.

It seems this is possible with the get_overrides_from_module API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).

I was thinking about a different approach for using multiple registries, something like this:

    registry = web_poet.PageObjectRegistry()
    from some_package.pages import CoolGadgetSiteProductPage
    registry.handle_urls(CoolGadgetSiteProductPage, "coolgadgetsite.com", overrides=GenericProductPage)

Of course, you can also have POs in your own project using the same registry, declared with a decorator.

What I don't like in the example above is that all handle_urls parameters are still duplicated
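
One way to get a similar effect without a new signature, sketched under the assumption that handle_urls is an ordinary decorator factory: call the returned decorator directly on the imported class. `ToyRegistry` and every class name here are hypothetical, and `CoolGadgetSiteProductPage` is defined locally only to keep the sketch self-contained. The sketch does not solve the parameter duplication mentioned above.

```python
class ToyRegistry:
    """Hypothetical minimal registry; not the PR's PageObjectRegistry."""

    def __init__(self):
        self.rules = {}

    def handle_urls(self, include, *, overrides):
        def decorator(cls):
            self.rules[cls] = (include, overrides)
            return cls
        return decorator


class GenericProductPage:
    """Stand-in for a generic Page Object."""


class CoolGadgetSiteProductPage(GenericProductPage):
    """Pretend this class was imported from an external package."""


registry = ToyRegistry()

# A decorator is just a callable, so it can be applied to an existing class
# without touching the source code of the package that defines it.
registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage)(
    CoolGadgetSiteProductPage
)


# Classes in your own project can still use decorator syntax on the same registry.
@registry.handle_urls("coolgadgetsite.com/blog", overrides=GenericProductPage)
class LocalBlogPage(GenericProductPage):
    pass
```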

Got it. I do see your point. Cheers for laying it out! I'll be updating this section in the docs to better showcase such a scenario.

Thanks for raising this @kmike ! I created a new dedicated section for this to explore various ways of using external packages which contain Page Objects. I can see how this would be a common requirement/use case for the users.

It seems this is possible with the get_overrides_from_module API, but all the examples of multiple registries require making changes to the source code. I'm not sure it'd be a good practice to suggest people to create many registries and modify the source code to add more handle_urls decorators (usually with the same parameters - a lot of copy-paste).

This section did not take into account the use of external packages; it explored how to organize the Override Rules using multiple Registries in a single local project. ~~I believe the new dedicated section named Using Overrides from External Packages might remove the confusion in the docs: bd3a88e~~ UPDATE: The API has been updated, and so have the docs. Ref: bf0b3e5

Let me know what you think of this new guide. Cheers!

kmike · 2022-01-05T23:51:32Z

web_poet/overrides.py

+        return {
+            cls: rule
+            for cls, rule in self.data.items()
+            if cls.__module__.startswith(module)


Can string matching lead to issues here? E.g. cls.__module__ is foo.bar, and module is "fo"?

I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?

Caveat: I'm not confident I understood this part correctly :)

Can string matching lead to issues here? E.g. cls.__module__ is foo.bar, and module is "fo"?

Great catch! 🙏 It would indeed have some matching issues here. Fixed it on 10dff5b.

I also wonder if startswith is necessary; doesn't walk_modules import all modules recursively, so that you can always use exact match?

What we're filtering here is the Registry's data values, which contain the mapping of override rules. There might be cases where the registry was populated via walk_modules and now contains data values from some other package/module. This filtering prevents those from being returned.
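
For illustration, one way to keep prefix filtering while avoiding the "fo" matches "foo.bar" problem is to match only on dotted boundaries. This is a sketch of the general technique, not necessarily how 10dff5b fixed it; `in_module` is a hypothetical helper name.

```python
def in_module(cls_module: str, module: str) -> bool:
    """True if cls_module is `module` itself or one of its submodules."""
    return cls_module == module or cls_module.startswith(module + ".")


checks = {
    ("foo.bar", "fo"): in_module("foo.bar", "fo"),
    ("foo.bar", "foo"): in_module("foo.bar", "foo"),
    ("foo.bar", "foo.bar"): in_module("foo.bar", "foo.bar"),
    ("foobar", "foo"): in_module("foobar", "foo"),
}
```

Appending the dot before calling `startswith` is what rules out partial-name matches such as `foobar` for the prefix `foo`.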

BurnzZ · 2022-01-13T07:33:18Z

Hi @kmike ,

After working on and polishing the separate PR in scrapinghub/scrapy-poet#57, I realized we needed a quick and easy way to access the rules via the CLI. So I've introduced the concept of a registry_pool, which contains all instances of PageObjectRegistry that were ever created. This lets us easily pinpoint any registry instance in order to access and modify it.

Commit is at de5563a
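
A minimal sketch of the pool idea, assuming each registry is given a unique name and the pool is a class-level dictionary keyed by that name. Both the naming scheme and the `ToyRegistry` class are assumptions made for illustration; the comment above only states that the pool contains every PageObjectRegistry instance ever created.

```python
class ToyRegistry:
    """Hypothetical registry that records every instance in a shared pool."""

    # Class-level mapping shared by all instances: name -> registry.
    pool = {}

    def __init__(self, name: str):
        if name in ToyRegistry.pool:
            raise ValueError(f"a registry named {name!r} already exists")
        self.name = name
        self.rules = []
        ToyRegistry.pool[name] = self


default = ToyRegistry("default")
us = ToyRegistry("us")

# A CLI (or any other code) can now look a registry up by name.
found = ToyRegistry.pool["us"]
pool_names = sorted(ToyRegistry.pool)
```

One trade-off of this design is that the pool holds strong references, so registries are never garbage-collected once created.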

web_poet/overrides.py

BurnzZ · 2022-01-20T11:29:16Z

Hi @kmike , I've updated the Registry API so that users can now easily combine, modify, delete, search, etc. rules from different registries. Done in bf0b3e5.

kmike · 2022-02-24T20:44:43Z

docs/intro/overrides.rst

+Overrides contains mapping rules to associate which URLs a particular
+Page Object would be used. The URL matching rules is handled by another library
+called `url-matcher <https://url-matcher.readthedocs.io>`_.
+
+Using such matching rules establishes the core concept of Overrides wherein
+its able to use specific Page Objects in lieu of the original one.
+
+This enables ``web-poet`` to be used effectively by other frameworks like 
+`scrapy-poet <https://scrapy-poet.readthedocs.io>`_.


todo: think about how we can simplify the description here. There is nothing wrong with the current description, but it was a bit hard for me to understand what it means. It becomes much clearer in the next section.

kmike · 2022-02-24T20:45:44Z

docs/intro/overrides.rst

+called `url-matcher <https://url-matcher.readthedocs.io>`_.
+
+Using such matching rules establishes the core concept of Overrides wherein
+its able to use specific Page Objects in lieu of the original one.


Suggested change

its able to use specific Page Objects in lieu of the original one.

it's able to use specific Page Objects in lieu of the original one.

kmike · 2022-03-30T18:14:08Z

Closing it in favor of #27.

ivanprado added 8 commits November 29, 2021 16:50

meta module

a5a8f42

CMD for listing overrides

ec80b69

Refactoring with better names and structures and meta inclusion

308bd1d

docstring

aa8000d

Fix url_matcher dep

a2d5cb6

Fix CI tests

1f1f410

Make mypy happy again

bdb8987

Documentation fixed

a3e3eea

ivanprado requested review from kmike and sortafreel November 30, 2021 10:45

Minor changes

ef9945b

BurnzZ self-requested a review December 1, 2021 08:22

url-matcher has now been released.

f6fdac4

BurnzZ reviewed Dec 3, 2021

CHANGELOG.rst Outdated

docs/requirements.txt Outdated

tests/po_lib/__init__.py Outdated

web_poet/__main__.py Outdated

ivanprado mentioned this pull request Dec 8, 2021

url-matcher integration with scrapy-poet scrapinghub/scrapy-poet#56

Merged

2 tasks

kmike reviewed Dec 15, 2021

web_poet/overrides.py Outdated

sortafreel reviewed Dec 16, 2021

setup.py

web_poet/overrides.py Outdated

tests/test_overrides.py Outdated

web_poet/overrides.py Outdated

BurnzZ and others added 6 commits December 20, 2021 12:32

Merge branch 'master' into handle_urls

b050d01

add entry point for CLI command

ba52ce0

fix import which fails tests

ba61626

refactor namespace to be classes instead

f5cffef

fix failing mypy tests after refactoring

c3579b9

Merge branch 'master' of github.com:scrapinghub/web-poet into handle_…

234b8d9

…urls

BurnzZ added 3 commits December 21, 2021 11:40

update tests to improve coverage

531752f

add missing import for find_page_object_overrides

7495b58

add docs for overrides

0a0ee12

kmike reviewed Dec 21, 2021

web_poet/overrides.py Outdated

refactor by removing the need for find_page_object_overrides()

46d40e7

BurnzZ force-pushed the handle_urls branch from 1efb506 to 46d40e7 Compare December 22, 2021 13:46

kmike reviewed Jan 5, 2022

docs/intro/overrides.rst Outdated

kmike reviewed Jan 5, 2022

update override docs to showcase url-matcher patterns

0a2d779

BurnzZ force-pushed the handle_urls branch from 7035ab3 to 0a2d779 Compare January 6, 2022 03:29

BurnzZ added 4 commits January 6, 2022 11:48

rename get_overrides_from_module into get_overrides_from

c000cbc

fix bug where module substring paths are not filtered out correctly

10dff5b

create consume_modules() to properly load annotations in get_overrides()

daa3ff9

update get_overrides_from to accept an arbitrary number of str inputs

3b05c07

BurnzZ force-pushed the handle_urls branch from 5d73e58 to 3b05c07 Compare January 6, 2022 11:55

BurnzZ added 3 commits January 6, 2022 20:13

add more warning docs to get_overrides() to use consume_modules()

f626efc

enable ease of combining external Page Object packages

bd3a88e

refactor get_overrides() to have a simpler interface with consume_mod…

0cbeb0b

…ules()

BurnzZ force-pushed the handle_urls branch from 94fed0f to 0cbeb0b Compare January 12, 2022 06:31

introduce concept of 'registry_pool' to access all PageObjectRegistry…

de5563a

… instances

implement __hash__() in OverrideRule to easily identify uniqueness

e7cca69

BurnzZ force-pushed the handle_urls branch from ac36c88 to e7cca69 Compare January 13, 2022 09:55

polish documentation with better examples and discussion

eab277a

BurnzZ changed the title ~~handle_urls decorator and find_page_object_overrides method~~ handle_urls decorator using a new PageObjectRegistry Jan 17, 2022

BurnzZ reviewed Jan 17, 2022

web_poet/overrides.py Outdated

BurnzZ added 2 commits January 18, 2022 19:40

add more tests when PageObjectRegistry is instantiated

38e56cd

update PageObjectRegistry API for manipulating rules from different r…

bf0b3e5

…egistries

update OverrideRule __hash__() implementation after url-matcher==0.2.…

d5a5d75

…0 update

kmike reviewed Feb 24, 2022

kmike closed this Mar 30, 2022
