diff --git a/CHANGELOG.rst b/CHANGELOG.rst index 28734261..f5c28b1c 100644 --- a/CHANGELOG.rst +++ b/CHANGELOG.rst @@ -5,10 +5,12 @@ Changelog TBR ------------------ +* added a ``PageObjectRegistry`` class which has the ``handle_urls`` decorator + to write override rules. +* new CLI tool for displaying all available Page Objects: ``web_poet `` * removed support for Python 3.6 * added support for Python 3.10 - 0.1.1 (2021-06-02) ------------------ diff --git a/docs/api_reference.rst b/docs/api_reference.rst index 011f878e..e4d06484 100644 --- a/docs/api_reference.rst +++ b/docs/api_reference.rst @@ -1,3 +1,5 @@ +.. _`api-reference`: + ============= API Reference ============= @@ -45,3 +47,15 @@ Mixins .. autoclass:: web_poet.mixins.ResponseShortcutsMixin :members: :no-special-members: + + +.. _`api-overrides`: + +Overrides +========= + +.. autofunction:: web_poet.handle_urls + +.. automodule:: web_poet.overrides + :members: + :exclude-members: handle_urls diff --git a/docs/conf.py b/docs/conf.py index 353e5968..09dfab08 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -192,4 +192,5 @@ intersphinx_mapping = { 'python': ('https://docs.python.org/3', None, ), 'scrapy': ('https://docs.scrapy.org/en/latest', None, ), + 'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None, ), } diff --git a/docs/index.rst b/docs/index.rst index d6d2e269..db4d852d 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -33,6 +33,7 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`. intro/tutorial intro/from-ground-up + intro/overrides .. toctree:: :caption: Reference diff --git a/docs/intro/overrides.rst b/docs/intro/overrides.rst new file mode 100644 index 00000000..fa4c6eaa --- /dev/null +++ b/docs/intro/overrides.rst @@ -0,0 +1,678 @@ +.. _`intro-overrides`: + +Overrides +========= + +Overrides contains mapping rules to associate which URLs a particular +Page Object would be used. 
The URL matching rules are handled by another library +called `url-matcher <https://url-matcher.readthedocs.io/en/stable/>`_. + +Using such matching rules establishes the core concept of Overrides, wherein +it's able to use specific Page Objects in lieu of the original one. + +This enables ``web-poet`` to be used effectively by other frameworks like +`scrapy-poet <https://scrapy-poet.readthedocs.io>`_. + +Example Use Case +---------------- + +Let's explore an example use case for the Overrides concept. + +Suppose we're using Page Objects for our broadcrawl project which explores +eCommerce websites to discover product pages. It wouldn't be feasible +for us to create parsers for all websites since we don't know which sites we're +going to crawl beforehand. + +However, we could at least create a generic Page Object to support parsing of +some fields in well-known locations of product information like ````. +This enables our broadcrawler to at least parse some useful information. Let's +call this Page Object ``GenericProductPage``. + +Assuming that one of our project requirements is to fully support parsing of the +`top 3 eCommerce websites`, then we'd need to create a Page Object for each one +to parse more specific fields. + +Here's where the Overrides concept comes in: + + 1. The ``GenericProductPage`` is used to parse all eCommerce product pages + `by default`. + 2. Whenever one of our declared URL rules matches a given page URL, + then the Page Object associated with that rule `overrides (or replaces)` + the default ``GenericProductPage``. + +This enables us to conveniently declare which Page Object would be used for a +given webpage `(based on a page's URL pattern)`. + +Let's see this in action by declaring the Overrides in the Page Objects below. + + +Creating Overrides +------------------ + +Using Default Registry +~~~~~~~~~~~~~~~~~~~~~~ + +Let's take a look at how the following code is structured: + +.. 
code-block:: python + + from web_poet import handle_urls + from web_poet.pages import ItemWebPage + + + class GenericProductPage(ItemWebPage): + def to_item(self): + return {"product title": self.css("title::text").get()} + + + @handle_urls("example.com", overrides=GenericProductPage) + class ExampleProductPage(ItemWebPage): + def to_item(self): + ... # more specific parsing + + + @handle_urls("anotherexample.com", overrides=GenericProductPage, exclude="/digital-goods/") + class AnotherExampleProductPage(ItemWebPage): + def to_item(self): + ... # more specific parsing + + + @handle_urls(["dualexample.com/shop/?product=*", "dualexample.net/store/?pid=*"], overrides=GenericProductPage) + class DualExampleProductPage(ItemWebPage): + def to_item(self): + ... # more specific parsing + +The code above declares that: + + - For sites that match the ``example.com`` pattern, ``ExampleProductPage`` + would be used instead of ``GenericProductPage``. + - The same is true for ``DualExampleProductPage`` where it is used + instead of ``GenericProductPage`` for two URL patterns which works as: + + - :sub:`(match) https://www.dualexample.com/shop/electronics/?product=123` + - :sub:`(match) https://www.dualexample.com/shop/books/paperback/?product=849` + - :sub:`(NO match) https://www.dualexample.com/on-sale/books/?product=923` + - :sub:`(match) https://www.dualexample.net/store/kitchen/?pid=776` + - :sub:`(match) https://www.dualexample.net/store/?pid=892` + - :sub:`(NO match) https://www.dualexample.net/new-offers/fitness/?pid=892` + + - On the other hand, ``AnotherExampleProductPage`` is only used instead of + ``GenericProductPage`` when we're parsing pages from ``anotherexample.com`` + that doesn't contain ``/digital-goods/`` in its URL path. + +.. tip:: + + The URL patterns declared in the :func:`web_poet.handle_urls` can still be + further customized. 
You can read about the specific parameters and + alternative usages in the :ref:`API section <api-overrides>` of + :func:`web_poet.handle_urls`. + +Using Multiple Registries +~~~~~~~~~~~~~~~~~~~~~~~~~ + +For another way to declare the Override rules, see the +code example below: + +.. code-block:: python + + from web_poet import handle_urls, PageObjectRegistry + from web_poet.pages import ItemWebPage + + + clothes_registry = PageObjectRegistry(name="clothes") + + + class GenericProductPage(ItemWebPage): + def to_item(self): + return {"product title": self.css("title::text").get()} + + + @handle_urls(["dualexample.com/shop/?product=*", "dualexample.net/store/?pid=*"], overrides=GenericProductPage) + @clothes_registry.handle_urls("dualexample.com/shop/?category=clothes&product=*", overrides=GenericProductPage) + class DualExampleProductPage(ItemWebPage): + def to_item(self): + ... # more specific parsing + +In the example above, we're registering the Page Objects in two separate Registries. +Notice that ``DualExampleProductPage`` is declared in both of them +but with a different URL pattern. + +If you need more control over the Registry, you could instantiate your very +own :class:`~.PageObjectRegistry` and use its ``@handle_urls`` to annotate and +register the rules. This might benefit you in certain project use cases where you +need more organizational control over your rules. + +Such an approach could be especially useful when you're publishing your Page +Objects as an external dependency. Other projects may use it and could import +a specific Registry containing the URL rules that they may need. + +Viewing all available Overrides +------------------------------- + +A convenience method is available to discover and retrieve all :class:`~.OverrideRule` +instances from your project. Make sure to check out the :meth:`~.PageObjectRegistry.get_overrides` +API section to see other functionalities. + +.. 
code-block:: python + + from web_poet import default_registry + + # Retrieves all OverrideRules that were registered in the registry + rules = default_registry.get_overrides() + + # Or, we could also filter the OverrideRules by the module they were defined in + rules = default_registry.get_overrides(filters="my_project.page_objects") + + print(len(rules)) # 3 + print(rules[0]) # OverrideRule(for_patterns=Patterns(include=['example.com'], exclude=[], priority=500), use=<class 'my_project.page_objects.ExampleProductPage'>, instead_of=<class 'my_project.page_objects.GenericProductPage'>, meta={}) + +.. note:: + + Notice in the code sample above that we can filter the Override rules + per module via the ``filters`` param. This offers another + way to organize your Page Object rules by module hierarchies in your project. + It only requires using the ``default_registry``. There's no need + to declare multiple :class:`~.PageObjectRegistry` instances and use multiple + annotations. + +.. warning:: + + :meth:`~.PageObjectRegistry.get_overrides` relies on the fact that all essential + packages/modules which contain the :func:`web_poet.handle_urls` + annotations are properly loaded. + + Thus, for cases like importing Page Objects from another external package, you'd + need to properly load all :func:`web_poet.handle_urls` annotations + from the external module. This ensures that the external Page Objects have + their annotations properly loaded. + + This can be done via the function named :func:`~.web_poet.overrides.consume_modules`. + Here's an example: + + .. 
code-block:: python + + from web_poet import default_registry, consume_modules + + consume_modules("external_package_A.po", "another_ext_package.lib") + rules = default_registry.get_overrides() + + # Fortunately, `get_overrides()` provides a shortcut for the lines above: + rules = default_registry.get_overrides(consume=["external_package_A.po", "another_ext_package.lib"]) + +A handy CLI tool is also available at your disposal to quickly see the available +:class:`~.OverrideRule` in a given module in your project. For example, invoking +something like ``web_poet my_project.page_objects`` would produce the following: + +.. code-block:: + + Registry Use this instead of for the URL patterns except for the patterns with priority meta + --------- ---------------------------------------------------- ------------------------------------------ ------------------------------------------------------------------- ------------------------- --------------- ------ + default my_project.page_objects.ExampleProductPage my_project.page_objects.GenericProductPage ['example.com'] [] 500 {} + default my_project.page_objects.AnotherExampleProductPage my_project.page_objects.GenericProductPage ['anotherexample.com'] ['/digital-goods/'] 500 {} + default my_project.page_objects.DualExampleProductPage my_project.page_objects.GenericProductPage ['dualexample.com/shop/?product=*', 'dualexample.net/store/?pid=*'] [] 500 {} + +You can also filter them via the **name** of :class:`~.PageObjectRegistry`. For example, +invoking ``web_poet my_project.page_objects --registry_name=custom`` would produce +something like: + +.. 
code-block:: + + Registry Use this instead of for the URL patterns except for the patterns with priority meta + ---------- ---------------------------------------------------- ------------------------------------------ ---------------------- ------------------------- --------------- ------ + custom my_project.page_objects.CustomProductPage my_project.page_objects.GenericProductPage ['example.com'] [] 500 {} + custom my_project.page_objects.AnotherCustomProductPage my_project.page_objects.GenericProductPage ['anotherexample.com'] ['/digital-goods/'] 500 {} + +Organizing Page Object Overrides +-------------------------------- + +After tackling the two (2) different approaches from the previous chapters on how +to declare overrides, we can now explore how to organize them in our projects. +Although it's mostly up to the developer which override declaration method to +use. Yet, we'll present a few different approaches depending on the situation. + +To put this thought into action, let's suppose we are tasked to create a Page +Object Project with overrides for eCommerce websites. + +Package-based Approach +~~~~~~~~~~~~~~~~~~~~~~ + +Using the **package-based** approach, we might organize them into something like: + +.. code-block:: + + my_page_obj_project + ├── cool_gadget_site + | ├── us + | | ├── __init__.py + | | ├── products.py + | | └── product_listings.py + | ├── fr + | | ├── __init__.py + | | ├── products.py + | | └── product_listings.py + | └── __init__.py + └── furniture_shop + ├── __init__.py + ├── products.py + └── product_listings.py + +Assuming that we've declared the Page Objects in each of the modules to use the +``default_registry`` as something like: + +.. 
code-block:: python + + # my_page_obj_project/cool_gadget_site/us/products.py + + from web_poet import handle_urls # remember that this uses the default_registry + from web_poet.pages import ItemWebPage + + @handle_urls("coolgadgetsite.com", overrides=GenericProductPage) + class CoolGadgetUsSiteProductPage(ItemWebPage): + def to_item(self): + ... # parsers here + +Then we could easily retrieve all :class:`~.OverrideRule` filtered per subpackage +or module like this: + +.. code-block:: python + + from web_poet import default_registry, consume_modules + + # We can do it per website. + rules_gadget = default_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site") + rules_furniture = default_registry.get_overrides(filters="my_page_obj_project.furniture_shop") + + # It can also drill down to the country domains on a given site. + rules_gadget_us = default_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site.us") + rules_gadget_fr = default_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site.fr") + + # Or even drill down further to the specific module. + rules_gadget_us_products = default_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site.us.products") + rules_gadget_us_listings = default_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site.us.product_listings") + + # Or simply all of the Override rules ever declared. + rules = default_registry.get_overrides() + + # Lastly, if there are any external packages/modules, you'd need to load + # them properly so that their @handle_urls annotations are read. + consume_modules("external_package_A.po", "another_ext_package.lib") + rules = default_registry.get_overrides() + + # Remember, a shortcut for consuming imports would be: + rules = default_registry.get_overrides(consume=["external_package_A.po", "another_ext_package.lib"]) + + +.. 
warning:: + + Remember to call :func:`~.web_poet.overrides.consume_modules` + or use the ``consume`` param of :meth:`~.PageObjectRegistry.get_overrides` for the + imports to load properly, especially if you intend to use Page Objects + from externally imported packages. + + This enables the :meth:`~.PageObjectRegistry.handle_urls` that annotates + the external Page Objects to be properly loaded. + +Multiple Registry Approach +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The **package-based** approach heavily relies on how the developer organizes the +project modules into intuitive hierarchies depending on the nature of the project. +There might be cases where, for some reason, a developer would want to use a **flat +hierarchy** like this: + +.. code-block:: + + my_page_obj_project + ├── __init__.py + ├── cool_gadget_site_us_products.py + ├── cool_gadget_site_us_product_listings.py + ├── cool_gadget_site_fr_products.py + ├── cool_gadget_site_fr_product_listings.py + ├── furniture_shop_products.py + └── furniture_shop_product_listings.py + +As such, calling :meth:`~.PageObjectRegistry.get_overrides` with the ``filters`` +parameter would not work effectively on projects with a **flat hierarchy**. +Thus, we can organize them using our own instances of the :class:`~.PageObjectRegistry` +instead: + +.. code-block:: python + + # my_page_obj_project/__init__.py + + from web_poet import PageObjectRegistry + + cool_gadget_registry = PageObjectRegistry(name="cool_gadget") + cool_gadget_us_registry = PageObjectRegistry(name="cool_gadget_us") + cool_gadget_fr_registry = PageObjectRegistry(name="cool_gadget_fr") + furniture_shop_registry = PageObjectRegistry(name="furniture_shop") + +.. tip:: + + Later on, you can access all of the :class:`~.PageObjectRegistry` that were + ever instantiated. This can be done via ``web_poet.registry_pool`` which + simply holds a mapping structured as ``Dict[str, PageObjectRegistry]``. 
+ + So after declaring the :class:`~.PageObjectRegistry` instances above, we can + view them via: + + .. code-block:: python + + from web_poet import registry_pool + + print(registry_pool) + # { + # 'default': <web_poet.overrides.PageObjectRegistry object at 0x7f47d654d8b0>, + # 'cool_gadget': <my_page_obj_project.PageObjectRegistry object at 0x7f47d654382a>, + # 'cool_gadget_us': <my_page_obj_project.PageObjectRegistry object at 0xb247d65433c3>, + # 'cool_gadget_fr': <my_page_obj_project.PageObjectRegistry object at 0xd93746549dea>, + # 'furniture_shop': <my_page_obj_project.PageObjectRegistry object at 0x82n78654441b> + # } + + Notice that the ``default`` registry will always be present. + +.. warning:: + + Please be aware that there might be some :class:`~.PageObjectRegistry` + that are not available, especially if you're using them from external + packages. + + Thus, it's imperative to use :func:`~.web_poet.overrides.consume_modules` + beforehand. Not only does it help us find the :meth:`~.PageObjectRegistry.handle_urls` + annotations in external packages, it also finds the instances of + :class:`~.PageObjectRegistry`. + + Here's an example: + + .. code-block:: python + + from web_poet import registry_pool, consume_modules + + consume_modules("external_pkg") + + print(registry_pool) + # { + # 'default': <web_poet.overrides.PageObjectRegistry object at 0x7f47d654d8b0>, + # 'cool_gadget': <my_page_obj_project.PageObjectRegistry object at 0x7f47d654382a>, + # 'cool_gadget_us': <my_page_obj_project.PageObjectRegistry object at 0xb247d65433c3>, + # 'cool_gadget_fr': <my_page_obj_project.PageObjectRegistry object at 0xd93746549dea>, + # 'furniture_shop': <my_page_obj_project.PageObjectRegistry object at 0x82n78654441b>, + # 'ecommerce': <external_pkg.PageObjectRegistry object at 0xbc45d8328420> + # } + + Notice that the ``external_pkg.PageObjectRegistry`` named **ecommerce** has + now been successfully discovered. 
+ +After declaring the :class:`~.PageObjectRegistry` instances, they can be used +in each of the Page Object modules like so: + +.. code-block:: python + + # my_page_obj_project/cool_gadget_site_us_products.py + + from . import cool_gadget_registry, cool_gadget_us_registry + from web_poet.pages import ItemWebPage + + + @cool_gadget_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage) + @cool_gadget_us_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage) + class CoolGadgetSiteProductPage(ItemWebPage): + def to_item(self): + ... # parsers here + +Retrieving the rules would simply be: + +.. code-block:: python + + from my_page_obj_project import ( + cool_gadget_registry, + cool_gadget_us_registry, + cool_gadget_fr_registry, + furniture_shop_registry, + ) + + rules = cool_gadget_registry.get_overrides() + rules = cool_gadget_us_registry.get_overrides() + rules = cool_gadget_fr_registry.get_overrides() + rules = furniture_shop_registry.get_overrides() + +Developers can create as many :class:`~.PageObjectRegistry` instances as they want +in order to satisfy their organization and classification needs. + +Mixed Approach +~~~~~~~~~~~~~~ + +Developers are free to choose whichever approach would best fit their particular +use case. They can even mix both approaches to handle some particular +cases. + +For instance, going back to our **package-based** approach organized as: + +.. code-block:: + + my_page_obj_project + ├── cool_gadget_site + | ├── us + | | ├── __init__.py + | | ├── products.py + | | └── product_listings.py + | ├── fr + | | ├── __init__.py + | | ├── products.py + | | └── product_listings.py + | └── __init__.py + └── furniture_shop + ├── __init__.py + ├── products.py + └── product_listings.py + +Suppose we'd want to get all the rules for all of the listings `(ignoring anything +else)`, then one way to retrieve such rules would be: + +.. 
code-block:: python + + from web_poet import default_registry + + product_listing_rules = default_registry.get_overrides( + filters=[ + "my_page_obj_project.cool_gadget_site.us.product_listings", + "my_page_obj_project.cool_gadget_site.fr.product_listings", + "my_page_obj_project.furniture_shop.product_listings", + ] + ) + +On the other hand, we can also create another :class:`~.PageObjectRegistry` instance +that we'll be using aside from the ``default_registry`` to help us better organize +our :class:`~.OverrideRule`. + +.. code-block:: python + + # my_page_obj_project/__init__.py + + from web_poet import PageObjectRegistry + + product_listings_registry = PageObjectRegistry(name="product_listings") + +We'll use the new **product_listings_registry** instance above to +provide another annotation for the Page Objects in each of the +``product_listings.py`` modules. For example: + +.. code-block:: python + + # my_page_obj_project/cool_gadget_site/us/product_listings.py + + from my_page_obj_project import product_listings_registry + from web_poet import handle_urls # remember that this uses the default_registry + from web_poet.pages import ItemWebPage + + + @product_listings_registry.handle_urls("coolgadgetsite.com", overrides=GenericProductPage) + @handle_urls("coolgadgetsite.com", overrides=GenericProductPage) + class CoolGadgetSiteProductPage(ItemWebPage): + def to_item(self): + ... # parsers here + +Retrieving all of the Product Listing :class:`~.OverrideRule` would simply be: + +.. code-block:: python + + from my_page_obj_project import product_listings_registry + + # Getting all of the override rules for product listings. + rules = product_listings_registry.get_overrides() + + # We can also filter it down further on a per site basis if needed. 
+ rules = product_listings_registry.get_overrides(filters="my_page_obj_project.cool_gadget_site") + +Using Overrides from External Packages +-------------------------------------- + +Developers have the option to import existing Page Objects alongside the +:class:`~.OverrideRule` attached to them. This section aims to showcase different +ways you can play with the Registries to manipulate the :class:`~.OverrideRule` +according to your needs. + +Let's suppose we have the following use case before us: + + - An **external** Python package named ``ecommerce_page_objects`` is available + which contains Page Objects for common websites. It's using the + ``default_registry`` from **web-poet**. + - Another similar package named ``gadget_sites_page_objects`` is available + for even more specific websites. It's using its own registry named + ``gadget_registry``. + - Your project's objective is to handle as many eCommerce websites as you + can. Thus, you'd want to use the already available packages above and + perhaps improve on them or create new Page Objects for new websites. + +Assuming that you'd want to **use all existing** :class:`~.OverrideRule` **from +the external packages** in your project, you can do it like: + +.. code-block:: python + + import ecommerce_page_objects + import gadget_sites_page_objects + from web_poet import PageObjectRegistry, consume_modules, default_registry + + # We're using `consume_modules()` here instead of the `consume` param of + # `PageObjectRegistry.get_overrides()` since we need to properly load all + # of the annotated rules from the registry. 
+ consume_modules("ecommerce_page_objects", "gadget_sites_page_objects") + + combined_registry = PageObjectRegistry(name="combined") + combined_registry.copy_overrides_from( + # Since ecommerce_page_objects is using web_poet.default_registry, then + # it functions like a global registry which we can access simply as: + default_registry, + + # External packages not using the web_poet.default_registry would need + # to have their own registry accessed. + gadget_sites_page_objects.gadget_registry + ) + + combined_rules = combined_registry.get_overrides() + + # The combined_rules would be as follows: + # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) + # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) + # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + # 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + +As you can see in the example above, we can easily combine the rules from multiple +different registries. There won't be any duplication of :class:`~.OverrideRule` +entries since :meth:`PageObjectRegistry.copy_overrides_from` already deduplicates +the rules. + +You might've observed that combining the two Registries above may result in a +conflict for the :class:`~.OverrideRule` for **#2** and **#3**: + +.. code-block:: python + + # 2. 
OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'ecommerce_page_objects.EcomGenericPage'>, meta={}) + # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'gadget_sites_page_objects.GadgetGenericPage'>, meta={}) + +The `url-matcher`_ library is the one responsible for resolving such conflicts. It's +specifically discussed in this section: `rules-conflict-resolution +<https://url-matcher.readthedocs.io/en/stable/intro.html#rules-conflict-resolution>`_. + +However, it's technically **NOT** a conflict, **yet**, since: + + - ``ecommerce_page_objects.site_2.EcomSite2`` would only be used in **site_2.com** + if ``ecommerce_page_objects.EcomGenericPage`` is to be replaced. + - The same applies to ``gadget_sites_page_objects.site_2.GadgetSite2``, wherein + it's only going to be utilized for **site_2.com** if the following is to be + replaced: ``gadget_sites_page_objects.GadgetGenericPage``. + +It would only become a conflict if the **#2** and **#3** :class:`~.OverrideRule` +for **site_2.com** both `intend to replace the` **same** `Page Object`. In fact, +none of the :class:`~.OverrideRule` above would ever be used if your project never +intends to use the following Page Objects *(since there's nothing to override)*. +You can import these Page Objects into your project and use them so they can be +overridden: + + - ``ecommerce_page_objects.EcomGenericPage`` + - ``gadget_sites_page_objects.GadgetGenericPage`` + +However, let's assume that you want to create your own generic Page Object and +only intend to use it instead of the ones above. We can easily replace them like: + +.. 
code-block:: python + + # Our new generic Page Object that we'd prefer instead of: + # - ecommerce_page_objects.EcomGenericPage + # - gadget_sites_page_objects.GadgetGenericPage + class ImprovedEcommerceGenericPage: + def to_item(self): + ... # different type of generic parsers + + for rule in combined_registry.get_overrides(): + combined_registry.replace_override(rule, instead_of=ImprovedEcommerceGenericPage) + + updated_rules = combined_registry.get_overrides() + + # The updated_rules would be as follows: + # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_2.EcomSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 3. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 4. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + +Now, **#2** and **#3** have a conflict since they now both intend to replace +``ImprovedEcommerceGenericPage``. As mentioned earlier, the `url-matcher`_ +would be the one to resolve such conflicts. + +However, it would help prevent future confusion if we could remove the source of +ambiguity in our :class:`~.OverrideRule`. + +Suppose, we prefer ``gadget_sites_page_objects.site_2.GadgetSite2`` more than +``ecommerce_page_objects.site_2.EcomSite2``. As such, we could remove the latter: + +.. 
code-block:: python + + rules = combined_registry.search_overrides(use=ecommerce_page_objects.site_2.EcomSite2) + combined_registry.remove_overrides(*rules) + + updated_rules = combined_registry.get_overrides() + + # The newly updated_rules would be as follows: + # 1. OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'ecommerce_page_objects.site_1.EcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 2. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 3. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + +Now, suppose we want to improve ``ecommerce_page_objects.site_1.EcomSite1`` +from **#1** above by perhaps adding/fixing fields. We can do that by: + +.. code-block:: python + + class ImprovedEcomSite1(ecommerce_page_objects.site_1.EcomSite1): + def to_item(self): + ... # replace and improve some of the parsers here + + rules = combined_registry.search_overrides(use=ecommerce_page_objects.site_1.EcomSite1) + for rule in rules: + combined_registry.replace_override(rule, use=ImprovedEcomSite1) + + updated_rules = combined_registry.get_overrides() + + # The newly updated_rules would be as follows: + # 1. OverrideRule(for_patterns=Patterns(include=['site_2.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_2.GadgetSite2'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 2. OverrideRule(for_patterns=Patterns(include=['site_3.com'], exclude=[], priority=500), use=<class 'gadget_sites_page_objects.site_3.GadgetSite3'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) + # 3. 
OverrideRule(for_patterns=Patterns(include=['site_1.com'], exclude=[], priority=500), use=<class 'my_project.ImprovedEcomSite1'>, instead_of=<class 'my_project.ImprovedEcommerceGenericPage'>, meta={}) diff --git a/docs/intro/tutorial.rst b/docs/intro/tutorial.rst index 5afbbd6c..9b8f7c01 100644 --- a/docs/intro/tutorial.rst +++ b/docs/intro/tutorial.rst @@ -131,4 +131,4 @@ As you can see, it's possible to use web-poet with built-in libraries such as `scrapy-poet <https://scrapy-poet.readthedocs.io>`_. If you want to understand the idea behind web-poet better, -check the :ref:`from-ground-up` tutorial. \ No newline at end of file +check the :ref:`from-ground-up` tutorial. diff --git a/docs/requirements.txt b/docs/requirements.txt index d9806645..6a9e9681 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -5,4 +5,4 @@ sphinxcontrib-devhelp==1.0.2 sphinxcontrib-htmlhelp==2.0.0 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.3 -sphinxcontrib-serializinghtml==1.1.5 +sphinxcontrib-serializinghtml==1.1.5 \ No newline at end of file diff --git a/setup.py b/setup.py index 9553a73c..86e103bd 100644 --- a/setup.py +++ b/setup.py @@ -14,6 +14,7 @@ author='Scrapinghub', author_email='info@scrapinghub.com', url='https://github.com/scrapinghub/web-poet', + entry_points={'console_scripts': ['web_poet = web_poet.__main__:main']}, packages=find_packages( exclude=( 'tests', @@ -22,6 +23,8 @@ install_requires=( 'attrs', 'parsel', + 'url-matcher', + 'tabulate', ), classifiers=( 'Development Status :: 2 - Pre-Alpha', diff --git a/tests/po_lib/__init__.py b/tests/po_lib/__init__.py new file mode 100644 index 00000000..51e9631c --- /dev/null +++ b/tests/po_lib/__init__.py @@ -0,0 +1,44 @@ +""" +This package is just for overrides testing purposes. +""" +from typing import Dict, Any, Callable + +from url_matcher import Patterns + +from .. 
import po_lib_sub # NOTE: this module contains a PO with @handle_urls +from web_poet import handle_urls, PageObjectRegistry + + +class POBase: + expected_overrides: Callable + expected_patterns: Patterns + expected_meta: Dict[str, Any] + + +class POTopLevelOverriden1: + ... + + +class POTopLevelOverriden2: + ... + + +secondary_registry = PageObjectRegistry(name="secondary") + + +# The first (topmost) annotation is ignored: only a single annotation per registry is allowed +@handle_urls("example.com", POTopLevelOverriden1) +@handle_urls("example.com", POTopLevelOverriden1, exclude="/*.jpg|", priority=300) +class POTopLevel1(POBase): + expected_overrides = POTopLevelOverriden1 + expected_patterns = Patterns(["example.com"], ["/*.jpg|"], priority=300) + expected_meta = {} # type: ignore + + +# The second annotation is for a different registry +@handle_urls("example.com", POTopLevelOverriden2) +@secondary_registry.handle_urls("example.org", POTopLevelOverriden2) +class POTopLevel2(POBase): + expected_overrides = POTopLevelOverriden2 + expected_patterns = Patterns(["example.com"]) + expected_meta = {} # type: ignore diff --git a/tests/po_lib/a_module.py b/tests/po_lib/a_module.py new file mode 100644 index 00000000..0dcf04c6 --- /dev/null +++ b/tests/po_lib/a_module.py @@ -0,0 +1,16 @@ +from url_matcher import Patterns + +from tests.po_lib import POBase +from web_poet import handle_urls + + +class POModuleOverriden: + ...
+ + +@handle_urls("example.com", overrides=POModuleOverriden, extra_arg="foo") +class POModule(POBase): + expected_overrides = POModuleOverriden + expected_patterns = Patterns(["example.com"]) + expected_meta = {"extra_arg": "foo"} # type: ignore + diff --git a/tests/po_lib/an_empty_module.py b/tests/po_lib/an_empty_module.py new file mode 100644 index 00000000..e69de29b diff --git a/tests/po_lib/an_empty_package/__init__.py b/tests/po_lib/an_empty_package/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/tests/po_lib/nested_package/__init__.py b/tests/po_lib/nested_package/__init__.py new file mode 100644 index 00000000..537a995d --- /dev/null +++ b/tests/po_lib/nested_package/__init__.py @@ -0,0 +1,15 @@ +from url_matcher import Patterns + +from tests.po_lib import POBase +from web_poet import handle_urls + + +class PONestedPkgOverriden: + ... + + +@handle_urls(include=["example.com", "example.org"], exclude=["/*.jpg|"], overrides=PONestedPkgOverriden) +class PONestedPkg(POBase): + expected_overrides = PONestedPkgOverriden + expected_patterns = Patterns(["example.com", "example.org"], ["/*.jpg|"]) + expected_meta = {} # type: ignore diff --git a/tests/po_lib/nested_package/a_nested_module.py b/tests/po_lib/nested_package/a_nested_module.py new file mode 100644 index 00000000..bc2424fe --- /dev/null +++ b/tests/po_lib/nested_package/a_nested_module.py @@ -0,0 +1,21 @@ +from url_matcher import Patterns + +from tests.po_lib import POBase, secondary_registry +from web_poet import handle_urls + + +class PONestedModuleOverriden: + ... + + +class PONestedModuleOverridenSecondary: + ... 
+ + +@handle_urls(include=["example.com", "example.org"], exclude=["/*.jpg|"], overrides=PONestedModuleOverriden) +@secondary_registry.handle_urls("example.com", PONestedModuleOverridenSecondary) +class PONestedModule(POBase): + expected_overrides = PONestedModuleOverriden + expected_patterns = Patterns(include=["example.com", "example.org"], exclude=["/*.jpg|"]) + expected_meta = {} # type: ignore + diff --git a/tests/po_lib_sub/__init__.py b/tests/po_lib_sub/__init__.py new file mode 100644 index 00000000..33850f5c --- /dev/null +++ b/tests/po_lib_sub/__init__.py @@ -0,0 +1,25 @@ +"""This package is used by tests/po_lib to validate some behaviors on +external dependencies. +""" +from typing import Dict, Any, Callable + +from url_matcher import Patterns + +from web_poet import handle_urls + + +class POBase: + expected_overrides: Callable + expected_patterns: Patterns + expected_meta: Dict[str, Any] + + +class POLibSubOverriden: + ... + + +@handle_urls("sub_example.com", POLibSubOverriden) +class POLibSub(POBase): + expected_overrides = POLibSubOverriden + expected_patterns = Patterns(["sub_example.com"]) + expected_meta = {} # type: ignore diff --git a/tests/test_overrides.py b/tests/test_overrides.py new file mode 100644 index 00000000..567d18ff --- /dev/null +++ b/tests/test_overrides.py @@ -0,0 +1,268 @@ +import argparse +import dataclasses + +import pytest +from url_matcher import Patterns + +from tests.po_lib_sub import POLibSub +from tests.po_lib import ( + POTopLevel1, + POTopLevel2, + POTopLevelOverriden2, + secondary_registry, +) +from tests.po_lib.a_module import POModule +from tests.po_lib.nested_package import PONestedPkg +from tests.po_lib.nested_package.a_nested_module import ( + PONestedModule, + PONestedModuleOverridenSecondary, +) +from web_poet import PageObjectRegistry, default_registry, registry_pool +from web_poet.overrides import OverrideRule + + +POS = {POTopLevel1, POTopLevel2, POModule, PONestedPkg, PONestedModule} + + +def
test_override_rule_uniqueness(): + """Two OverrideRule instances with the same ``for_patterns``, ``use``, and + ``instead_of`` values should have the same hash, even if ``meta`` differs. + """ + + patterns = Patterns(include=["example.com"], exclude=["example.com/blog"]) + + rule1 = OverrideRule( + for_patterns=patterns, + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + meta={"key_1": 1} + ) + rule2 = OverrideRule( + for_patterns=patterns, + use=POTopLevel1, + instead_of=POTopLevelOverriden2, + meta={"key_2": 2} + ) + + assert hash(rule1) == hash(rule2) + + +def test_list_page_objects_all(): + rules = default_registry.get_overrides() + page_objects = {po.use for po in rules} + + # Note that the 'tests_extra.po_lib_sub_not_imported.POLibSubNotImported' + # Page Object is not included here since it was never imported anywhere in + # our test package. It would only be included if we ran any of the + # following. (Note that they should run before `get_overrides` is called.) + # - from tests_extra import po_lib_sub_not_imported + # - import tests_extra.po_lib_sub_not_imported + # - web_poet.consume_modules("tests_extra") + # Merely having `import tests_extra` won't work since the subpackages and + # modules need to be traversed and imported as well. + assert all(["po_lib_sub_not_imported" not in po.__module__ for po in page_objects]) + + # Ensure that ALL Override Rules are returned as long as the given + # registry's @handle_urls annotation was used. + assert page_objects == POS.union({POLibSub}) + for rule in rules: + assert rule.instead_of == rule.use.expected_overrides, rule.use + assert rule.for_patterns == rule.use.expected_patterns, rule.use + assert rule.meta == rule.use.expected_meta, rule.use + + +def test_list_page_objects_all_consume(): + """A test similar to the one above but calls ``consume_modules()`` to properly + load the @handle_urls annotations from other modules/packages.
+ """ + rules = default_registry.get_overrides(consume="tests_extra") + page_objects = {po.use for po in rules} + assert any(["po_lib_sub_not_imported" in po.__module__ for po in page_objects]) + + +def test_list_page_objects_from_pkg(): + """Tests that metadata is extracted properly from the po_lib package""" + rules = default_registry.get_overrides(filters="tests.po_lib") + page_objects = {po.use for po in rules} + + # Ensure that the "tests.po_lib", which imports another module named + # "tests.po_lib_sub" which contains @handle_urls decorators, does not + # retrieve the override rules from the external package. + assert POLibSub not in page_objects + + assert page_objects == POS + for rule in rules: + assert rule.instead_of == rule.use.expected_overrides, rule.use + assert rule.for_patterns == rule.use.expected_patterns, rule.use + assert rule.meta == rule.use.expected_meta, rule.use + + +def test_list_page_objects_from_single(): + rules = default_registry.get_overrides(filters="tests.po_lib.a_module") + assert len(rules) == 1 + rule = rules[0] + assert rule.use == POModule + assert rule.for_patterns == POModule.expected_patterns + assert rule.instead_of == POModule.expected_overrides + + +def test_list_page_objects_from_multiple(): + rules = default_registry.get_overrides( + filters=[ + "tests.po_lib.a_module", + "tests.po_lib.nested_package.a_nested_module", + ] + ) + assert len(rules) == 2 + + assert rules[0].use == POModule + assert rules[0].for_patterns == POModule.expected_patterns + assert rules[0].instead_of == POModule.expected_overrides + + assert rules[1].use == PONestedModule + assert rules[1].for_patterns == PONestedModule.expected_patterns + assert rules[1].instead_of == PONestedModule.expected_overrides + + +def test_list_page_objects_from_empty_module(): + rules = default_registry.get_overrides(filters="tests.po_lib.an_empty_module") + assert len(rules) == 0 + + +def test_list_page_objects_from_empty_pkg(): + rules = 
default_registry.get_overrides(filters="tests.po_lib.an_empty_package") + assert len(rules) == 0 + + +def test_list_page_objects_from_unknown_module(): + with pytest.raises(ImportError): + default_registry.get_overrides(filters="tests.po_lib.unknown_module") + + +def test_list_page_objects_from_imported_registry(): + rules = secondary_registry.get_overrides(filters="tests.po_lib") + assert len(rules) == 2 + rule_for = {po.use: po for po in rules} + + potop2 = rule_for[POTopLevel2] + assert potop2.for_patterns == Patterns(["example.org"]) + assert potop2.instead_of == POTopLevelOverriden2 + + pones = rule_for[PONestedModule] + assert pones.for_patterns == Patterns(["example.com"]) + assert pones.instead_of == PONestedModuleOverridenSecondary + + +def test_registry_name_conflict(): + """Registries can only have valid unique names.""" + + PageObjectRegistry("main") + + assert "main" in registry_pool + + with pytest.raises(ValueError): + PageObjectRegistry("main") # a duplicate name + + with pytest.raises(TypeError): + PageObjectRegistry() + + with pytest.raises(ValueError): + PageObjectRegistry("") + + +def test_registry_copy_overrides_from(): + combined_registry = PageObjectRegistry("combined") + combined_registry.copy_overrides_from(default_registry, secondary_registry) + + # Copying overrides from other PageObjectRegistries should have duplicate + # OverrideRules removed. 
+ combined_rule_count = combined_registry.get_overrides() + assert len(combined_rule_count) == 7 + + raw_count = len(default_registry.get_overrides()) + len(secondary_registry.get_overrides()) + assert len(combined_rule_count) < raw_count + + # Copying overrides again does not result in duplicates + combined_registry.copy_overrides_from(default_registry, secondary_registry) + combined_registry.copy_overrides_from(default_registry, secondary_registry) + combined_registry.copy_overrides_from(default_registry, secondary_registry) + # Re-fetch the rules: the list retrieved earlier is a stale snapshot + assert len(combined_registry.get_overrides()) == 7 + + +def test_registry_replace_override(): + registry = PageObjectRegistry("replace") + registry.copy_overrides_from(secondary_registry) + rules = registry.get_overrides() + + replacement_rule = registry.replace_override(rules[0], instead_of=POTopLevel1) + + new_rules = registry.get_overrides() + assert len(new_rules) == 2 + assert new_rules[-1].instead_of == POTopLevel1 # newly replaced rules are at the bottom + assert replacement_rule.instead_of == POTopLevel1 # the returned rule reflects the replacement + + # Replacing a rule not in the registry would result in ValueError + rule_not_in_registry = dataclasses.replace(new_rules[0], instead_of=POTopLevelOverriden2) + with pytest.raises(ValueError): + registry.replace_override(rule_not_in_registry, instead_of=POTopLevel2) + + +def test_registry_search_overrides(): + registry = PageObjectRegistry("search") + registry.copy_overrides_from(secondary_registry) + + rules = registry.search_overrides(use=POTopLevel2) + assert len(rules) == 1 + assert rules[0].use == POTopLevel2 + + rules = registry.search_overrides(instead_of=POTopLevelOverriden2) + assert len(rules) == 1 + assert rules[0].instead_of == POTopLevelOverriden2 + + rules = registry.search_overrides( + instead_of=PONestedModuleOverridenSecondary, use=PONestedModule + ) + assert len(rules) == 1 + assert rules[0].instead_of == PONestedModuleOverridenSecondary + assert rules[0].use == PONestedModule + + # These rules
don't exist + rules = registry.search_overrides(use=POTopLevel1) + assert len(rules) == 0 + + rules = registry.search_overrides(instead_of=POTopLevel1) + assert len(rules) == 0 + + +def test_registry_remove_overrides(): + registry = PageObjectRegistry("remove") + registry.copy_overrides_from(secondary_registry) + + rules = registry.get_overrides() + + registry.remove_overrides(*rules) + assert len(registry.get_overrides()) == 0 + + # Removing non-existing rules won't error out. + registry.remove_overrides(*rules) + assert len(registry.get_overrides()) == 0 + + +def test_cli_tool(): + """Ensure that CLI parameters return the expected results. + + There's no need to check each specific OverrideRule below as we already have + extensive tests for those above. We can simply count how many rules there are + for a given registry. + """ + + from web_poet.__main__ import main + + results = main(["tests"]) + assert len(results) == 6 + + results = main(["tests", "--registry_name=secondary"]) + assert len(results) == 2 + + results = main(["tests", "--registry_name=not_exist"]) + assert not results diff --git a/tests_extra/__init__.py b/tests_extra/__init__.py new file mode 100644 index 00000000..62c40098 --- /dev/null +++ b/tests_extra/__init__.py @@ -0,0 +1,5 @@ +""" +This test package was created separately to see the behavior of retrieving the +Override rules declared on a registry where @handle_urls is defined on another +package. +""" diff --git a/tests_extra/po_lib_sub_not_imported/__init__.py b/tests_extra/po_lib_sub_not_imported/__init__.py new file mode 100644 index 00000000..8f68f79b --- /dev/null +++ b/tests_extra/po_lib_sub_not_imported/__init__.py @@ -0,0 +1,28 @@ +""" +This package is quite similar to tests/po_lib_sub in terms of code contents. + +What we're ultimately trying to test here is to see if the `default_registry` +captures the rules annotated in this module if it was not imported.
+""" +from typing import Dict, Any, Callable + +from url_matcher import Patterns + +from web_poet import handle_urls + + +class POBase: + expected_overrides: Callable + expected_patterns: Patterns + expected_meta: Dict[str, Any] + + +class POLibSubOverridenNotImported: + ... + + +@handle_urls("sub_example_not_imported.com", POLibSubOverridenNotImported) +class POLibSubNotImported(POBase): + expected_overrides = POLibSubOverridenNotImported + expected_patterns = Patterns(["sub_example_not_imported.com"]) + expected_meta = {} # type: ignore diff --git a/tox.ini b/tox.ini index ea67701a..e86fd602 100644 --- a/tox.ini +++ b/tox.ini @@ -15,6 +15,7 @@ commands = [testenv:mypy] deps = mypy + types-tabulate commands = mypy --ignore-missing-imports web_poet tests diff --git a/web_poet/__init__.py b/web_poet/__init__.py index e5bf1f54..7c865e1e 100644 --- a/web_poet/__init__.py +++ b/web_poet/__init__.py @@ -1,2 +1,15 @@ +from typing import Dict + from .pages import WebPage, ItemPage, ItemWebPage, Injectable -from .page_inputs import ResponseData \ No newline at end of file +from .page_inputs import ResponseData +from .overrides import ( + PageObjectRegistry, + consume_modules, + registry_pool, +) + + +# For ease of use, we'll create a default registry so that users can simply +# use its `handle_urls()` method directly by `from web_poet import handle_urls` +default_registry = PageObjectRegistry(name="default") +handle_urls = default_registry.handle_urls diff --git a/web_poet/__main__.py b/web_poet/__main__.py new file mode 100644 index 00000000..34978e67 --- /dev/null +++ b/web_poet/__main__.py @@ -0,0 +1,79 @@ +"""Returns all Override Rules from the default registry.""" + +import argparse +from typing import Callable, Optional, List + +import tabulate + +from web_poet import registry_pool, consume_modules, PageObjectRegistry +from web_poet.overrides import OverrideRule + + +def qualified_name(cls: Callable) -> str: + return f"{cls.__module__}.{cls.__name__}" + + +def 
parse_args(raw_args: Optional[List[str]] = None) -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Tool that lists the Page Object overrides from a package or module recursively" + ) + parser.add_argument( + "module", + metavar="PKG_OR_MODULE", + type=str, + help="A package or module to list overrides from", + ) + parser.add_argument( + "--registry_name", + default="default", + type=str, + help="Name of the registry to retrieve the rules from.", + ) + return parser.parse_args(args=raw_args) + + +def load_registry(args: argparse.Namespace) -> Optional[PageObjectRegistry]: + consume_modules(args.module) + registry = registry_pool.get(args.registry_name) + return registry + + +def display_table(registry_name: str, rules: List[OverrideRule]) -> None: + headers = [ + "Registry", + "Use this", + "instead of", + "for the URL patterns", + "except for the patterns", + "with priority", + "meta", + ] + + table = [ + ( + registry_name, + qualified_name(rule.use), + qualified_name(rule.instead_of), + rule.for_patterns.include, + rule.for_patterns.exclude, + rule.for_patterns.priority, + rule.meta, + ) + for rule in rules + ] + print(tabulate.tabulate(table, headers=headers)) + + +def main(raw_args: Optional[List[str]] = None) -> Optional[List[OverrideRule]]: + args = parse_args(raw_args) # pragma: no cover + registry = load_registry(args) + if not registry: + print(f"No registry named {args.registry_name} found.") + return None + rules = registry.get_overrides(filters=args.module) + display_table(registry.name, rules) + return rules # for ease of testing + + +if __name__ == "__main__": + main() # pragma: no cover diff --git a/web_poet/overrides.py b/web_poet/overrides.py new file mode 100644 index 00000000..f1b92548 --- /dev/null +++ b/web_poet/overrides.py @@ -0,0 +1,446 @@ +from __future__ import annotations # https://www.python.org/dev/peps/pep-0563/ + +import importlib +import importlib.util +import warnings +import pkgutil +from collections import
deque +from dataclasses import dataclass, field, replace +from operator import attrgetter +from types import ModuleType +from typing import Iterable, Optional, Union, List, Callable, Dict, Any + +from url_matcher import Patterns + +Strings = Union[str, Iterable[str]] + + +@dataclass(frozen=True) +class OverrideRule: + """A single override rule that specifies when a Page Object should be used + instead of another. + + This is instantiated when using the :func:`web_poet.handle_urls` decorator. + It's also returned as a ``List[OverrideRule]`` when calling + :meth:`~.PageObjectRegistry.get_overrides`. + + You can access any of its attributes: + + * ``for_patterns: Patterns`` - contains the URL patterns associated + with this rule. You can read the API documentation of the + `url-matcher <https://url-matcher.readthedocs.io/>`_ package for more + information. + * ``use: Callable`` - the Page Object that will be used. + * ``instead_of: Callable`` - the Page Object that will be **replaced**. + * ``meta: Dict[str, Any] = field(default_factory=dict)`` - Any other + information you may want to store. This doesn't do anything for now + but may be useful for future API updates. + """ + + for_patterns: Patterns + use: Callable + instead_of: Callable + meta: Dict[str, Any] = field(default_factory=dict) + + def __hash__(self): + return hash((self.for_patterns, self.use, self.instead_of)) + + +def _as_list(value: Optional[Strings]) -> List[str]: + """ + >>> _as_list(None) + [] + >>> _as_list("foo") + ['foo'] + >>> _as_list(["foo", "bar"]) + ['foo', 'bar'] + """ + if value is None: + return [] + if isinstance(value, str): + return [value] + return list(value) + + +class PageObjectRegistry: + """This contains the mapping rules that associate the Page Objects available + for a given URL matching rule. + + Different Registry classes can be used to create different groups of + annotations. Here's an example usage: + + ..
code-block:: python + + from web_poet import PageObjectRegistry + + main_registry = PageObjectRegistry(name="main") + secondary_registry = PageObjectRegistry(name="secondary") + + @main_registry.handle_urls("example.com", overrides=ProductPageObject) + @secondary_registry.handle_urls("example.com/shop/?id=*", overrides=ProductPageObject) + class ExampleComProductPage(ItemPage): + ... + + .. warning:: + + Each :class:`~.PageObjectRegistry` instance should have a unique **name** + value. Otherwise, a ``ValueError`` is raised. + + The annotation indicates that the ``ExampleComProductPage`` + Page Object should be used instead of the ``ProductPageObject`` Page + Object for all the URLs whose top level domain is ``example.com``. + + Moreover, this rule is available for the two (2) registries we've declared. + This could be useful in cases wherein you want to categorize the rules by + :class:`~.PageObjectRegistry`. They could each be accessed via: + + .. code-block:: python + + rules_main = main_registry.get_overrides() + rules_secondary = secondary_registry.get_overrides() + + On the other hand, ``web-poet`` already provides a default Registry named + ``default_registry`` for convenience. It can be directly accessed via: + + .. code-block:: python + + from web_poet import handle_urls, default_registry + + @handle_urls("example.com", overrides=ProductPageObject) + class ExampleComProductPage(ItemPage): + ... + + override_rules = default_registry.get_overrides() + + Notice that the ``handle_urls`` that we've imported is a part of + ``default_registry``. This provides a shorter and quicker way to interact + with the built-in default Registry. + + In addition, if you need to organize your Page Objects in your project, a + single (1) default instance of the :class:`~.PageObjectRegistry` would work, + as long as you organize your files into modules.
+ + The rules could then be accessed using this method: + + * ``default_registry.get_overrides(filters="my_scrapy_project.page_objects.site_A")`` + * ``default_registry.get_overrides(filters="my_scrapy_project.page_objects.site_B")`` + + Lastly, you can access all of the :class:`~.PageObjectRegistry` that were + ever instantiated via ``web_poet.registry_pool`` which is simply a mapping + structured as ``Dict[str, PageObjectRegistry]``: + + .. code-block:: python + + from web_poet import registry_pool + + print(registry_pool) + # { + # 'default': <web_poet.overrides.PageObjectRegistry object at 0x7f47d654d8b0>, + # 'main': <web_poet.overrides.PageObjectRegistry object at 0x7f47d525c3d0>, + # 'secondary': <web_poet.overrides.PageObjectRegistry object at 0x7f47d52024c0> + # } + + .. warning:: + + Please be aware that there might be some :class:`~.PageObjectRegistry` + that are not available, most especially if you're using them from external + packages. + + Thus, it's imperative to use :func:`~.web_poet.overrides.consume_modules` + beforehand: + + .. 
code-block:: python + + from web_poet import registry_pool, consume_modules + + consume_modules("external_pkg") + + print(registry_pool) + # { + # 'default': <web_poet.overrides.PageObjectRegistry object at 0x7f47d654d8b0>, + # 'main': <web_poet.overrides.PageObjectRegistry object at 0x7f47d525c3d0>, + # 'secondary': <web_poet.overrides.PageObjectRegistry object at 0x7f47d52024c0>, + # 'ecommerce': <external_pkg.PageObjectRegistry object at 0x7f47d8328420> + # } + """ + + def __init__(self, name: str): + self._data: Dict[Callable, OverrideRule] = {} + + if not name: + raise ValueError("A registry should have a name.") + + if name in registry_pool: + raise ValueError(f"A registry named '{name}' already exists.") + registry_pool[name] = self + self.name = name + + def handle_urls( + self, + include: Strings, + overrides: Callable, + *, + exclude: Optional[Strings] = None, + priority: int = 500, + **kwargs, + ): + """ + Class decorator that indicates that the decorated Page Object should be + used instead of the overridden one for a particular set of URLs. + + Which Page Object is overridden is determined by the ``overrides`` + parameter. + + Over which URLs the override happens is determined by the ``include``, + ``exclude`` and ``priority`` parameters. See the documentation of the + `url-matcher <https://url-matcher.readthedocs.io/>`_ package for more + information about them. + + Any extra parameters are stored as meta information that can be later used. + + :param include: Defines the URLs that should be handled by the decorated Page Object. + :param overrides: The Page Object that should be replaced by the annotated one. + :param exclude: Defines URLs over which the override should not happen. + :param priority: The resolution priority in case of conflicting annotations.
+ """ + + def wrapper(cls): + rule = OverrideRule( + for_patterns=Patterns( + include=_as_list(include), + exclude=_as_list(exclude), + priority=priority, + ), + use=cls, + instead_of=overrides, + meta=kwargs, + ) + # If it was already defined, we don't want to override it + if cls not in self._data: + self._data[cls] = rule + else: + warnings.warn( + f"Multiple @handle_urls annotations with the same 'overrides' " + f"are ignored in the same Registry. Ignoring duplicate " + f"annotation on '{include}' for {cls}." + ) + + return cls + + return wrapper + + def get_overrides( + self, consume: Optional[Strings] = None, filters: Optional[Strings] = None + ) -> List[OverrideRule]: + """Returns a ``List`` of :class:`~.OverrideRule` that were declared using + ``@handle_urls``. + + :param consume: packages/modules that need to be imported so that it can + properly load the :meth:`~.PageObjectRegistry.handle_urls` annotations. + :param filters: packages/modules that are of interest can be declared + here to easily extract the rules from them. Use this when you need + to pinpoint specific rules. + + .. warning:: + + Remember to consider using the ``consume`` parameter to properly load + the :meth:`~.PageObjectRegistry.handle_urls` from external Page + Objects + + The ``consume`` parameter provides a convenient shortcut for calling + :func:`~.web_poet.overrides.consume_modules`. + """ + if consume: + consume_modules(*_as_list(consume)) + + if not filters: + return list(self._data.values()) + + else: + # Dict ensures that no duplicates are collected and returned. + rules: Dict[Callable, OverrideRule] = {} + + for item in _as_list(filters): + for mod in walk_module(item): + rules.update(self._filter_from_module(mod.__name__)) + + return list(rules.values()) + + def _filter_from_module(self, module: str) -> Dict[Callable, OverrideRule]: + return { + cls: rule + for cls, rule in self._data.items() + # A "." 
is added at the end to prevent incorrect matching on cases + # where package names are substrings of one another. For example, + # if module = "my_project.po_lib", then it filters like so: + # - "my_project.po_lib_sub.POLibSub" (filtered out) + # - "my_project.po_lib.POTopLevel1" (accepted) + # - "my_project.po_lib.nested_package.PONestedPkg" (accepted) + if cls.__module__.startswith(module + ".") or cls.__module__ == module + } + + def copy_overrides_from(self, *page_object_registries: PageObjectRegistry) -> None: + """Copies the :class:`OverrideRule` data from one or more + :class:`PageObjectRegistry` instances. + + Duplicate :class:`OverrideRule` instances are skipped. + """ + + for registry in page_object_registries: + for rule in registry.get_overrides(): + if rule.use not in self._data: + self._data[rule.use] = rule + + def replace_override(self, rule: OverrideRule, **kwargs) -> OverrideRule: + """Given a :class:`OverrideRule`, replace its attributes with the new + ones specified. + + If the supplied :class:`OverrideRule` instance does not belong to the + registry, a ``ValueError`` is raised. + + .. note:: + + Since :class:`OverrideRule` are frozen dataclasses, this method + removes the instance of the old rule completely and instead, creates + a new instance with the newly replaced attributes. + + The new instance of the :class:`OverrideRule` with the new specified + attributes is returned. + """ + + if rule not in self._data.values(): + raise ValueError(f"The given rule is not present in {self}: {rule}") + + new_rule = replace(rule, **kwargs) + del self._data[rule.use] + self._data[new_rule.use] = new_rule + + return new_rule + + def search_overrides(self, **kwargs) -> List[OverrideRule]: + """Returns the list of :class:`OverrideRule` in the registry whose + attributes match all of the given keyword arguments. + + Sample usage: + + ..
code-block:: python + + rules = registry.search_overrides(use=ProductPO, instead_of=GenericPO) + print(len(rules)) # 1 + + """ + + # Short-circuit operation if "use" is the only search param used, since + # we know that it's being used as the dict key. + if set(["use"]) == kwargs.keys(): + rule = self._data.get(kwargs["use"]) + if rule: + return [rule] + return [] + + getter = attrgetter(*kwargs.keys()) + + def matcher(rule: OverrideRule): + attribs = getter(rule) + if not isinstance(attribs, tuple): + attribs = tuple([attribs]) + if list(attribs) == list(kwargs.values()): + return True + return False + + results = [] + for rule in self.get_overrides(): + if matcher(rule): + results.append(rule) + return results + + def remove_overrides(self, *rules: OverrideRule) -> None: + """Given a list of :class:`OverrideRule`, remove them from the Registry. + + Non-existing rules won't pose an issue as no errors will be raised. + """ + + for rule in rules: + if rule.use in self._data: + del self._data[rule.use] + + def __repr__(self): + return f"PageObjectRegistry(name='{self.name}')" + + +# When the `PageObjectRegistry` class is instantiated, it records itself to +# this pool so that all instances can easily be accessed later on. +registry_pool: Dict[str, PageObjectRegistry] = {} + + +def walk_module(module: str) -> Iterable: + """Return all modules from a module recursively. + + Note that this will import all the modules and submodules. It returns the + provided module as well. 
+ """ + + def onerror(err): + raise err # pragma: no cover + + spec = importlib.util.find_spec(module) + if not spec: + raise ImportError(f"Module {module} not found") + mod = importlib.import_module(spec.name) + yield mod + if spec.submodule_search_locations: + for info in pkgutil.walk_packages( + spec.submodule_search_locations, f"{spec.name}.", onerror + ): + mod = importlib.import_module(info.name) + yield mod + + +def consume_modules(*modules: str) -> None: + """A quick wrapper for :func:`~.walk_module` to efficiently consume the + generator and recursively load all packages/modules. + + This function is essential to be run before attempting to retrieve all + :meth:`~.PageObjectRegistry.handle_urls` annotations from :class:`~.PageObjectRegistry` + to ensure that they are properly acknowledged by importing them in runtime. + + Let's take a look at an example: + + .. code-block:: python + + # my_page_obj_project/load_rules.py + + from web_poet import default_registry, consume_modules + + consume_modules("other_external_pkg.po", "another_pkg.lib") + rules = default_registry.get_overrides() + + For this case, the ``List`` of :class:`~.OverrideRule` are coming from: + + - ``my_page_obj_project`` `(since it's the same module as the file above)` + - ``other_external_pkg.po`` + - ``another_pkg.lib`` + + So if the ``default_registry`` had other ``@handle_urls`` annotations outside + of the packages/modules listed above, then the :class:`~.OverrideRule` won't + be returned. + + .. note:: + + :meth:`~.PageObjectRegistry.get_overrides` provides a shortcut for this + using its ``consume`` parameter. Thus, the code example above could be + shortened even further by: + + .. 
code-block:: python + + from web_poet import default_registry + + rules = default_registry.get_overrides(consume=["other_external_pkg.po", "another_pkg.lib"]) + """ + + for module in modules: + gen = walk_module(module) + + # Inspired by itertools recipe: https://docs.python.org/3/library/itertools.html + # Using a deque() results in a slight performance improvement over list(). + deque(gen, maxlen=0)