
Advanced pagination #433

Merged
merged 11 commits on Mar 11, 2024
53 changes: 51 additions & 2 deletions docs/faq.md
@@ -55,7 +55,7 @@ tab for developer documentation._

**Q:** Is there documentation detailing the internal workings of the code?

**A:** _Absolutely! For an in-depth look at the CRIPT Python SDK code,
consult the [GitHub repository wiki internal documentation](https://github.com/C-Accel-CRIPT/Python-SDK/wiki)._

---
@@ -84,7 +84,7 @@ A GitHub account is required._

**Q:** Where can I find the release notes for each SDK version?

**A:** _The release notes can be found on our
[CRIPT Python SDK repository releases section](https://github.com/C-Accel-CRIPT/Python-SDK/releases)_

---
@@ -97,6 +97,55 @@ the code is written to get a better grasp of it?

There you will find documentation on everything from how our code is structured,
how we aim to write our documentation, CI/CD, and more._

---

**Q:** What can I do when my `api.search(...)` fails with a `cript.nodes.exception.CRIPTJsonDeserializationError` or a similar error?

**A:** _There is a workaround for this. Sometimes CRIPT contains nodes formatted in a way that the Python SDK does not understand. You can disable the automatic conversion of the API response into SDK nodes and handle the raw JSON yourself. Here is an example of how to achieve this:_
```python
# Create the API object in a with statement; this assumes the host, API token, and storage token are set as environment variables
with cript.API() as api:
    # Find the paginator object, which is a Python iterator over the search results.
    materials_paginator = api.search(node_type=cript.Material, search_mode=cript.SearchModes.NODE_TYPE)
# Usually you would do
# `materials_list = list(materials_paginator)`
# or
# for node in materials_paginator:
# #do node stuff
# But now we want more control over the iteration to ignore failing node decoding.
# And store the result in a list of valid nodes
materials_list = []
# We use a while True loop to iterate over the results
while True:
# This first try catches, when we reach the end of the search results.
# The `next()` function raises a StopIteration exception in that case
try:
# First we try to convert the current response into a node directly
try:
material_node = next(materials_paginator)
# But if that fails, we catch the exception from CRIPT
except cript.CRIPTException as exc:
# In case of failure, we disable the auto_load_function temporarily
materials_paginator.auto_load_nodes = False
# And only obtain the unloaded node JSON instead
material_json = next(materials_paginator)
# Here you can inspect and manually handle the problem.
# In the example, we just print it and ignore it otherwise
print(exc, material_json)
else:
# After a valid node is loaded (try block didn't fail)
# we store the valid node in the list
materials_list += [material_node]
finally:
# No matter what happened, for the next iteration we want to try to obtain
# an auto loaded node again, so we reset the paginator state.
materials_paginator.auto_load_nodes = True
except StopIteration:
# If next() of the paginator indicates an end of the search results, break the loop
break
```


_We also aim to provide type hints, comments, and docstrings for all the code we work on, so it is clear and easy for anyone reading it to understand._

_If all else fails, contact us on our [GitHub Repository](https://github.com/C-Accel-CRIPT/Python-SDK)._
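The error-tolerant loop from the FAQ answer above can be exercised without a CRIPT server using a small stand-in iterator. The `FlakyPaginator` class below is hypothetical — it only mimics the real `Paginator`'s `auto_load_nodes` flag so the try/except/else/finally pattern can be demonstrated end to end:

```python
# Hypothetical stand-in for a paginator whose node decoding can fail.
# Not part of the CRIPT SDK; it only illustrates the iteration pattern.
class FlakyPaginator:
    def __init__(self, items):
        self._items = list(items)
        self._pos = 0
        self.auto_load_nodes = True

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        # Simulate a node the SDK cannot decode: raise unless auto-loading is disabled.
        if item == "broken" and self.auto_load_nodes:
            raise ValueError("cannot decode node")
        self._pos += 1
        return item

paginator = FlakyPaginator(["a", "broken", "b"])
loaded = []
while True:
    try:
        try:
            node = next(paginator)
        except ValueError as exc:
            # Decoding failed: disable auto-loading and retry to get the raw item.
            paginator.auto_load_nodes = False
            raw = next(paginator)
            print(exc, raw)
        else:
            loaded.append(node)
        finally:
            # Always restore auto-loading for the next iteration.
            paginator.auto_load_nodes = True
    except StopIteration:
        break

print(loaded)  # ['a', 'b']
```

Note how the `finally` block resets the flag regardless of outcome, and how `StopIteration` from the inner `next()` call propagates to the outer `except` to end the loop cleanly.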
28 changes: 18 additions & 10 deletions src/cript/api/api.py
@@ -59,10 +59,10 @@ class API:
_host: str = ""
_api_token: str = ""
_storage_token: str = ""
_http_headers: dict = {}
_db_schema: Optional[DataSchema] = None
_api_prefix: str = "api"
_api_version: str = "v1"
_api_request_session: Union[None, requests.Session] = None

# trunk-ignore-begin(cspell)
# AWS S3 constants
@@ -213,9 +213,6 @@ def __init__(self, host: Union[str, None] = None, api_token: Union[str, None] =
self._api_token = api_token # type: ignore
self._storage_token = storage_token # type: ignore

# add Bearer to token for HTTP requests
self._http_headers = {"Authorization": f"Bearer {self._api_token}", "Content-Type": "application/json"}

# set a logger instance to use for the class logs
self._init_logger(default_log_level)

@@ -322,6 +319,14 @@ def connect(self):
CRIPTConnectionError
raised when the host does not give the expected response
"""

# Establish a requests session object
if self._api_request_session:
self.disconnect()
self._api_request_session = requests.Session()
# add Bearer to token for HTTP requests
self._api_request_session.headers = {"Authorization": f"Bearer {self._api_token}", "Content-Type": "application/json"}

# As a form to check our connection, we pull and establish the data schema
try:
self._db_schema = DataSchema(self)
@@ -344,6 +349,10 @@ def disconnect(self):

For manual connection: nested API object are discouraged.
"""
# Disconnect request session
if self._api_request_session:
self._api_request_session.close()

# Restore the previously active global API (might be None)
global _global_cached_api
_global_cached_api = self._previous_global_cached_api
@@ -946,7 +955,7 @@ def delete_node_by_uuid(self, node_type: str, node_uuid: str) -> None:

self.logger.info(f"Deleted '{node_type.title()}' with UUID of '{node_uuid}' from CRIPT API.")

def _capsule_request(self, url_path: str, method: str, api_request: bool = True, headers: Optional[Dict] = None, timeout: int = _API_TIMEOUT, **kwargs) -> requests.Response:
def _capsule_request(self, url_path: str, method: str, api_request: bool = True, timeout: int = _API_TIMEOUT, **kwargs) -> requests.Response:
"""Helper function that capsules every request call we make against the backend.

Please *always* use this method instead of `requests` directly.
@@ -971,9 +980,6 @@ def _capsule_request(self, url_path: str, method: str, api_request: bool = True,
additional keyword arguments that are passed to `request.request`
"""

if headers is None:
headers = self._http_headers

url: str = self.host
if api_request:
url += f"/{self.api_prefix}/{self.api_version}"
@@ -985,10 +991,12 @@
pre_log_message += "..."
self.logger.debug(pre_log_message)

response: requests.Response = requests.request(url=url, method=method, headers=headers, timeout=timeout, **kwargs)
if self._api_request_session is None:
raise CRIPTAPIRequiredError
response: requests.Response = self._api_request_session.request(url=url, method=method, timeout=timeout, **kwargs)
post_log_message: str = f"Request return with {response.status_code}"
if self.extra_api_log_debug_info:
post_log_message += f" {response.raw}"
post_log_message += f" {response.text}"
self.logger.debug(post_log_message)

return response
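The session handling introduced in this diff (create the session on `connect`, close it on `disconnect`, refuse requests when no session exists) can be sketched as a minimal standalone class. `MiniAPI` and `NotConnectedError` are illustrative names for this sketch, not part of the SDK:

```python
import requests

class NotConnectedError(RuntimeError):
    """Illustrative stand-in for the SDK's CRIPTAPIRequiredError."""
    pass

class MiniAPI:
    def __init__(self, api_token):
        self._api_token = api_token
        self._session = None

    def connect(self):
        # Re-establish the session idempotently, as connect() does in the diff.
        if self._session:
            self.disconnect()
        self._session = requests.Session()
        # Set the Bearer token once on the session instead of per request.
        self._session.headers.update({
            "Authorization": f"Bearer {self._api_token}",
            "Content-Type": "application/json",
        })

    def disconnect(self):
        if self._session:
            self._session.close()
            self._session = None

    def request(self, url, method="GET", **kwargs):
        # Fail loudly if used outside connect()/disconnect(), mirroring
        # the CRIPTAPIRequiredError check in _capsule_request.
        if self._session is None:
            raise NotConnectedError("call connect() first")
        return self._session.request(url=url, method=method, **kwargs)

api = MiniAPI("secret-token")
api.connect()
assert api._session.headers["Authorization"] == "Bearer secret-token"
api.disconnect()
```

Reusing one `requests.Session` keeps TCP connections alive across calls and centralizes the auth headers, which is why the diff moves the header dict off the `API` class and onto the session.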
11 changes: 10 additions & 1 deletion src/cript/api/exceptions.py
@@ -68,6 +68,8 @@ class CRIPTAPIRequiredError(CRIPTException):
"""
## Definition
Exception to be raised when the API object is requested, but no cript.API object exists yet.
Also make sure to use it in a context manager (`with cript.API() as api:`) or to call
`connect` and `disconnect` manually.

The CRIPT Python SDK relies on a cript.API object for creation, validation, and modification of nodes.
The cript.API object may be explicitly called by the user to perform operations to the API, or
@@ -83,14 +85,21 @@ class CRIPTAPIRequiredError(CRIPTException):
my_token = "123456" # To use your token securely, please consider using environment variables

my_api = cript.API(host=my_host, token=my_token)
my_api.connect()
# Your code
my_api.disconnect()
```
"""

def __init__(self):
pass

def __str__(self) -> str:
error_message = "cript.API object is required for an operation, but it does not exist." "Please instantiate a cript.API object to continue." "See the documentation for more details."
error_message = (
"cript.API object is required for an operation, but it does not exist. "
"Please instantiate a cript.API object and connect it to the API, for example with a context manager `with cript.API() as api:`, to continue. "
"See the documentation for more details."
)

return error_message

99 changes: 91 additions & 8 deletions src/cript/api/paginator.py
@@ -1,4 +1,4 @@
from json import JSONDecodeError
import json
from typing import Dict, Union
from urllib.parse import quote

@@ -33,6 +33,9 @@ class Paginator:
_current_position: int
_fetched_nodes: list
_number_fetched_pages: int = 0
_limit_page_fetches: Union[int, None] = None
_num_skip_pages: int = 0
auto_load_nodes: bool = True

@beartype
def __init__(
@@ -102,11 +105,15 @@ def _fetch_next_page(self) -> None:
None
"""

# Check if we are supposed to fetch more pages
if self._limit_page_fetches and self._number_fetched_pages >= self._limit_page_fetches:
raise StopIteration

# Composition of the query URL
temp_url_path: str = self._url_path
temp_url_path += f"/?q={self._query}"
if self._initial_page_number is not None:
temp_url_path += f"&page={self._initial_page_number + self._number_fetched_pages}"
temp_url_path += f"&page={self.page_number}"
self._number_fetched_pages += 1

response: requests.Response = self._api._capsule_request(url_path=temp_url_path, method="GET")
@@ -119,8 +126,11 @@ def _fetch_next_page(self) -> None:
# if converting API response to JSON gives an error
# then there must have been an API error, so raise the requests error
# this is to avoid bad indirect errors and make the errors more direct for users
except JSONDecodeError:
response.raise_for_status()
except json.JSONDecodeError as json_exc:
try:
response.raise_for_status()
except Exception as exc:
raise exc from json_exc

# handling both cases in case there is result inside of data or just data
try:
@@ -137,8 +147,10 @@
if api_response["code"] != 200:
raise APIError(api_error=str(response.json()), http_method="GET", api_url=temp_url_path)

node_list = load_nodes_from_json(current_page_results)
self._fetched_nodes += node_list
# Here we only load the JSON into the temporary results.
# This delays error checking, and allows users to disable auto node conversion
json_list = current_page_results
self._fetched_nodes += json_list

def __next__(self):
if self._current_position >= len(self._fetched_nodes):
@@ -147,14 +159,85 @@ def __next__(self):
raise StopIteration
self._fetch_next_page()

self._current_position += 1
try:
return self._fetched_nodes[self._current_position - 1]
next_node_json = self._fetched_nodes[self._current_position - 1]
except IndexError: # This is not a random access iteration.
# So if fetching a next page wasn't enough to get the index inbound,
# The iteration stops
raise StopIteration

if self.auto_load_nodes:
return_data = load_nodes_from_json(next_node_json)
else:
return_data = next_node_json

# Advance position last, so if an exception occurs, for example when
# node decoding fails, we do not advance, and users can try again without decoding
self._current_position += 1

return return_data

def __iter__(self):
self._current_position = 0
return self

@property
def page_number(self) -> Union[int, None]:
"""Obtain the current page number the paginator is fetching next.

Returns
-------
int
positive number of the next page this paginator is fetching.
None
if no page number is associated with the pagination
"""
if self._initial_page_number is not None:
return self._num_skip_pages + self._initial_page_number + self._number_fetched_pages

@beartype
def limit_page_fetches(self, max_num_pages: Union[int, None]) -> None:
"""Limit pagination to a maximum number of pages.

This can be used for very large searches with the paginator, so the search can be split into
smaller portions.

Parameters
----------
max_num_pages: Union[int, None],
positive integer with maximum number of page fetches.
or None, indicating unlimited number of page fetches are permitted.
"""
self._limit_page_fetches = max_num_pages

def skip_pages(self, skip_pages: int) -> int:
"""Skip pages in the pagination.

Warning this function is advanced usage and may not produce the results you expect.
In particular, every search is different, even if we search for the same values there is
no guarantee that the results are in the same order. (And results can change if data is
added or removed from CRIPT.) So if you break up your search with `limit_page_fetches` and
`skip_pages` there is no guarantee that it is the same as one continuous search.
If the paginator associated search does not accept pages, there is no effect.

Parameters
----------
skip_pages:int
Number of pages that the paginator skips now before fetching the next page.
The parameter is added to the internal state, so repeated calls skip more pages.

Returns
-------
int
The number this paginator is skipping. Internal skip count.

Raises
------
RuntimeError
If the total number of skipped pages is negative.
"""
num_skip_pages = self._num_skip_pages + skip_pages
if num_skip_pages < 0:
raise RuntimeError(f"Invalid number of skipped pages. The total number of pages skipped is negative {num_skip_pages}, requested to skip {skip_pages}.")
self._num_skip_pages = num_skip_pages
return self._num_skip_pages
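The page bookkeeping this diff adds to `Paginator` (`skip_pages`, `limit_page_fetches`, and the `page_number` property) can be sketched standalone. `PageWindow` is an illustrative name; the arithmetic mirrors the diff, where the next page number is skip count + initial page + pages already fetched:

```python
# Illustrative stand-in for Paginator's page-window bookkeeping; the real
# logic lives in src/cript/api/paginator.py.
class PageWindow:
    def __init__(self, initial_page_number=0):
        self._initial_page_number = initial_page_number
        self._number_fetched_pages = 0
        self._num_skip_pages = 0
        self._limit_page_fetches = None

    @property
    def page_number(self):
        if self._initial_page_number is None:
            return None  # pagination without page numbers
        return self._num_skip_pages + self._initial_page_number + self._number_fetched_pages

    def limit_page_fetches(self, max_num_pages):
        self._limit_page_fetches = max_num_pages

    def skip_pages(self, skip_pages):
        # Skips accumulate across calls; a negative total is rejected.
        num_skip_pages = self._num_skip_pages + skip_pages
        if num_skip_pages < 0:
            raise RuntimeError(f"Total skipped pages is negative: {num_skip_pages}")
        self._num_skip_pages = num_skip_pages
        return self._num_skip_pages

    def fetch_next_page(self):
        # Stop once the fetch limit is exhausted, as _fetch_next_page does.
        if self._limit_page_fetches is not None and self._number_fetched_pages >= self._limit_page_fetches:
            raise StopIteration
        page = self.page_number
        self._number_fetched_pages += 1
        return page

w = PageWindow()
w.skip_pages(3)
w.limit_page_fetches(3)
pages = []
try:
    while True:
        pages.append(w.fetch_next_page())
except StopIteration:
    pass
print(pages, w.page_number)  # [3, 4, 5] 6
```

This also explains the assertion in the updated test: after skipping 3 pages and fetching 3 more, `page_number` ends at 6.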
23 changes: 20 additions & 3 deletions tests/api/test_search.py
@@ -24,11 +24,28 @@ def test_api_search_node_type(cript_api: cript.API) -> None:

# test search results
assert isinstance(materials_paginator, Paginator)
materials_list = list(materials_paginator)
materials_paginator.skip_pages(3)
materials_paginator.limit_page_fetches(3)
materials_list = []
while True:
try:
try:
material_node = next(materials_paginator)
except cript.CRIPTException as exc:
materials_paginator.auto_load_nodes = False
material_json = next(materials_paginator)
print(exc, material_json)
else:
materials_list += [material_node]
finally:
materials_paginator.auto_load_nodes = True
except StopIteration:
break

# Assert that we paginated more than one page
assert materials_paginator._current_page_number > 0
assert materials_paginator.page_number == 6
assert len(materials_list) > 5
first_page_first_result = materials_list[0]["name"]
first_page_first_result = materials_list[0].name
# just checking that the word has a few characters in it
assert len(first_page_first_result) > 3
