From 4f477c42679a78c8478556f4d257ea1fec35eab5 Mon Sep 17 00:00:00 2001 From: Philip de Nier Date: Tue, 17 Sep 2024 17:36:12 +0100 Subject: [PATCH] adr: Add ADR for query option to filter flow segment get_urls --- docs/README.md | 2 + docs/adr/0023-filter-segment-get-urls.md | 78 ++++++++++++++++++++++++ 2 files changed, 80 insertions(+) create mode 100644 docs/adr/0023-filter-segment-get-urls.md diff --git a/docs/README.md b/docs/README.md index c7b3704..0122349 100644 --- a/docs/README.md +++ b/docs/README.md @@ -30,6 +30,8 @@ For more information on how we use ADRs, see [here](./adr/README.md). | [0018](./adr/0018-restrict-direct-source-modification.md) | Restrict direct Source modification | | [0019](./adr/0019-consolidate-modified-updated-terms.md) | Rename `modified_by` properties in Source and Flow schemas to `updated_by` | | [0020](./adr/0020-version-signalling.md) | Improving the signalling of the supported API version in implementations | +| [0021](./adr/0021-storage-label-format.md) | Guidance for use of `get_urls` labels on Flow Segments | +| [0023](./adr/0023-filter-segment-get-urls.md) | Add query option to filter Flow Segment `get_urls` by `label` | \* Note: ADR 0004a was the unintended result of a number clash in the early development of TAMS which wasn't caught before publication diff --git a/docs/adr/0023-filter-segment-get-urls.md b/docs/adr/0023-filter-segment-get-urls.md new file mode 100644 index 0000000..21a8d95 --- /dev/null +++ b/docs/adr/0023-filter-segment-get-urls.md @@ -0,0 +1,78 @@ +--- +status: "proposed" +--- +# Query Option to Filter get_urls + +## Context and Problem Statement + +Each Flow Segment returned by the `/flows//segment` endpoint includes a `get_urls` propery that may include (labelled) URLs for fetching media objects. +The calculation of pre-signed URLs for example can result in an not insignificant increased request time. +Clients do not need the `get_urls` if they don't intend to fetch the media objects. +This ADR proposes adding a query option to filter the URLs that need to be included by a TAMS. + +It was found that calculating a pre-signed URL for fetching a media object takes around 0.35 milliseconds when using the Python [boto3](https://github.com/boto/boto3) library. +That means a request to get 5000 Flow Segments will include around 1.75 seconds overhead to calculate the pre-signed URLs. +This was observed in a request to a TAMS which took around 4 seconds to return Flow Segments with pre-signed URLs and only around 2 seconds to return the same Flow Segments without pre-signed URLs. + +The [boto3](https://github.com/boto/boto3) method to generate pre-signed URLs could also be optimised to reduce the time taken. +It was found that the older deprecated [boto](https://github.com/boto/boto) library took around 0.14 milliseconds to calculate a pre-signed URL, a 60% speed reduction when compared to [boto3](https://github.com/boto/boto3). +There are likely even more optimisations that could be made. + +The pre-signed URL calculation may use a highly optimised implementation that avoids the large time overhead. +This would still result in larger response sizes because the pre-signed URLs tend to be quite large as they include signatures, keys and tokens. + +It is difficult to predict what the expected number of flow segments will be in practice. +If it is low in all cases then including or not including pre-signed URLs doesn't make much difference. +If it is unknown then having an option to avoid the pre-signed calculation overhead may be useful in some cases. + +## Considered Options + +* Option 1: Always return all `get_urls` +* Option 2: Add a query option to return none or all `get_urls` +* Option 3: Add a simple query option to filter `get_urls` based on the `label` +* Option 4: Add more complex query option(s) to filter `get_urls` + +## Decision Outcome + +Chosen option: "Option 3: Add a simple query option to filter `get_urls` based on the `label`", because this covers the requirement to disable calculation of pre-signed URLs and at the same time allows other URLs to be retained. + +### Implementation + +Implemented in [PR #88](https://github.com/bbc/tams/pull/88). + +## Pros and Cons of the Options + +### Option 1: Always return all `get_urls` + +This is the current state. + +* Bad, because TAMS is doing work to calculate pre-signed URLs even though clients may not need them + +### Option 2: Add a query option to return none or all `get_urls` + +This adds a boolean option named `include_get_urls` that if set to `false` results in no `get_urls` in the response and therefore no pre-signed URLs are calculated by TAMS. + +* Good, because it allows clients to indicate that `get_urls` are not required +* Bad, because some TAMS implementations may provide different types of URLs and a client does not have the option to filter out the ones that cause increased request time whilst retaining ones that don't + +### Option 3: Add a simple query option to filter `get_urls` based on the `label` + +This extends [Option 2](#option-2-add-a-query-option-to-return-none-or-all-get_urls) to add a (comma-separated) list option named `accept_get_urls` that specifies the `labels` associated with URLS to include in the response. +Omitting `accept_get_urls` will result in all URLs in the response. +Setting `accept_get_urls` to an empty string will result in no URLs in the response. +If `accept_get_urls` includes a `default` label then URLs with label `default` or no label are included in the response. + +* Good, because it allows clients to indicate that `get_urls` are not required +* Good, because clients can retain URLs that don't cause increased request times +* Neutral, because clients may want more complex filters based on what is available + +### Option 4: Add more complex query option(s) to filter `get_urls` + +This changes [Option 3](#option-3-add-a-simple-query-option-to-filter-get_urls-based-on-the-label) to allow more complex filtering based on what URLs are available. +E.g. include a URL for direct access to the object if available and otherwise a pre-signed URL. + +* Good, because it allows clients to indicate that `get_urls` are not required +* Good, because clients can retain URLs that don't cause increased request times +* Neutral, because clients can use more complex filters based on what is available +* Neutral, because it isn't clear whether there is a requirement for more complex filters +* Neutral, because [0021](./0021-storage-label-format.md) thus far only proposes an App note for guidance on the use of `get_urls` labels on flow segments