GA4GH-wide pagination syntax #29

jb-adams · 2021-05-10T19:00:19Z

Cloud WS is looking to incorporate a harmonized, GA4GH-wide pagination syntax into their specs. This is most pertinent for DRS, as we want to update the spec to allow bulk object request and hence paginated results.

Pagination is currently handled by a number of GA4GH specs, but there is not necessarily interop between them. A goal would be to converge upon a minimal common pagination schema that existing specs can make use of, that the DRS spec will then uptake if approved.

@dglazer @ianfore

jb-adams · 2021-05-10T19:14:18Z

Currently, I'm only aware of the following GA4GH specs that incorporate pagination:

looping in @mcupak and @denis-yuen from Data Connect (DC) and TRS, respectively. Miro and Denis, would you be interested in contributing to a pagination schema that could be used across multiple GA4GH specs?

denis-yuen · 2021-05-10T19:40:35Z

Yup, thanks for the tag.
TRS has pagination for one endpoint https://github.com/ga4gh/tool-registry-service-schemas/blob/2.0.0/openapi/openapi.yaml#L209 based on github pagination https://docs.github.com/en/rest/guides/traversing-with-pagination

We have other endpoints which have grown in usage to the point that we need to add pagination to them as well so definitely would like to keep an eye on this conversation and/or contribute

dglazer · 2021-05-10T19:49:08Z

As another input to the discussion, here are Google's best practices for API pagination: https://cloud.google.com/apis/design/design_patterns#list_pagination

mcupak · 2021-05-10T21:05:38Z

Thanks @jb-adams! We also deferred a decision on pagination to a future release of the Service Registry so as to start small and not have to commit to a particular opinionated way of doing it, so that would be a good candidate for taking this up when it's ready as well.

Some more details on pagination in Data Connect. To sum up, we go very minimal - we don't specify a way of requesting pages on the client side, we simply say the server can provide a next page URL, which can be any URL (relative or absolute). That way implementations choose their own pagination strategy, only relying on clients resolving the links. The next page URL is null/empty on the last page, and it's the only field we use in pagination.

denis-yuen · 2021-05-10T21:51:51Z

FWIW, that seems to match TRS. We provide next_page, last_page, and self_link and a client could easily only use the first for pagination

jb-adams · 2021-05-11T12:46:55Z

I was thinking this too, that currently DC's next_page_url and TRS's next_page are more or less analogous. TRS is therefore a bit more opinionated in that it provides:

last_page to jump to the end of results
self_link for regenerating the same result set later
current_offset, current_limit to capture the number of records per paged result.

One idea for a path to a unified schema could involve:

using TRS as the starting pagination schema
DC for now only supports the next_page attribute, potentially supporting the other attributes later
DC has redundant fields: next_page and next_page_url that contain the same URL. A deprecation notice is given that next_page_url will eventually be dropped

Thoughts on the above?

I also see that DC's pagination is given in the response body, whereas TRS's is in the header. I don't think this should be a blocker for the development of the schema itself, as it could be usable in either location (although represented differently either as JSON or HTTP headers, and this will affect the way it's written in the OpenAPI)

dglazer · 2021-05-11T14:22:27Z

Generally feels good. A few notes:

I like next_page as the crux of the answer -- simple and sufficient. I suggest we get the security folks to weigh in on the anti-tampering thoughts in the Google best practices, and whether that might lead us to prefer a next_page_token instead of a full URL; I don't have strong personal opinions
does last_page in TRS mean the previous page or the final page? Either way, I suggest making it optional for new APIs. (It feels only occasionally useful for clients, and potentially inefficient for servers to implement.)
I think self-link should also be optional, unless I'm missing some compelling use case

We should also standardize request parameters:

I think we do want to require an equivalent of the TRS limit parameter
I'm not a fan of requiring the TRS offset parameter (although it's a fine option if some APIs want it)

On syntax:

it would be nice if all the pagination fields were named consistently with each other (e.g. page_size, next_page_url). But I don't feel strongly about the benefits of that vs, the benefits of compatibility with existing TRS and DC
I prefer next_page_url (or next_page_token) to next_page for clarity. But again I don't feel strongly about the benefits of that vs. backwards compatibility.
We should take a stand on "in the header" vs. "in the body". My strong-ish preference is "in the body" with all the other response info, but that's mostly for aesthetic vs. hardcore technical reasons. Are there technical arguments either way?

denis-yuen · 2021-05-11T15:22:39Z

the previous page or the final page? Either way, I suggest making it optional for new APIs

The final page. I agree that this could be optional since it is a just a convenience to quickly reach the last page (which may be more or less useful depending on how the pages is sorted). Ditto with self link.

We should take a stand on "in the header" vs. "in the body". My strong-ish preference is "in the body" with all the other response info, but that's mostly for aesthetic vs. hardcore technical reasons. Are there technical arguments either way?

I was under the impression that having these kind of links in the header is a standard. Searching around, this argues that GitHub is following rfc 5988

In this matter several choices to implement that appear :

Wrap the items within an envelope in the HTTP body, this envelope will also contain field having information on the pagination. You can find variation on the exact implementation some choose json-api, some their home brewed solution. This is probably the most common approach as this does require less HTTP knowledge.
And it is likely preferred by pure Javascript clients as there’s no need to peek outside the body.

Use the HTTP response with pagination meta-data being placed in the headers. That’s the path chosen by Github in their v3 API. That’s also this option that this blog post will explore.

More specifically the Github API is using the Link header form the HTTP standard. It allows express relations and how to reach them for a resource.

https://blog.arkey.fr/2017/10/02/pagination_hateoas_link.en/

dglazer · 2021-05-11T16:56:00Z

Thanks @denis-yuen for the pointer to GitHub API pagination -- I hadn't seen that before, but like it as a pattern, and saying "we're the same as GitHub" is a good story. It has pretty much all the same features as TRS (so would be easy for any TRS server/client to adopt), but uses different syntax (which would be a temporary nuisance).

Mapping my previous opinions to the GitHub syntax, every API that supports pagination:

Response
- MUST be able to return a Link: header
- MUST include links of type rel="next" in that header
- MAY include links of type rel="[first|prev|last]" in that header
Request
- SHOULD support a per_page request parameter (which the servers should take as a request not a constraint)
- MAY support a page request parameter for random access

ianfore · 2021-05-12T15:10:44Z

A couple of quick comments

The DataConnect use of pagination is using it as a way of polling if results are ready yet. i.e. you follow "next page" URLs until data appears. Is that a common paging need? Or a different use case?
Beacon is another GA4GH API with a current approach to pagination that should be brought into the discussion

jb-adams · 2021-05-12T19:12:38Z

I'm supportive of the Link header pagination as described in Github docs. It's a bigger departure from how existing GA4GH specs handle it, which presents a challenge in retrofitting current specs. Though any solution we come up with will involve a retrofit, so better to put in the hard work now to align ourselves with internet standards/best practices.

| The DataConnect use of pagination is using it as a way of polling if results are ready yet. i.e. you follow "next page" URLs until data appears. Is that a common paging need? Or a different use case?

The Link header could potentially serve as a way to poll. If search results are not ready, then a Link header with a single link (i.e. rel=next) could be returned, allowing the client to poll the same endpoint. This might not be considered a best practice as it could be seen as overloading the header.

| Beacon is another GA4GH API with a current approach to pagination that should be brought into the discussion

looping @jrambla into this discussion. Jordi, we are discussing if we can converge around an approach to pagination that will be picked up by multiple GA4GH specs.

denis-yuen · 2021-05-12T19:30:24Z

I haven't looked at it in a while, but I vaguely recall some of the difference for TRS might be due to the fact that we were working in swagger 2.0 as opposed to openapi 3.0 now. But I'm commenting totally without double-checking

dglazer · 2021-05-12T19:54:04Z

Re polling support in DataConnect -- if I'm reading it right, I don't see a problem. The DC spec uses an empty data payload ("a TableData object whose data property is a zero element array") to mean "this page of results isn't ready yet", and clients are expected to use the next_page_url to try again later. It seems fine to keep that convention, but have clients use the rel="next" link instead.

I agree with @ianfore that it's a different use case, and I don't expect most APIs will need to support long-running queries, but I don't mind the slight overloading for those that do, like DataConnect. (And it feels equally clean in either the current syntax or the newly proposed syntax.)

jrambla · 2021-07-04T06:12:26Z

Beacon approach is using a naive old version of pagination, that we haven't revised in several years. For Beacon v2 we are reviewing it. Therefore I would not consider the current approach (skip and limit) as a something to preserve.

The discussion above triggers a couple of fundamental question for me:

How much the APIs must rely on the "transport" protocol? If the answer is that we can use the API request and response structures over other mechanisms like message queues, then relying on HTTP specific stuff (headers) should not be the right approach. As mention above citing a blog, one design principal could be to allow a client to simply tear off the wrapper and rely in the request or response payload. Combining information in both places breaks that design.
Complete protocol agnosticism would not be possible for aspects like security (e.g. OAuth token in the header), but could this dependencies be constrained just to "transversal" aspects (security, logging, etc.)?
Is pagination something "transversal" or "core" for the specification? The fact that we are discussing it here means that is transversal or just a common issue that all of us need to address?
Which design principles are "too advanced" for non-expert implementers? (for Beacon that is important as we aim for humble projects/institutions too, with novice to intermediate developers).

My personal preference in Beacon has been for making it as much as protocol agnostic as possible.
But both the HTTP header and the next_page token options have been raised by different Beacon team members.
I'm still in need of reading more about pagination for performance aspects to have a more informed opinion.

jb-adams · 2021-08-10T12:46:34Z

Notes from the FASP hackathon on this issue can be found here under item 2B

dglazer · 2021-08-10T13:01:55Z

Thanks for the notes. I can't tell current status -- do we feel better or worse about the proposed GitHub-like approach, and (assuming better) how close are we to having a votable PR?

…

On Tue, Aug 10, 2021 at 5:46 AM Jeremy Adams ***@***.***> wrote: Notes from the FASP hackathon on this issue can be found here <https://docs.google.com/document/d/1h1EP8Z8SyTS_T4pwMVO_SfvEynHw2U7hNk8I8qhp4OA/edit#heading=h.1n4wzfw8085v> under item 2B — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#29 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABMOBXJBYWJT53PGPDP5TTTT4ENTLANCNFSM44R7YPJQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email> .

dglazer · 2022-06-22T19:53:43Z

Checking in, since this was mentioned at the Steering Committee just now as ongoing, but I haven't seen any updates in almost a year -- what's the current status, and how close are we to having a votable PR?

denis-yuen · 2022-06-23T14:31:38Z

FWIW, probably not very useful, but I haven't seen/heard anything on my end.

jrambla · 2022-06-26T05:46:13Z

Me neither, althogh, as you can deduct, we have something already in the approved Beacon v2

mamanambiya · 2022-06-28T11:59:34Z

@dglazer, you are right; nothing much has happened since the last discussions before last year's FASP hackathon. We will table this issue for discussion at TASC's upcoming meeting. It looks like the GitHub-like approach is what most would go for.

patmagee · 2022-11-28T22:53:15Z

In most of the REST API's that I have used, Pagination is not strictly in the header but gets incorporated directly in the response body. As an implementor, putting the pagination in the headers adds a bit of a conundrum since we often fetch data from a remote serve and pass the response body back to a UI or client. At this stage all of the context around headers has been stripped out and is no longer accessible to the end users. Additionally, placing pagination outside of the Response Body means that any internal handling of that response needs to pass more information around then just translating the body from json into some sort of Object.

Another issue is one of practical integration with other tools ie curl. IMO it's a lot easier to pipe the JSON content of a request to something like jq and check to see if there is any pagination information, then it is to try to extract the pagination info from the curl output. Not to mention headers are not even displayed by default, so the user would not necessarily know the response has any pagination at all

I am definitely in favour of having all information related to pagination directly in the response body. It is easy to interpret, it is disjoint from the actual original request (ie you can look at a JSON alone and determine what the next page should be), and IMO its going to be the easiest thing for people to implement on both sides of the implementation spectrum (IE servers and clients)

I am a bit concerned about the Github Header based approach of pagination

patmagee · 2022-11-28T23:06:40Z

It is also worth asking the question of who the end user is, and how we expect they will interact with our apis.

GitHub's approach is a tried and true method as evidenced by its usage in the GitHub api. That being said, I don't really know anyone other then application developers who've used the GitHub api directly.
If we anticipate end users to interact with the apis directly ie through curl, then not having info in the request body makes their life harder

jrambla · 2022-11-29T07:52:27Z

I'm still more keen on the body option, as stated in my comment on Jul 4, 2021.
Making the protocols agnostic to the transport/serialization make them more "portable" and robust.

patmagee · 2022-11-29T14:10:19Z

@jrambla That is a good point. I think Data Connect demonstrated the power of using pagination within the body by allowing you to create a fully functioning API WITH pagination strictly by using static files plopped in a bucket.

denis-yuen · 2022-11-29T16:14:15Z

I was under the impression that having these kind of links in the header is a standard. Searching around, this argues that GitHub is following rfc 5988

FWIW, did a bit more looking around. It looks like rfc 5988 is followed by rfc 8288 which has a few more examples.
https://www.rfc-editor.org/rfc/rfc5988#section-5.5
https://www.rfc-editor.org/rfc/rfc8288#section-3.5

Can also search github for examples of libraries and code that use them, for example https://github.com/search?q=rfc8288&type=commits

If we do end up rolling our own standard in the response body, I would suggest keeping the link headers so that we remain compatible even if it is redundant.

patmagee · 2022-11-29T17:25:20Z

@denis-yuen thanks for the links to the rfc. I think it's 100% acceptable to allow both approaches, but just decide that there is a minimum approach that all apis should have in regards to pagination.

andrewyatz · 2023-07-07T09:13:07Z

Draft proposal will be distributed asap for Monday's TASC meeting

andrewyatz · 2024-11-25T12:48:51Z

Closing this as #55 will ensure the change is moved to approved

jb-adams self-assigned this May 10, 2021

jrambla mentioned this issue May 12, 2021

Pagination ga4gh-beacon/specification-v2#26

Closed

denis-yuen mentioned this issue Jul 28, 2022

Use more standard paging for GET /runs endpoint ga4gh/workflow-execution-service-schemas#159

Open

mamanambiya pinned this issue Apr 19, 2023

mamanambiya assigned andrewyatz and mamanambiya and unassigned jb-adams Jun 5, 2023

This was referenced Jul 17, 2023

Pagination support ga4gh/task-execution-schemas#193

Open

feat: add total_runs in response of GET /runs endpoint ga4gh/workflow-execution-service-schemas#183

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GA4GH-wide pagination syntax #29

GA4GH-wide pagination syntax #29

jb-adams commented May 10, 2021

jb-adams commented May 10, 2021

denis-yuen commented May 10, 2021 •

edited

Loading

dglazer commented May 10, 2021

mcupak commented May 10, 2021

denis-yuen commented May 10, 2021

jb-adams commented May 11, 2021

dglazer commented May 11, 2021

denis-yuen commented May 11, 2021

dglazer commented May 11, 2021

ianfore commented May 12, 2021 •

edited

Loading

jb-adams commented May 12, 2021

denis-yuen commented May 12, 2021

dglazer commented May 12, 2021

jrambla commented Jul 4, 2021

jb-adams commented Aug 10, 2021

dglazer commented Aug 10, 2021 via email

dglazer commented Jun 22, 2022

denis-yuen commented Jun 23, 2022

jrambla commented Jun 26, 2022

mamanambiya commented Jun 28, 2022

patmagee commented Nov 28, 2022

patmagee commented Nov 28, 2022

jrambla commented Nov 29, 2022

patmagee commented Nov 29, 2022

denis-yuen commented Nov 29, 2022

patmagee commented Nov 29, 2022

andrewyatz commented Jul 7, 2023

andrewyatz commented Nov 25, 2024

GA4GH-wide pagination syntax #29

GA4GH-wide pagination syntax #29

Comments

jb-adams commented May 10, 2021

jb-adams commented May 10, 2021

denis-yuen commented May 10, 2021 • edited Loading

dglazer commented May 10, 2021

mcupak commented May 10, 2021

denis-yuen commented May 10, 2021

jb-adams commented May 11, 2021

dglazer commented May 11, 2021

denis-yuen commented May 11, 2021

dglazer commented May 11, 2021

ianfore commented May 12, 2021 • edited Loading

jb-adams commented May 12, 2021

denis-yuen commented May 12, 2021

dglazer commented May 12, 2021

jrambla commented Jul 4, 2021

jb-adams commented Aug 10, 2021

dglazer commented Aug 10, 2021 via email

dglazer commented Jun 22, 2022

denis-yuen commented Jun 23, 2022

jrambla commented Jun 26, 2022

mamanambiya commented Jun 28, 2022

patmagee commented Nov 28, 2022

patmagee commented Nov 28, 2022

jrambla commented Nov 29, 2022

patmagee commented Nov 29, 2022

denis-yuen commented Nov 29, 2022

patmagee commented Nov 29, 2022

andrewyatz commented Jul 7, 2023

andrewyatz commented Nov 25, 2024

denis-yuen commented May 10, 2021 •

edited

Loading

ianfore commented May 12, 2021 •

edited

Loading