/posts endpoint pagination missing posts between two paginations #76

Open
fireattack opened this issue Feb 21, 2022 · 6 comments

@fireattack

fireattack commented Feb 21, 2022

I use https://api.tumblr.com/v2/blog/animage/posts?api_key={mykey}&before=1369715530

Response:

{
"meta": {
"status": 200,
"msg": "OK"
},
"response": {
"blog": { ... },
"posts": [ ... (20 posts) ],
"total_posts": 21918,
"_links": {
"next": {
"href": "/v2/blog/animage/posts?before=1369715530&tumblelog=animage&page_number=VCAMEqDRFTJ4JBptzxiHLECCfqCm_3KxgsPdWS_xW1eHdU55SVNqU3lYMHhGUVFCSUxwSjVFU0Fyb0dwT0RWdklIeTVoZHBlTVdMUzdQWnR5S3JzbERoWW1RWFV6MG5IZm50MWxWaUFTQWpoK2xKN2ZhbThoemNJU3ZGb29jNWdORk5SaFJYY1U4Mk5HcXBmSGV3ZGlicllqTTd2Y3NsZ1dCc2laTUdMcTU3ajQxeGk1Mm1uV1pFb05RQ05EdVlPYXVkRCtuaFlUUkJvcnVjUzZGNFFNOHVPV2hkTWViMU0wdnJQWTkrU3A4ZHdMT0hHWEJBd25tUnpXNHpYaE1uVHpZVUJKMTdzSEp2OXNRUHdJYXVkdjIzNlgrUjVPL3VBenpsVFRiVDlscW1HUkdvUWNaUTRsZUpSTVUxR3lZaEZScllhK2VXbmtBVlc4Yyt6MzdBc2Q0RHZUWkIzejNLMmtOMkl2R1kybzJ1VU5SakcweVlWTGlXU0IzdzZmYWtlUTZqR1JsYTgwUStSVDRhcUdRblgvNmN0MnprZHB6d3A2",
"method": "GET",
"query_params": {
"before": "1369715530",
"tumblelog": "animage",
"page_number": "VCAMEqDRFTJ4JBptzxiHLECCfqCm_3KxgsPdWS_xW1eHdU55SVNqU3lYMHhGUVFCSUxwSjVFU0Fyb0dwT0RWdklIeTVoZHBlTVdMUzdQWnR5S3JzbERoWW1RWFV6MG5IZm50MWxWaUFTQWpoK2xKN2ZhbThoemNJU3ZGb29jNWdORk5SaFJYY1U4Mk5HcXBmSGV3ZGlicllqTTd2Y3NsZ1dCc2laTUdMcTU3ajQxeGk1Mm1uV1pFb05RQ05EdVlPYXVkRCtuaFlUUkJvcnVjUzZGNFFNOHVPV2hkTWViMU0wdnJQWTkrU3A4ZHdMT0hHWEJBd25tUnpXNHpYaE1uVHpZVUJKMTdzSEp2OXNRUHdJYXVkdjIzNlgrUjVPL3VBenpsVFRiVDlscW1HUkdvUWNaUTRsZUpSTVUxR3lZaEZScllhK2VXbmtBVlc4Yyt6MzdBc2Q0RHZUWkIzejNLMmtOMkl2R1kybzJ1VU5SakcweVlWTGlXU0IzdzZmYWtlUTZqR1JsYTgwUStSVDRhcUdRblgvNmN0MnprZHB6d3A2"
}
}
}
}
}

The 20 posts are:

51536827916
51536811310
51536793729
51536775352
51525122664
51524712324
51524341873
51506609103
51506266688
51505880360
51474731651
51474675594
51474708523
51474721571
51474660142
51474642573
51463776796
51460980995
51460952445
51460951105

I then request the next page using the provided query_params:

https://api.tumblr.com/v2/blog/animage/posts?api_key={mykey}&before=1369715530&page_number=VCAMEqDRFTJ4JBptzxiHLECCfqCm_3KxgsPdWS_xW1eHdU55SVNqU3lYMHhGUVFCSUxwSjVFU0Fyb0dwT0RWdklIeTVoZHBlTVdMUzdQWnR5S3JzbERoWW1RWFV6MG5IZm50MWxWaUFTQWpoK2xKN2ZhbThoemNJU3ZGb29jNWdORk5SaFJYY1U4Mk5HcXBmSGV3ZGlicllqTTd2Y3NsZ1dCc2laTUdMcTU3ajQxeGk1Mm1uV1pFb05RQ05EdVlPYXVkRCtuaFlUUkJvcnVjUzZGNFFNOHVPV2hkTWViMU0wdnJQWTkrU3A4ZHdMT0hHWEJBd25tUnpXNHpYaE1uVHpZVUJKMTdzSEp2OXNRUHdJYXVkdjIzNlgrUjVPL3VBenpsVFRiVDlscW1HUkdvUWNaUTRsZUpSTVUxR3lZaEZScllhK2VXbmtBVlc4Yyt6MzdBc2Q0RHZUWkIzejNLMmtOMkl2R1kybzJ1VU5SakcweVlWTGlXU0IzdzZmYWtlUTZqR1JsYTgwUStSVDRhcUdRblgvNmN0MnprZHB6d3A2
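
For anyone scripting this step, here is a minimal sketch of building the next-page URL from a response shaped like the one above. The function name `next_page_url` is made up for illustration. It forwards everything in query_params (including tumblelog, which the hand-written URL above leaves out) and adds api_key back, on the assumption that the API accepts both. It performs no network request.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.tumblr.com"  # host taken from the request URL in this report


def next_page_url(response_json, api_key):
    """Return the URL for the next page, or None when no next link is present.

    Hypothetical helper: it only reads the _links.next structure shown in
    the response above and does not contact the API.
    """
    links = response_json.get("response", {}).get("_links", {})
    next_link = links.get("next")
    if not next_link:
        return None
    # query_params does not carry the api_key, so add it back.
    params = dict(next_link["query_params"])
    params["api_key"] = api_key
    # Keep only the path part of href; the query string is rebuilt from params.
    path = next_link["href"].split("?", 1)[0]
    return f"{API_ROOT}{path}?{urlencode(params)}"
```
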

This returns the next 20 posts:

51450914667
51439618771
51439359109
51439149930
51438948039
51423347876
51423005393
51422645443
51395109963
51395092840
51395074190
51395021193
51395004555
51394920534
51370177352
51370171419
51370163416
51370155416
51348879503
51348709311

However, two posts that fall between these two pages are missing:

51460950351
51460948891

This is definitely caused by pagination: if you change the before timestamp to a slightly earlier value, the missing posts appear just fine.

E.g. with before changed to 1369642261:

51460952445 <--- posts from call #1 (second-last)
51460951105 <--- posts from call #1 (last)
51460950351 <--- missing post #1
51460948891 <--- missing post #2
51450914667 <--- posts from call #2 (first)
51439618771
51439359109
51439149930
51438948039
51423347876
51423005393
51422645443
51395109963
51395092840
51395074190
51395021193
51395004555
51394920534
51370177352
51370171419
@fireattack
Author

Some notes:

This only seems to happen with older posts.

before isn't required to trigger the bug. When I discovered this, I was just pulling all the posts from this account, and the bug started to appear once the pagination reached posts from around 2013.

@nightpool

I've been noticing something kind of like this on the Android app too—if I navigate to a specific post on somebody's blog, it's started showing me posts both before and after that one, instead of an API response that starts with that post.

@cyle
Member

cyle commented Feb 22, 2022

Hello! Looks like you found a bug. The underlying issue is that those "missing" posts and the last post(s) on the "page" all share the same user-set publish time. Because we store that timestamp in second-precision Unix epoch format, this can happen when using that as a pagination boundary. We'll see what we can do here to mitigate this.
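
A toy model of the boundary problem described above, for illustration only. The post IDs are taken from the report, but the timestamps are invented, and `page_before` is a made-up stand-in that assumes the cursor is a strict "older than this second" comparison. It is not Tumblr's actual query.

```python
# (post_id, publish_timestamp), newest first. IDs come from the report;
# the timestamps are invented so that four posts share the same second.
POSTS = [
    (51460952445, 1000),
    (51460951105, 1000),
    (51460950351, 1000),
    (51460948891, 1000),
    (51450914667, 990),
]


def page_before(posts, before, limit):
    """Return up to `limit` posts whose timestamp is strictly less than `before`."""
    older = [p for p in posts if p[1] < before]
    return older[:limit]


# First page: the page size cuts through the group of same-second posts.
first = page_before(POSTS, before=1001, limit=2)

# Next page: the cursor is the timestamp of the last post returned, so every
# remaining post with that same timestamp is excluded by the strict comparison.
second = page_before(POSTS, before=first[-1][1], limit=2)

seen = {post_id for post_id, _ in first + second}
missing = [post_id for post_id, _ in POSTS if post_id not in seen]
print(missing)
```

In this model the two posts that share a timestamp with the end of the first page never show up on any page, which matches the symptom in the report.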

@cyle
Member

cyle commented Mar 8, 2022

@fireattack Just wanted to let you know that unfortunately we aren't going to be able to pick up fixing this anytime soon. Fixing this would most likely require us to refactor how pagination works for blog posts, which is a pretty large undertaking. To mitigate this for now, I'd suggest reaching out to that blog to ask them not to set a post's publish date to the same value as other posts on their blog. Even adding 1 second between each timestamp would fix this issue.

@fireattack
Author

Take your time, but I'd say it has to be fixed eventually since it's a pretty serious bug.

To make it clear, the good ol' offset works just fine with these "simultaneous" posts:

https://api.tumblr.com/v2/blog/animage/posts?api_key={your_api_key}&offset=7989 gives 51450914667~51348709311 and https://api.tumblr.com/v2/blog/animage/posts?api_key={your_api_key}&offset=8009 gives 51460950351~51370155416, without missing any in between.

And this was exactly what the official data['response']['_links']['next']['query_params'] used to give about half a year ago. I don't know why it was changed for the worse.

Another workaround I used is to query again with before set to the last (earliest) post's timestamp + 1. The results will overlap with the previous page, so you have to de-duplicate manually.
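
A rough sketch of that workaround, under a few assumptions: `fetch_page` is a hypothetical callable you supply (it takes a before value, or None for the newest page, and returns a list of post dicts, newest first), each post dict has `id` and `timestamp` keys, and before is treated as "strictly older than".

```python
def fetch_all(fetch_page, before=None):
    """Collect all posts, re-querying at (last timestamp + 1) and de-duplicating by id.

    `fetch_page` is a caller-supplied function (hypothetical here) that
    returns one page of post dicts for a given `before` value.
    """
    seen = set()
    result = []
    while True:
        posts = fetch_page(before)
        new = [p for p in posts if p["id"] not in seen]
        if not new:
            # Empty page, or nothing we have not already collected.
            break
        for post in new:
            seen.add(post["id"])
            result.append(post)
        # Ask again from one second after the oldest post on this page, so
        # posts sharing that second are returned again instead of skipped.
        before = posts[-1]["timestamp"] + 1
    return result
```

One known limitation of this sketch: if a single second holds at least a full page of posts, the re-query returns only posts already seen and the loop stops early.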

(Anyway, I think you can understand that to ask random strangers to change their posting behavior is unrealistic, especially considering these posts are 8 years old :P ).

@cyle
Member

cyle commented Mar 9, 2022

> I don't know why it was changed for the worse.

Using offset in a database query frequently does not scale well in production at Tumblr's size; we've been moving away from that in favor of more index-friendly strategies for a while.
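
For readers unfamiliar with the trade-off, one common index-friendly approach is keyset pagination with a tie-breaker: the cursor is the pair (timestamp, id) rather than the timestamp alone, so posts sharing a second are still ordered unambiguously. The sketch below uses an in-memory SQLite table with a made-up schema (`posts`, `id`, `ts`). It illustrates the general technique only and says nothing about how Tumblr's backend works or plans to work.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory table of (id, ts) rows with a matching index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, ts INTEGER NOT NULL)")
    conn.execute("CREATE INDEX posts_ts_id ON posts (ts DESC, id DESC)")
    conn.executemany("INSERT INTO posts (id, ts) VALUES (?, ?)", rows)
    return conn


def page_older(conn, cursor, limit):
    """Return up to `limit` (id, ts) rows strictly older than `cursor`.

    `cursor` is None for the first page, otherwise the (ts, id) pair of the
    last row already returned.
    """
    if cursor is None:
        sql = "SELECT id, ts FROM posts ORDER BY ts DESC, id DESC LIMIT ?"
        args = (limit,)
    else:
        ts, post_id = cursor
        # Older second, or same second with a smaller id (the tie-breaker).
        sql = (
            "SELECT id, ts FROM posts "
            "WHERE ts < ? OR (ts = ? AND id < ?) "
            "ORDER BY ts DESC, id DESC LIMIT ?"
        )
        args = (ts, ts, post_id, limit)
    return conn.execute(sql, args).fetchall()


def all_ids(conn, limit):
    """Walk every page and return the post ids in the order they were served."""
    ids, cursor = [], None
    while True:
        rows = page_older(conn, cursor, limit)
        if not rows:
            return ids
        ids.extend(row[0] for row in rows)
        last_id, last_ts = rows[-1]
        cursor = (last_ts, last_id)
```
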

> (Anyway, I think you can understand that to ask random strangers to change their posting behavior is unrealistic, especially considering these posts are 8 years old :P ).

Definitely, it's frustrating, but if someone chooses to use the platform in a counterintuitive way, there are likely to be counterintuitive consequences. 😅
