
Fetch All Replies v2 - Service Edition #44

Merged · sneakers-the-rat merged 22 commits into main from fetch-all-replies-service on Oct 20, 2024

Conversation

@sneakers-the-rat (Collaborator) commented Apr 16, 2024

Fix: #43

More discussion/description at #43

Current status:

  • the core of the feature is complete, but we need to restructure it a bit for the frontend and to mitigate one abuse scenario
  • 24-09-29: refactored into separate services/workers to prevent abuse and touch as little of upstream as we can. will split into at least 2 PRs: this one will be for the backend service, next one will be for integrating with frontend. Still need to write tests before we merge this.

Sub-branches

TODO:

  • Rate limiting: we only want to get some replies, but there isn't any pagination in the context endpoint. Need to figure out some way of limiting to a certain n statuses, or only doing a request once (i.e. if the next TODO item is completed, this one should be too).
  • Filtering statuses that are already present in the DB: I can't tell yet if the workers and services called to fetch the replies will return a cached version of a local status rather than doing the remote request. If they do, then we're good; otherwise, we need to make it happen.
  • Anti-abuse: someone just raised a good point in a DM: we need to add a global recursion guard to protect against an adversarial instance that keeps manufacturing a broad and deep thread tree that we would otherwise keep requesting forever. Eventually the global rate limits would kick in and it would basically be us DDoSing them as much as them DDoSing us, but we do need some cap across the whole operation rather than just at each level (a rough sketch of what that could look like follows this list).
    • Split into separate worker: I think this means that we need to make a totally different worker instead of patching into the existing ones, which will actually be good. I just did some work to clean things up for integrating with the distribution system, so it shouldn't be too hard.
  • Tests:
    • Get replies on fake other server on loading context endpoint (testing context endpoint deferred to next PR)
    • Only get replies once
    • break infinite trees - limits on fetch are tested but not adversarially
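
For the anti-abuse item, here is a minimal sketch of the kind of global cap I mean. The class, constant, and key names are placeholders, not necessarily what will land in the PR; it assumes Mastodon's Redisable concern for the redis connection.

```ruby
# Hypothetical sketch of a global fetch cap shared across a whole thread tree.
# One counter per root status, so an adversarial instance manufacturing an
# arbitrarily broad/deep tree hits the cap instead of keeping us recursing.
class FetchAllRepliesGuard
  include Redisable # Mastodon concern that exposes a `redis` connection

  MAX_REPLIES_PER_ROOT = 1_000 # placeholder cap across the whole operation

  def initialize(root_status_id)
    @key = "fetch_all_replies:#{root_status_id}:count"
  end

  # Every worker in the tree calls this before fetching another status.
  def allow_fetch?
    count = redis.incr(@key)
    redis.expire(@key, 6.hours.to_i) if count == 1
    count <= MAX_REPLIES_PER_ROOT
  end
end
```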

Pushed to next PR:

  • Web UI: push posts as we get them into the web ui
  • backend: timelines:thread:{root_id} streaming channel (almost done)
  • frontend: subscribe to streaming channel when in a DetailedStatus component
  • Tests:
    • stream to ux

@sneakers-the-rat (Collaborator, Author) commented:

Working on this again. Leaving it in a bit of a rough state because I need to go home and don't have my remote workflow down quite yet, but we are successfully getting the JSON-LD collection of replies. Just need a few more tweaks to handle pagination, write tests, and then it'll be time to clean up and do a little perf work before we roll it out for live testing.

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from 01afb6e to 1696221 on September 19, 2024 02:45
@sneakers-the-rat marked this pull request as ready for review on September 19, 2024 04:08
@sneakers-the-rat (Collaborator, Author) commented:

ok this works now :)

We:

  • only fetch all replies when explicitly asked to by a logged-in user via the context endpoint - in the UI this means opening the post in the detailed view (i.e. going to the post's URL, not just seeing it as a reply)
  • debounce attempts at fetching all replies so it only happens at most every 30 minutes (see the sketch after this list)
  • paginate through the first layer of all replies that the source server will give us, collecting reply URIs, then launch a worker to go fetch them. This is done via ActivityPub rather than the masto API (which is what the old version of fetch all replies used), so it should work with all instance types.
  • each of those replies then spawns a secondary worker with some respectful delay, so we recursively get the whole tree of replies. This delay can be long-ish, which sort of sucks for small threads, but is necessary for large threads, and I didn't want to forward all the state of knowing how deep in the tree we are.
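
A very rough sketch of the entry point described above; the debounce key and method names here are illustrative, not necessarily what the branch uses.

```ruby
# Illustrative sketch only: check a cache-based debounce, then hand the thread
# off to the background worker that walks the ActivityPub replies collections.
FETCH_ALL_REPLIES_DEBOUNCE = 30.minutes

def queue_fetch_all_replies!(status)
  key = "fetch_all_replies:debounce:#{status.id}"
  return if Rails.cache.exist?(key) # already fetched within the last 30 minutes

  Rails.cache.write(key, true, expires_in: FETCH_ALL_REPLIES_DEBOUNCE)
  FetchAllRepliesWorker.perform_async(status.id)
end
```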

The only thing that is not working is that you have to refresh the page to get the new replies, rather than them being inserted as they are fetched. Going to see if I can make that happen now, though.

@@ -47,6 +48,22 @@ def context
ancestors_limit = ANCESTORS_LIMIT
descendants_limit = DESCENDANTS_LIMIT
descendants_depth_limit = DESCENDANTS_DEPTH_LIMIT
else
unless @status.local? && !@status.should_fetch_replies?
json_status = fetch_resource(@status.uri, true, current_account)
A reviewer commented:

This will take a while and possibly fail due to networking issues. What if you extracted this section to a worker, and returned HTTP 206 Partial Response to signal to the client that there might be more replies coming in a future response?

@sneakers-the-rat (Collaborator, Author) replied:

I was thinking of adding the replies URI to the status model to avoid the network request in the context endpoint, because we have this at the time any status is created via ActivityPub (and should be able to fill it in for local posts without too much difficulty). I didn't, mostly just to get it working, but also because I figured that adding new columns would be discouraged. It does make sense to have, though, since this at least doubles the number of requests per status we have to do.

thx for the tip on 206, i didn't even think of that

The reviewer replied:

Saving the replies collection URI kind of makes sense, but it's also going to add gigabytes of data to the statuses table (on mastodon.social this table has roughly 701,977,540 rows; an example replies URI has 73 characters, which works out to roughly 51 GB of data if all the rows had it from the start), so if it can be avoided, maybe it should be.
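
For reference, the back-of-envelope arithmetic behind that figure:

```ruby
# rows x bytes-per-URI, ignoring per-row overhead and any compression
701_977_540 * 73 # => 51_244_360_420 bytes, i.e. roughly 51 GB
```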

More on HTTP 206: we do a similar trick when uploading media. The GET /api/v1/media/:id endpoint returns HTTP 206 while the media is processing, and the client keeps polling it until it gets HTTP 200; this is how the "Processing..." progress bar state is implemented in the composer.
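
For illustration, the suggested shape for the context endpoint could look something like the sketch below; `fetch_all_replies_in_progress?` is a hypothetical helper, not existing code.

```ruby
# Hypothetical sketch of the suggestion above: move the slow remote fetching to
# a background worker and use HTTP 206 to tell the client that more replies may
# still arrive, so it can poll the context endpoint again later.
def context
  # ... load ancestors/descendants from the local database as usual ...

  if fetch_all_replies_in_progress?(@status) # hypothetical helper
    render json: @context, serializer: REST::ContextSerializer, status: 206
  else
    render json: @context, serializer: REST::ContextSerializer
  end
end
```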

@sneakers-the-rat (Collaborator, Author) replied:

Would it help the case for storing the replies URI if we stored it as a relative URI vs. the status URI? We could hide that behind a getter/setter so it always appears as the full URL everywhere else, but on disk we could store it cheaply.

There are probably only a few dozen unique relative paths from a status to its replies (realistically I would bet hundreds, given whacky single-user homebrew impls). Not sure if the db compresses at all, or if there's a cheap way to store discrete values as an int:string map in Postgres, but that would probably be orders of magnitude cheaper (rough sketch below).
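
A minimal sketch of the getter/setter idea; the `replies_path` column name is hypothetical.

```ruby
# Illustrative only: store just the path relative to the status URI in a
# hypothetical `replies_path` column and rebuild the full URL on read.
class Status < ApplicationRecord
  def replies_uri
    return if replies_path.blank?

    uri + replies_path # e.g. "https://example.org/statuses/1" + "/replies"
  end

  def replies_uri=(full_uri)
    self.replies_path = full_uri&.delete_prefix(uri)
  end
end
```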

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from 9bdd7c7 to 33b5964 on September 30, 2024 06:27
@sneakers-the-rat (Collaborator, Author) commented:

Committing some recent changes from the refactor needed to mitigate the infinite tree abuse scenario. I need to go to bed for the night so I didn't test the changes, and I'm pulling this back into draft status until I can write tests. The strategy here will be to make this the minimal deployable "fetch all replies sans abuse" backend patch, and then follow up with the frontend changes for streaming replies as they are fetched in another PR.

Some notes for now, before I can write the full docs on this:

  • Split out into separate workers and services: it started getting to the point where we needed to pass around too much state, which amounted to a switch between all_replies / not all_replies, so I tried to keep any improvements to the existing services/workers that might be relevant to them, and split the fetch-all stuff off into its own services/worker.
  • Don't recursively fetch; instead have one top-level worker gathering all the URIs for statuses that are being fetched. Made a parallel version of FetchRepliesService that only needs status URIs rather than the full Status object, so that we don't need to do complicated batching/async awaiting to only try to get replies once a status is fully fetched. This does the "expansion" part of finding all the statuses we need to fetch synchronously (within the FetchAllRepliesWorker, which is itself called async), and then all the actual fetching is async. That sets us up to integrate with the frontend by hooking into the distribution workers. (Rough shape sketched after this list.)
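
Roughly, the split looks like the sketch below. The worker name is the one mentioned above; the service name and the use of upstream's FetchReplyWorker for the per-status fetches are assumptions for illustration.

```ruby
# Illustrative sketch of the worker/service split described above.
class FetchAllRepliesWorker
  include Sidekiq::Worker

  def perform(root_status_id)
    root = Status.find(root_status_id)

    # Synchronous "expansion": walk the ActivityPub replies collections and
    # gather the URIs of replies we still need (URIs only, no Status objects).
    uris = FetchAllRepliesService.new.call(root)

    # Asynchronous fetching: hand each URI to the normal fetch machinery.
    uris.each { |uri| FetchReplyWorker.perform_async(uri) }
  end
end
```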

@sneakers-the-rat marked this pull request as draft on September 30, 2024 06:48
@sneakers-the-rat (Collaborator, Author) commented Oct 13, 2024

Alright, I'm marking this as ready for review. I'm genuinely not sure what tests we should/can have here, but I want to get this deployed ASAP and start testing it live. If I can get an independent confirmation that it does indeed fetch all replies, I'm going to merge this and punt rigorous testing to later revisions if we want to pull upstream. I think this is reasonably clean. Moving out of draft status.

Recall that the plan is to split the PR into two: this one, which adds the service that fetches all the replies when a context endpoint is requested, and another that will do the work of injecting the fetched replies into the web UI and handle the changes to the context API endpoint to reflect that.

edit: added some more tests for the worker; that should be mostly good?

@sneakers-the-rat marked this pull request as ready for review on October 13, 2024 08:01
@sneakers-the-rat (Collaborator, Author) commented Oct 14, 2024

@Gargron I think this is in a reviewable state. It's not finished yet (splitting off the web UI/streaming/modifying the context API response to indicate fetching for a second PR), but when we finish this, if y'all would be interested in having it upstream, is there anything else you'd be looking for here? Tests, limits, etc.? I think I have it covered so the worker can't run off into infinity and handles fetch failures gracefully, but you obviously know way better what kind of stuff we'd need here.

I split it off into a new worker/service because it got to a point where I was mangling the existing service/workers and it felt like I was shoving too many unrelated things together, so hopefully that makes it easier to consider as something that might be mergeable. The only changes I made to the existing service were to improve pagination handling and to add an instance attribute for switching the filtering by matching domain in reply and OP. The behaviour can be globally turned on/off by an env variable flag that makes should_fetch_replies always false (sketched below).
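
For illustration, the global switch is roughly the following; the env var name and helper are placeholders, not necessarily what the branch uses.

```ruby
# Illustrative sketch only: with the (placeholder) env flag unset, the guard is
# always false, so the new fetch-all-replies path is never taken.
class Status < ApplicationRecord
  def should_fetch_replies?
    return false unless ENV['FETCH_ALL_REPLIES'] == 'true'

    fetch_all_replies_checks_pass? # stands in for the real per-status checks
  end
end
```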

edit: of course, no hurry or obligation; appreciate all the time you've given so far. We might merge this relatively soon, but we can open a draft PR against upstream if you can't get to it until after we do. The second PR will come from this branch, so it should be additive.

SELECT user_id,
ip,
max(used_at) AS used_at
SELECT t0.user_id,
@sneakers-the-rat (Collaborator, Author) commented on this diff:

Anyone know why Rails always wants to do this for me when I run a migration? I didn't do anything special to the config or anything, and it seems not to do anything, but it keeps happening.

@sneakers-the-rat (Collaborator, Author) commented Oct 14, 2024

There is one privacy concern here: as is, we include the current account in the on_behalf_of field when fetching the additional replies. This is so that we can identify requests that the requester should be able to see but that might be gated by other instances that have authorized fetch etc. turned on (I believe that's how that works; I haven't tested it). However, that does let other instances know that an account has requested a fetch, so in effect it provides "read receipts", which isn't exactly desirable. We may want to switch that behavior based on whether an account is discoverable or not, or remove it entirely. I'll test this in live federation to see if including the on_behalf_of does what I think it does, and if it doesn't, then we'll take it out.

edit: this turns out not to be true. When fetching replies, Mastodon will always use its internal instance representative account (Account.representative). When rendering a replies collection, Mastodon does not do any filtering for blocked/suspended accounts, and will only include public/unlisted replies, even if we were to sign a request for a reply on behalf of an account that should be able to see it. This makes sense: when someone makes a "followers only" post, they are doing so with the expected recipients at the time they make the post, so if you weren't a follower at that time, you shouldn't be able to backfill the reply later. Since this is the case, I've removed all signing on behalf of an individual account, and all requests are signed on behalf of the instance.
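
In code terms the change amounts roughly to the following; the exact call site varies, this just shows the before/after of the signing account.

```ruby
# Before: signed the fetch on behalf of the user who opened the thread,
# which leaks a kind of read receipt to the remote instance.
json_status = fetch_resource(@status.uri, true, current_account)

# After: everything is signed as the instance actor instead.
json_status = fetch_resource(@status.uri, true, Account.representative)
```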

@sneakers-the-rat (Collaborator, Author) commented:

Oop, I seem to have broken it while adding env vars. Hold up, I'll be able to fix it tonight.

@sneakers-the-rat (Collaborator, Author) commented:

OK, actually it wasn't broken, it was just a problem with my env. Added some more debug messages (I assume the log level of most prod servers isn't debug, so those should be free, right?) and was watching it while fetching a bunch of requests tonight. There are a lot of instances that will return nil for the ActivityPub collection; that's sorta weird, but in all the cases I saw we were passing the correct URI to the JSON-LD helper, so that's not a this-PR problem. Same thing with pagination of the AP collections: I just copied that from FetchFeaturedTagsCollectionService, so if that's busted it's a general "there should be a generic way to paginate through ActivityPub collections rather than reinventing it in every worker/service" problem rather than a this-PR problem.

I wanna merge this, see how it works, and keep iterating until we get to something we can pull upstream, sound gud?

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from aa9f4a2 to 8048524 on October 20, 2024 06:45
@sneakers-the-rat (Collaborator, Author) commented:

let's do it

@sneakers-the-rat merged commit 00d9cb9 into main on Oct 20, 2024
29 checks passed
@Gargron commented Oct 20, 2024

Feel free to submit the patch to the upstream repository and we'll try to review it as soon as we can.
