
Fetch All Replies v2 - Service Edition #44

Merged · sneakers-the-rat merged 22 commits into main from fetch-all-replies-service on Oct 20, 2024

Conversation

@sneakers-the-rat (Collaborator) commented Apr 16, 2024

Fix: #43

More discussion/description at #43

Current status:

  • the core of the feature is complete, but we need to restructure it a bit for the frontend and to mitigate one abuse scenario
  • 24-09-29: refactored into separate services/workers to prevent abuse and touch as little of upstream as we can. will split into at least 2 PRs: this one will be for the backend service, next one will be for integrating with frontend. Still need to write tests before we merge this.

Sub-branches

TODO:

  • Rate limiting: we only want to get some replies, but there isn't any pagination in the context endpoint. Need to figure out some way of limiting to a certain n statuses, or only doing a request once (i.e. if the next TODO item is completed, this one should be too).
  • Filtering statuses that are already present in the DB: I can't tell yet if the workers and services called to fetch the replies will return a cached version of a local status rather than doing the remote request. If they do, then we're good; otherwise, we need to make it happen.
  • Anti-abuse: someone just raised a good point in a DM: we need to add a global recursion guard to protect against an adversarial instance that keeps manufacturing a broad and deep thread tree that we would otherwise keep requesting forever. Eventually the global rate limits would kick in and it would basically be us DDoSing them as much as them DDoSing us, but we do need some cap across the whole operation rather than just at each level (a rough sketch of what that could look like follows this list).
    • Split into separate worker: I think this means that we need to make a totally different worker instead of patching into the existing ones, which will actually be good. I just did some work to clean things up for integrating with the distribution system, so it shouldn't be too hard.
  • Tests:
    • Get replies on fake other server on loading context endpoint (testing context endpoint deferred to next PR)
    • Only get replies once
    • break infinite trees - limits on fetch are tested but not adversarially
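
For the anti-abuse item, here is a minimal sketch of the kind of global cap I mean. The class, constant, and key names are placeholders, not necessarily what will land in the PR; it assumes Mastodon's Redisable concern for the redis connection.

```ruby
# Hypothetical sketch of a global fetch cap shared across a whole thread tree.
# One counter per root status, so an adversarial instance manufacturing an
# arbitrarily broad/deep tree hits the cap instead of keeping us recursing.
class FetchAllRepliesGuard
  include Redisable # Mastodon concern that exposes a `redis` connection

  MAX_REPLIES_PER_ROOT = 1_000 # placeholder cap across the whole operation

  def initialize(root_status_id)
    @key = "fetch_all_replies:#{root_status_id}:count"
  end

  # Every worker in the tree calls this before fetching another status.
  def allow_fetch?
    count = redis.incr(@key)
    redis.expire(@key, 6.hours.to_i) if count == 1
    count <= MAX_REPLIES_PER_ROOT
  end
end
```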

Pushed to next PR:

  • Web UI: push posts as we get them into the web ui
  • backend: timelines:thread:{root_id} streaming channel (almost done)
  • frontend: subscribe to streaming channel when in a DetailedStatus component
  • Tests:
    • stream to ux

@sneakers-the-rat (Collaborator, Author) commented:

Working on this again. Leaving it in a bit of a rough state because I need to go home and don't have my remote workflow down quite yet, but we are successfully getting the JSON-LD collection of replies. Just need a few more tweaks to handle pagination, write tests, and then it'll be time to clean up and do a little perf work before we roll it out for live testing.

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from 01afb6e to 1696221 on September 19, 2024 02:45
@sneakers-the-rat marked this pull request as ready for review on September 19, 2024 04:08
@sneakers-the-rat (Collaborator, Author) commented:

ok this works now :)

We:

  • only fetch all replies when explicitly asked to by a logged-in user via the context endpoint - in the UI this means opening the post in the detailed view (i.e. going to the post's URL, not just seeing it as a reply)
  • debounce attempts at fetching all replies so it only happens at most every 30 minutes (see the sketch after this list)
  • paginate through the first layer of all replies that the source server will give us, collecting reply URIs, then launch a worker to go fetch them. This is done via ActivityPub rather than the masto API (which is what the old version of fetch all replies used), so it should work with all instance types.
  • each of those replies then spawns a secondary worker with some respectful delay, so we recursively get the whole tree of replies. This delay can be long-ish, which sort of sucks for small threads, but is necessary for large threads, and I didn't want to forward all the state of knowing how deep in the tree we are.
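
A very rough sketch of the entry point described above; the debounce key and method names here are illustrative, not necessarily what the branch uses.

```ruby
# Illustrative sketch only: check a cache-based debounce, then hand the thread
# off to the background worker that walks the ActivityPub replies collections.
FETCH_ALL_REPLIES_DEBOUNCE = 30.minutes

def queue_fetch_all_replies!(status)
  key = "fetch_all_replies:debounce:#{status.id}"
  return if Rails.cache.exist?(key) # already fetched within the last 30 minutes

  Rails.cache.write(key, true, expires_in: FETCH_ALL_REPLIES_DEBOUNCE)
  FetchAllRepliesWorker.perform_async(status.id)
end
```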

The only thing that is not working is that you have to refresh the page to get the new replies, rather than them being inserted as they are fetched. Going to see if I can make that happen now, though.

@@ -47,6 +48,22 @@ def context
ancestors_limit = ANCESTORS_LIMIT
descendants_limit = DESCENDANTS_LIMIT
descendants_depth_limit = DESCENDANTS_DEPTH_LIMIT
else
unless @status.local? && !@status.should_fetch_replies?
json_status = fetch_resource(@status.uri, true, current_account)
A reviewer commented:

This will take a while and possibly fail due to networking issues. What if you extracted this section to a worker, and returned HTTP 206 Partial Response to signal to the client that there might be more replies coming in a future response?

@sneakers-the-rat (Collaborator, Author) replied:

I was thinking of adding the replies URI to the status model to avoid the network request in the context endpoint, because we have this at the time any status is created via ActivityPub (and should be able to fill it in for local posts without too much difficulty). I didn't, mostly just to get it working, but also because I figured that adding new columns would be discouraged. It does make sense to have, though, since this at least doubles the number of requests per status we have to do.

thx for the tip on 206, i didn't even think of that

The reviewer replied:

Saving the replies collection URI kind of makes sense, but it's also going to add gigabytes of data to the statuses table (on mastodon.social this table has roughly 701,977,540 rows; an example replies URI has 73 characters, which works out to roughly 51 GB of data if all the rows had it from the start), so if it can be avoided, maybe it should be.
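
For reference, the back-of-envelope arithmetic behind that figure:

```ruby
# rows x bytes-per-URI, ignoring per-row overhead and any compression
701_977_540 * 73 # => 51_244_360_420 bytes, i.e. roughly 51 GB
```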

More on HTTP 206: we do a similar trick when uploading media. The GET /api/v1/media/:id endpoint returns HTTP 206 while the media is processing, and the client keeps polling it until it gets HTTP 200; this is how the "Processing..." progress bar state is implemented in the composer.
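
For illustration, the suggested shape for the context endpoint could look something like the sketch below; `fetch_all_replies_in_progress?` is a hypothetical helper, not existing code.

```ruby
# Hypothetical sketch of the suggestion above: move the slow remote fetching to
# a background worker and use HTTP 206 to tell the client that more replies may
# still arrive, so it can poll the context endpoint again later.
def context
  # ... load ancestors/descendants from the local database as usual ...

  if fetch_all_replies_in_progress?(@status) # hypothetical helper
    render json: @context, serializer: REST::ContextSerializer, status: 206
  else
    render json: @context, serializer: REST::ContextSerializer
  end
end
```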

@sneakers-the-rat (Collaborator, Author) replied:

Would it help the case for storing the replies URI if we stored it as a relative URI vs. the status URI? We could hide that behind a getter/setter so it always appears as the full URL everywhere else, but on disk we could store it cheaply.

There are probably only a few dozen unique relative paths from a status to its replies (realistically I would bet hundreds, given whacky single-user homebrew impls). Not sure if the db compresses at all, or if there's a cheap way to store discrete values as an int:string map in Postgres, but that would probably be orders of magnitude cheaper (rough sketch below).
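
A minimal sketch of the getter/setter idea; the `replies_path` column name is hypothetical.

```ruby
# Illustrative only: store just the path relative to the status URI in a
# hypothetical `replies_path` column and rebuild the full URL on read.
class Status < ApplicationRecord
  def replies_uri
    return if replies_path.blank?

    uri + replies_path # e.g. "https://example.org/statuses/1" + "/replies"
  end

  def replies_uri=(full_uri)
    self.replies_path = full_uri&.delete_prefix(uri)
  end
end
```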

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from 9bdd7c7 to 33b5964 on September 30, 2024 06:27
@sneakers-the-rat (Collaborator, Author) commented:

Committing some recent changes from the refactor needed to mitigate the infinite tree abuse scenario. I need to go to bed for the night so I didn't test the changes, and I'm pulling this back into draft status until I can write tests. The strategy here will be to make this the minimal deployable "fetch all replies sans abuse" backend patch, and then follow up with the frontend changes for streaming replies as they are fetched in another PR.

Some notes for now, before I can write the full docs on this:

  • Split out into separate workers and services: it started getting to the point where we needed to pass around too much state, which amounted to a switch between all_replies / not all_replies, so I tried to keep any improvements to the existing services/workers that might be relevant to them, and split the fetch-all stuff off into its own services/worker.
  • Don't recursively fetch; instead have one top-level worker gathering all the URIs for statuses that are being fetched. Made a parallel version of FetchRepliesService that only needs status URIs rather than the full Status object, so that we don't need to do complicated batching/async awaiting to only try to get replies once a status is fully fetched. This does the "expansion" part of finding all the statuses we need to fetch synchronously (within the FetchAllRepliesWorker, which is itself called async), and then all the actual fetching is async. That sets us up to integrate with the frontend by hooking into the distribution workers. (Rough shape sketched after this list.)
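
Roughly, the split looks like the sketch below. The worker name is the one mentioned above; the service name and the use of upstream's FetchReplyWorker for the per-status fetches are assumptions for illustration.

```ruby
# Illustrative sketch of the worker/service split described above.
class FetchAllRepliesWorker
  include Sidekiq::Worker

  def perform(root_status_id)
    root = Status.find(root_status_id)

    # Synchronous "expansion": walk the ActivityPub replies collections and
    # gather the URIs of replies we still need (URIs only, no Status objects).
    uris = FetchAllRepliesService.new.call(root)

    # Asynchronous fetching: hand each URI to the normal fetch machinery.
    uris.each { |uri| FetchReplyWorker.perform_async(uri) }
  end
end
```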

@sneakers-the-rat marked this pull request as draft on September 30, 2024 06:48
@sneakers-the-rat (Collaborator, Author) commented Oct 13, 2024

Alright, I'm marking this as ready for review. I'm genuinely not sure what tests we should/can have here, but I want to get this deployed ASAP and start testing it live. If I can get an independent confirmation that it does indeed fetch all replies, I'm going to merge this and punt rigorous testing to later revisions if we want to pull upstream. I think this is reasonably clean. Moving out of draft status.

Recall that the plan is to split the PR into two: this one, which adds the service that fetches all the replies when a context endpoint is requested, and another that will do the work of injecting the fetched replies into the web UI and handle the changes to the context API endpoint to reflect that.

edit: added some more tests for the worker; that should be mostly good?

@sneakers-the-rat marked this pull request as ready for review on October 13, 2024 08:01
@sneakers-the-rat (Collaborator, Author) commented Oct 14, 2024

@Gargron I think this is in a reviewable state. It's not finished yet (splitting off the web UI/streaming/modifying the context API response to indicate fetching for a second PR), but when we finish this, if y'all would be interested in having it upstream, is there anything else you'd be looking for here? Tests, limits, etc.? I think I have it covered so the worker can't run off into infinity and handles fetch failures gracefully, but you obviously know way better what kind of stuff we'd need here.

I split it off into a new worker/service because it got to a point where I was mangling the existing service/workers and it felt like I was shoving too many unrelated things together, so hopefully that makes it easier to consider as something that might be mergeable. The only changes I made to the existing service were to improve pagination handling and to add an instance attribute for switching the filtering by matching domain in reply and OP. The behaviour can be globally turned on/off by an env variable flag that makes should_fetch_replies always false (sketched below).
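
For illustration, the global switch is roughly the following; the env var name and helper are placeholders, not necessarily what the branch uses.

```ruby
# Illustrative sketch only: with the (placeholder) env flag unset, the guard is
# always false, so the new fetch-all-replies path is never taken.
class Status < ApplicationRecord
  def should_fetch_replies?
    return false unless ENV['FETCH_ALL_REPLIES'] == 'true'

    fetch_all_replies_checks_pass? # stands in for the real per-status checks
  end
end
```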

edit: of course, no hurry or obligation; appreciate all the time you've given so far. We might merge this relatively soon, but we can open a draft PR against upstream if you can't get to it until after we do. The second PR will come from this branch, so it should be additive.

SELECT user_id,
ip,
max(used_at) AS used_at
SELECT t0.user_id,
@sneakers-the-rat (Collaborator, Author) commented on this diff:

Anyone know why Rails always wants to do this for me when I run a migration? I didn't do anything special to the config or anything, and it seems not to do anything, but it keeps happening.

@sneakers-the-rat (Collaborator, Author) commented Oct 14, 2024

There is one privacy concern here: as is, we include the current account in the on_behalf_of field when fetching the additional replies. This is so that we can identify requests that the requester should be able to see but that might be gated by other instances that have authorized fetch etc. turned on (I believe that's how that works; I haven't tested it). However, that does let other instances know that an account has requested a fetch, so in effect it provides "read receipts", which isn't exactly desirable. We may want to switch that behavior based on whether an account is discoverable or not, or remove it entirely. I'll test this in live federation to see if including the on_behalf_of does what I think it does, and if it doesn't, then we'll take it out.

edit: this turns out not to be true. When fetching replies, Mastodon will always use its internal instance representative account (Account.representative). When rendering a replies collection, Mastodon does not do any filtering for blocked/suspended accounts, and will only include public/unlisted replies, even if we were to sign a request for a reply on behalf of an account that should be able to see it. This makes sense: when someone makes a "followers only" post, they are doing so with the expected recipients at the time they make the post, so if you weren't a follower at that time, you shouldn't be able to backfill the reply later. Since this is the case, I've removed all signing on behalf of an individual account, and all requests are signed on behalf of the instance.
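
In code terms the change amounts roughly to the following; the exact call site varies, this just shows the before/after of the signing account.

```ruby
# Before: signed the fetch on behalf of the user who opened the thread,
# which leaks a kind of read receipt to the remote instance.
json_status = fetch_resource(@status.uri, true, current_account)

# After: everything is signed as the instance actor instead.
json_status = fetch_resource(@status.uri, true, Account.representative)
```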

@sneakers-the-rat (Collaborator, Author) commented:

Oop, I seem to have broken it while adding env vars. Hold up, I'll be able to fix it tonight.

@sneakers-the-rat (Collaborator, Author) commented:

OK, actually it wasn't broken, it was just a problem with my env. Added some more debug messages (I assume the log level of most prod servers isn't debug, so those should be free, right?) and was watching it while fetching a bunch of requests tonight. There are a lot of instances that will return nil for the ActivityPub collection; that's sorta weird, but in all the cases I saw we were passing the correct URI to the JSON-LD helper, so that's not a this-PR problem. Same thing with pagination of the AP collections: I just copied that from FetchFeaturedTagsCollectionService, so if that's busted it's a general "there should be a generic way to paginate through ActivityPub collections rather than reinventing it in every worker/service" problem rather than a this-PR problem.

I wanna merge this, see how it works, and keep iterating until we get to something we can pull upstream, sound gud?

@sneakers-the-rat force-pushed the fetch-all-replies-service branch from aa9f4a2 to 8048524 on October 20, 2024 06:45
@sneakers-the-rat (Collaborator, Author) commented:

let's do it

@sneakers-the-rat merged commit 00d9cb9 into main on Oct 20, 2024
29 checks passed
@Gargron commented Oct 20, 2024

Feel free to submit the patch to the upstream repository and we'll try to review it as soon as we can.
