Make initial delay selection more sane #84

Closed
astei wants to merge 2 commits

Conversation

astei commented Apr 19, 2021

See PaperMC/Velocity#476 for justification.

This should be less problematic: it picks a random delay (of up to five minutes) before sending the first data.
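
For reference, a minimal sketch of the proposed selection, assuming a plain ScheduledExecutorService and that the regular 30-minute interval is kept (the actual patch may wire this up differently):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class RandomInitialDelaySketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        // Random initial delay of up to five minutes, then the regular 30-minute interval.
        long initialDelay = ThreadLocalRandom.current().nextLong(TimeUnit.MINUTES.toMillis(5));
        scheduler.scheduleAtFixedRate(
                () -> { /* submit data to the bStats backend */ },
                initialDelay,
                TimeUnit.MINUTES.toMillis(30),
                TimeUnit.MILLISECONDS);
    }
}
```
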
astei changed the title from "Ensure the second delay is strictly greater than the first delay" to "Make initial delay selection more sane" on Apr 19, 2021

Bastian commented Apr 19, 2021

Thank you for the PR!

The current design with two random delays was fully intentional, and hitting ratelimits more often is an accepted compromise.

Let me start by explaining why the random delay even exists. The initial logic was very simple: wait 5 minutes after server startup and then send the data every 30 minutes. What I didn't consider back then was that many servers tend to restart at a fixed time, usually on a full hour mark (xx:00). This caused a spike in requests every 30 minutes, at xx:05 and xx:35, which resulted in high CPU usage at these times:
[Image: CPU usage graph showing request spikes at xx:05 and xx:35]

The first idea to solve this issue was very similar to your PR: just make the initial delay a bit random (e.g., MiniDigger suggested 5 min + a random delay between 0 and 15 min). This, however, comes with a new set of issues:

  • If the random delay is too large (like the suggested 15 min), it might take a very long time for a server to send its data initially, and we could miss some of it.
  • The load would still be unevenly distributed, just more spread out. This becomes especially relevant for smaller random delays like the 5 minutes in this PR. Instead of having about 100% more requests for ~1 minute at xx:05 and xx:35, these extra requests would now be distributed over 5 minutes. This would definitely be better, but it would still cause roughly 20% more requests during that period.

This means that it is necessary to find a good balance and choose a random delay that is neither too short nor too long, or alternatively to find another approach.

The current solution is a hybrid that tries to get the best of both worlds: it still uses a (short) initial random delay (between 3 and 6 minutes), but instead of just sending the data every 30 minutes afterwards, it introduces a second random delay between 0 and 30 minutes and only then starts the regular 30-minute interval. This makes it very likely to hit a ratelimit once initially (in fact, in nearly 50% of all cases). Since ratelimits are extremely (CPU-)inexpensive for the backend to handle, this is no issue at all. At the same time, it makes sure that requests are evenly distributed and thus solves the issue quite nicely.
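
For illustration, here is a minimal sketch of that hybrid scheduling, again assuming a plain ScheduledExecutorService; the actual Metrics class may differ in the details:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class HybridDelaySketch {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        Runnable submitData = () -> { /* send the data to the backend */ };

        // First delay: 3 minutes plus a random 0-3 minutes (i.e. 3-6 minutes after startup).
        long firstDelay = TimeUnit.MINUTES.toMillis(3)
                + ThreadLocalRandom.current().nextLong(TimeUnit.MINUTES.toMillis(3));
        // Second delay: a random 0-30 minutes that decides where in the 30-minute cycle
        // this server will sit from then on.
        long secondDelay = ThreadLocalRandom.current().nextLong(TimeUnit.MINUTES.toMillis(30));

        // One initial submission after the first delay...
        scheduler.schedule(submitData, firstDelay, TimeUnit.MILLISECONDS);
        // ...then the regular 30-minute interval, offset by the second delay.
        scheduler.scheduleAtFixedRate(submitData,
                firstDelay + secondDelay,
                TimeUnit.MINUTES.toMillis(30),
                TimeUnit.MILLISECONDS);
    }
}
```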

That being said, I agree that there is still space for improvements:

  1. It happens quite often that people report bugs because they received a 429 response from the backend. While you have to enable the logFailedRequests config option to see it (which is only intended for debugging), it does look scary with its huge stack trace and WARN log level. It might be a good idea to handle it differently and log a less verbose message without a stack trace, something along these lines: "Received 429 ratelimit from bStats. This is expected and not causing issues" (see the sketch after this list).
  2. The fixed initial delay of 3 minutes (3 min + random between 0 and 3 min) is a remnant of the old bStats Metrics classes, which sent the data for all plugins in a single request and thus needed to give the other plugins some time to start up (otherwise the first loaded plugin would initially only send its own data). This is not really necessary anymore and can probably be removed.
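
Roughly what the quieter 429 handling from point 1 could look like; the class, method, and parameter names below are hypothetical, not the existing Metrics API:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class RateLimitLogSketch {
    // Hypothetical response handler; `logFailedRequests` mirrors the config option
    // mentioned above, everything else is illustrative.
    static void handleResponse(int statusCode, boolean logFailedRequests, Logger logger) {
        if (!logFailedRequests) {
            return; // stay silent unless the debugging option is enabled
        }
        if (statusCode == 429) {
            // Calm one-liner instead of a WARN with a full stack trace.
            logger.info("Received 429 ratelimit from bStats. This is expected and not causing issues.");
        } else if (statusCode / 100 != 2) {
            logger.log(Level.WARNING, "Could not submit bStats data, backend responded with HTTP " + statusCode);
        }
    }
}
```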


astei commented Apr 20, 2021

> Let me start by explaining why the random delay even exists.

I know why it exists: having made the Buycraft plugin, we ran into this issue too (and we had the advantage of near-infinite scaling on AWS!), and I implemented randomization to spread out the load from servers sending requests to pull purchases from Buycraft.

I'll probably implement your ideas here though, just to make the issue much less scary-looking.

astei closed this May 5, 2021

astei commented May 5, 2021

Decided not to pursue this anymore; there are more pressing things to do.


Bastian commented May 5, 2021

Thanks anyway 🙂

I've created a new issue so it doesn't get forgotten: #86
