-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for TooManyRequests (429) error code #258
Conversation
@thesilentg Thanks for the well-formed PR and for thoroughly explaining the value this would provide. We discussed this yesterday internally, so I'm writing up the gist of that conversation here. Several folks agreed that the There was concern about where the line is drawn for adding new error codes, and whether it makes sense to adopt a similar approach to gRPC, where the project seems to discourage having definitive behavior based solely on error codes. From gRPC's status code docs:
My interpretation of gRPC's approach is that error codes mean different things for different APIs, and that error codes shouldn't always be handled the same way across all APIs. We haven't made up our minds on this yet, so it's certainly still possible we merge this as is, but hopefully this provides insight on why there is hesitation. One potential solution for the ambiguity of |
Thanks for the detailed response. Ultimately, the creation of rate limiting middleware is possible either with the creation of a new error code, using the Historical UsageAny new error code should have some amount of historical usage within the software engineering community prior to being included. By ensuring Twirp shares common terminology with the community at large, we make Twirp more likely to be adopted by others, easier to understand for new developers, and ensure we don't bloat the specification with terms and features that confuse others. Given its existence in the HTTP spec since 2012 (RFC 6585), I think the HTTP 429 error code which the Twirp error code would be modeled off of meets this criteria. Rate limiting and load shedding are also both well-established terms. Minimize Ambiguity with Existing Error CodesAny new error code should have a clear and non-ambiguous use. When a developer decides which error code to return due to some condition, we should strive to make the choice simple. I think the creation of a new error code actually helps here, as there is no clear existing error code to return when implementing rate limiting today. The nearest error code ( General Applicability/Opportunity for UsageAny new error code usage should be broadly applicable. Error codes which enable very niche purposes that only a small subset of developers will need to use are best left out. Although adding a new error code may seem to have no negative impact, the end result of error code bloat will make it harder for developers. To me, the prevalence of rate limiting seems to meet this criteria, and as such, so does the corresponding Twirp error code. Although not every developer may need to add rate limiting capabilities for their service, my guess is that a large subset of them will. Implementation Complexity for Desired Use CaseWhen comparing adding a new error code for rate limiting to the reasonable alternatives, I think the error code is actually the simplest option to implement. Lets look at the typical case of constructing some rate limiting server middleware for each of these options. Overall, each of these approaches is pretty simple/straightforward. 1. New Error Code TooManyRequests to indicate rate limitingRequestReceived: func(ctx context.Context) (context.Context, error) {
if rateLimiter.Allow(ctx) {
return ctx, nil
}
methodName, _ := twirp.MethodName(ctx)
return ctx, twirp.NewError(twirp.TooManyRequests, fmt.Sprintf("method %s has reached max rate limit %d", methodName, rateLimiter.MaxRate()))
}, This is super straightforward. You can pass a custom message along with the error code that describes why the request is being rate limited. 2. Use
|
This looks good 👍 Here are my thoughts:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should handle 429 errors from intermediary with the right error code:
twirp/protoc-gen-twirp/generator.go
Line 590 in d0e66ee
t.P(` case 429, 502, 503, 504: // Too Many Requests, Bad Gateway, Service Unavailable, Gateway Timeout`) |
We also have to update the handling from intermediaries in the Spec docs: https://github.com/twitchtv/twirp/blob/d0e66eecce5a8ef83db96ad71802978942598b0a/docs/errors.md#http-errors-from-intermediary-proxies
Also in the spec, note that the list of error codes has all 5xx errors at the bottom of the list. Would you mind moving the new definition for TooManyRequests up? I think next to ResourceExhausted is best, that will help gRPC users to see the specific code that Twirp uses next to the one they already use.
Backwards compatibility notes about adding the new code
|
The example @marioizquierdo shared from grpc-gateway makes it look like "resource exhausted" is synonymous with "too many requests"—that an API responding with error code @thesilentg , can you share what Google (as a notable gRPC user) does to communicate that a caller is exceeding their rate limit, either for public APIs (maps?) or for GCP? Do they send What examples are there of Twirp services communicating the idea of "too many requests" to their callers today? What status codes do they use, what additional information do they add to the response? What examples are there of Twirp services returning the status code Adding a new status code is a wire protocol change. Changing the HTTP code mapping is also a wire protocol change—and it's one of the details of HTTP that Twirp allows developers to ignore. If changes to the wire protocol are on the table, let's consider that whole set of options. |
That is how I read it.
I can do this is you prefer, but ultimately the link from @marioizquierdo was enough to shape my opinion here. Google is free to use The problem with using
Services using
The majority of services today use This
This codebase uses
Another usage of
This even uses
Ultimately, this splintering accross even the twitch-internal codebase says to me that it is not clear to Twirp users what should be returned for rate limiting.
Long story short, the plurality (but not majority) of use cases for
There seem to be some confusion about what specifically In regards to the desired http code of
In regards to if the user is able to perform an action:
In a different service:
Stepping back, here are my goals:
One way I thought of meeting these goals was by introducing a new erorr code which would get mapped to the http 429 error code. Another potential solution could be mapping the existing |
@thesilentg Excellent analysis, nice 👏 👏 Is clear that, in an ideal world, both gRPC and Twirp would have benefit from having a specific error
To complement the backwards compatibility analysis on comment #258 (comment), let's check what problems would be introduced if the code was updated. Backwards compatibility notes about changing the HTTP mapping code of
(✅ = it just works, |
The compatibility notes seem correct to me |
@marioizquierdo @rhysh What are your thought here in regards to:
With respect to 1 versus 2, compatibility seems far better for 1. I don't see a great third option, as issuecomment-673921997 shows the problems with a previously suggested option of using |
Thank you for the survey, @thesilentg . My read of the results are that having ResourceExhausted map to something other than 429 is a persistent source of confusion. · Some users specifically want a 429 status on the wire, and go out of their way to build their own (using ResourceExhausted as the trigger). What would a new Twirp status code for "please send fewer requests" say about uses of the old one? MDN is pretty clear on 429 being "the user has sent too many requests in a given amount of time" and 403 being "access is permanently forbidden", but the RFC for 403 is less clear (it doesn't say "permanent", and does say "a request might be forbidden for reasons unrelated to the credentials"). Twirp's description of ResourceExhausted doesn't mention time one way or another, saying "some resource has been exhausted". I think that @marioizquierdo 's compatibility analysis is technically correct (the best kind), but I don't agree with the implications of how important various points are. For example, do we know of any code that would ignore a Twirp status code in the message body and instead calculate its own Twirp status code based on the HTTP status (as if the response were from an intermediary)? If that code were to trigger today, per the protocol spec it would convert ResourceExhausted into PermissionDenied every time, even with no change here. It would convert InvalidArgument, Malformed, and OutOfRange into Internal. It would convert Internal into Unknown. I don't think earning a "❌ " from breaking that code further (by changing the remapping of ResourceExhausted from PermissionDenied to Unavailable) says much about the proposed change. How much of the problem could be solved by clarifying the meaning of ResourceExhausted in documentation? That would cover goals 1 and 3 very well. It doesn't cover goal 2, to be consistent with other web/http standards. Can you say more about why that goal is important @thesilentg ? |
I am conflicted myself with the options. Option 1 (add new error code) is the easy way out keeping backwards compatibility, but I agree that will create a point of confusion between gRPC and Twirp codes moving forward. Option 2 (changing existing error mapping) seems like a better end state, keeping things similar to the gRPC ecosystem, and breaking ambiguity on the mapping for 403. Twirp traditionally favors backwards compatibility, but this has been a constant source of friction. Heh, this is a hard decision, but at this moment, if I have to vote I would go with option 2: change the mapping of resource_exhausted from 403 to 429. We could draft "version 6" with some minor breaking changes like this one. A few users may have a hard time upgrading, but the vast majority of users will benefit from this code-mapping adjustment or not be affected at all. I'm also working in an update to routing (optional twirp prefix: #264 ) that could benefit from a major upgrade release. |
Let me add some more info into the mix.
This is not a suggestion to include all those changes in the same release, each of those updates is non-backwards compatible and should be taken with care. But maybe we can see an easy win in one of those and take advantage of the breaking update for "v6" if we end up updating the code for ResourceExhausted? In the other side of the discussion, Twirp has additional codes that do not exist in gRPC:
The first two, Malformed and BadRoute can only happen when receiving bad requests when doing manual json requests or using bogus clients, they are useful for debugging. That would not be the same with the new code for ratelimiting if added. |
In general, the more a framework (Twirp is a framework, just a very lightweight one) keeps consistent with established web standards, the easier it is for new users to onboard and understand what is going on. I think just updating the documentation introduces other challenges. If this was updated to say "ResourceExhausted is used for rate limiting", what we would recommend to people that are currently using |
Actually, this is actually true even if we move |
It's time to make a decision. Thanks for all the great information and analysis. Let's re-map the existing
We can close this PR and start working on a new PR with changes in code and docs, to be released as part of v7, as mentioned in #264 (comment) |
Closing in favor of #270 |
Add support for a "TooManyRequests" Twirp error code, which will get translated to the equivalent 429 HTTP error code.
I'm not super particular with the exact naming here and would be happy to move to something like
RateLimitExceeded
if others think this is better, but on initial glance exactly matching the corresponding code from the HTTP spec seemed valuable.Technically, exceeding a rate limit could be considered "exhausting a per-user quota" and be represented by the existing
ResourceExhausted
error. However, on first glance, my interpretation of receiving aResourceExhausted
error code as a client caller is that something is problematic or damaged on the server, not that my request pattern as a client is faulty. In fact, the Twirp service may be completely healthy, but choose to reject a client's traffic because it breaks some negotiated rate limit, even if it could handle the traffic just fine without exhausting its resources. In addition, there are many potential situations where there are no user-level limits and yet rate limiting is still expected. For example, a Twirp service may place global rate limit across all callers on a very expensive API operation to ensure a downstream remains healthy.Secondly, breaking this out into its own error code allows the very simple creation of rate-limiting Twirp middleware via
ServerHooks
. If we chose to reuse the existingResourceExhausted
error code, any middleware would have to inspect both the error code as well as actual error message to ensure that theResourceExhausted
error received was due to rate limiting instead of some other generic resource exhaustion. Checking the error message directly is more brittle to changes.In my mind, having a error code that either strictly or loosely correlates with the 429
TooManyRequests
error code from the HTTP spec is valuable for both Twirp clients, service developers and the Twirp ecosystem as a whole.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.