Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for TooManyRequests (429) error code #258

Closed
wants to merge 1 commit into from
Closed

Add support for TooManyRequests (429) error code #258

wants to merge 1 commit into from

Conversation

thesilentg
Copy link
Contributor

@thesilentg thesilentg commented Aug 11, 2020

Add support for a "TooManyRequests" Twirp error code, which will get translated to the equivalent 429 HTTP error code.

Why call this TooManyRequests instead of RateLimitExceeded?

I'm not super particular with the exact naming here and would be happy to move to something like RateLimitExceeded if others think this is better, but on initial glance exactly matching the corresponding code from the HTTP spec seemed valuable.

Why not use the existing ResourceExhausted error code to indicate the client should slow down

Technically, exceeding a rate limit could be considered "exhausting a per-user quota" and be represented by the existing ResourceExhausted error. However, on first glance, my interpretation of receiving a ResourceExhausted error code as a client caller is that something is problematic or damaged on the server, not that my request pattern as a client is faulty. In fact, the Twirp service may be completely healthy, but choose to reject a client's traffic because it breaks some negotiated rate limit, even if it could handle the traffic just fine without exhausting its resources. In addition, there are many potential situations where there are no user-level limits and yet rate limiting is still expected. For example, a Twirp service may place global rate limit across all callers on a very expensive API operation to ensure a downstream remains healthy.

Secondly, breaking this out into its own error code allows the very simple creation of rate-limiting Twirp middleware via ServerHooks. If we chose to reuse the existing ResourceExhausted error code, any middleware would have to inspect both the error code as well as actual error message to ensure that the ResourceExhausted error received was due to rate limiting instead of some other generic resource exhaustion. Checking the error message directly is more brittle to changes.

In my mind, having a error code that either strictly or loosely correlates with the 429 TooManyRequests error code from the HTTP spec is valuable for both Twirp clients, service developers and the Twirp ecosystem as a whole.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dpolansky
Copy link
Contributor

@thesilentg Thanks for the well-formed PR and for thoroughly explaining the value this would provide.

We discussed this yesterday internally, so I'm writing up the gist of that conversation here.

Several folks agreed that the ResourceExhausted error code is ambiguous and that it doesn't serve as a definitive indication of a rate limit being enforced. There was also agreement that the use case of rate limiting is common enough that it may make sense to have an error code dedicated to it.

There was concern about where the line is drawn for adding new error codes, and whether it makes sense to adopt a similar approach to gRPC, where the project seems to discourage having definitive behavior based solely on error codes. From gRPC's status code docs:

Applications that may wish to retry failed RPCs must decide which status codes on which to retry. As shown in the table above, the gRPC library can generate the same status code for different cases. Server applications can also return those same status codes. Therefore, there is no fixed list of status codes on which it is appropriate to retry in all applications. As a result, individual applications must make their own determination as to which status codes should cause an RPC to be retried.

My interpretation of gRPC's approach is that error codes mean different things for different APIs, and that error codes shouldn't always be handled the same way across all APIs. We haven't made up our minds on this yet, so it's certainly still possible we merge this as is, but hopefully this provides insight on why there is hesitation.

One potential solution for the ambiguity of ResourceExhausted (or other error codes) is to use Twirp's error meta feature to convey additional meaning about the error. For instance, a server could use the ResourceExhausted error code and use the error meta to indicate the time at which the caller can try again. Middleware or libraries could rely on data in the error meta in a standardized way. Have you considered this approach for implementing rate limiting?

@thesilentg
Copy link
Contributor Author

thesilentg commented Aug 14, 2020

Thanks for the detailed response. Ultimately, the creation of rate limiting middleware is possible either with the creation of a new error code, using the WithMeta feature to attach a more detailed reason to an existing error code (likely ResourceExhausted), or directly embedding the rate limiting reason in the error message. This means that this PR is more of a "nice-to-have" instead of a strict requirement. In mind, there are a few main areas of focus to evaluate when deciding whether to include a new error code.

Historical Usage

Any new error code should have some amount of historical usage within the software engineering community prior to being included. By ensuring Twirp shares common terminology with the community at large, we make Twirp more likely to be adopted by others, easier to understand for new developers, and ensure we don't bloat the specification with terms and features that confuse others. Given its existence in the HTTP spec since 2012 (RFC 6585), I think the HTTP 429 error code which the Twirp error code would be modeled off of meets this criteria. Rate limiting and load shedding are also both well-established terms.

Minimize Ambiguity with Existing Error Codes

Any new error code should have a clear and non-ambiguous use. When a developer decides which error code to return due to some condition, we should strive to make the choice simple. I think the creation of a new error code actually helps here, as there is no clear existing error code to return when implementing rate limiting today. The nearest error code (ResourceExhausted) has a wide set of uses, of which rate limiting is only tangentially related. By splitting out some of the "error space" to use this new code, it not only makes it clearer what should be done for rate limiting, but also simplifies the use cases for existing errors codes since it is currently not definitive that they should not be used to signify rate limiting is occurring.

General Applicability/Opportunity for Usage

Any new error code usage should be broadly applicable. Error codes which enable very niche purposes that only a small subset of developers will need to use are best left out. Although adding a new error code may seem to have no negative impact, the end result of error code bloat will make it harder for developers. To me, the prevalence of rate limiting seems to meet this criteria, and as such, so does the corresponding Twirp error code. Although not every developer may need to add rate limiting capabilities for their service, my guess is that a large subset of them will.

Implementation Complexity for Desired Use Case

When comparing adding a new error code for rate limiting to the reasonable alternatives, I think the error code is actually the simplest option to implement. Lets look at the typical case of constructing some rate limiting server middleware for each of these options. Overall, each of these approaches is pretty simple/straightforward.

1. New Error Code TooManyRequests to indicate rate limiting

RequestReceived: func(ctx context.Context) (context.Context, error) {
	if rateLimiter.Allow(ctx) {
		return ctx, nil
	}
	methodName, _ := twirp.MethodName(ctx)
	return ctx, twirp.NewError(twirp.TooManyRequests, fmt.Sprintf("method %s has reached max rate limit %d", methodName, rateLimiter.MaxRate()))
},

This is super straightforward. You can pass a custom message along with the error code that describes why the request is being rate limited.

2. Use ResourceExhausted and WithMeta to indicate rate limiting

const (
	ResourceExhaustionMetaKey = "resource_exhaustion_reason"
	RateLimitingMetaValue = "rate_limiting"
)

RequestReceived: func(ctx context.Context) (context.Context, error) {
    if rateLimiter.Allow(ctx) {
        return ctx, nil
    }
    methodName, _ := twirp.MethodName(ctx)
    return ctx, twirp.NewError(twirp.ResourceExhausted, fmt.Sprintf("method %s has reached max rate limit %d", methodName, rateLimiter.MaxRate())).WithMeta(ResourceExhaustionMetaKey, RateLimitingMetaValue)
},

This is still pretty straightforward. You can pass a custom message, but also need to define a shared meta key constant that any downstreams will use to determine why resource exhaustion is occurring. This shared meta key and value need to be known in advance for all services, but this can likely live in your rate limiting library.

3. Use ResourceExhausted with a common message to indicate rate limiting

const (
	RateLimitingMessage = "rate limiting"
)

RequestReceived: func(ctx context.Context) (context.Context, error) {
	if rateLimiter.Allow(ctx) {
		return ctx, nil
	}
	return ctx, twirp.NewError(twirp.ResourceExhausted, RateLimitingMessage)
},

This is super straightforward, but terrible. You lose the ability to provide any sort of custom message.

Client Understandability

Any changes to error codes should be understandable for clients of Twirp services. As mentioned before, I think TooManyRequests or RateLimitExceeded is far better here, as it helps the client understand that they are responsible for the error. ResourceExhausted leaves it ambiguous as to whether the client or server is responsible for the resource exhaustion, what steps the client should take to remedy the situation, or why they are are even receiving the error in the first place. More importantly, using the WithMeta approach is worse for one more important reason: logging. Most service's logging implementations will call .Error() on errors to log them. Structured loggers likely will call both .Error() for the full message and potentialy .Code() if they know the error is a Twirp error. Lets see what these print for the example below:

err := twirp.NewError(twirp.ResourceExhausted, "request rejected").WithMeta("Reason", "RateLimiting")
fmt.Println(err)
fmt.Printf("%v\n", err)
fmt.Println(err.Error())
fmt.Println(err.Msg())
fmt.Println(err.Code())

err2 := errors.Wrap(err, "failed to call downstream api1")
fmt.Println(err2)

Output:

twirp error resource_exhausted: request rejected
twirp error resource_exhausted: request rejected
twirp error resource_exhausted: request rejected
request rejected
resource_exhausted
failed to call downstream api1: twirp error resource_exhausted: request rejected

Yikes! None of the logging statements actually contain the important error info ("Reason": "RateLimiting") which would help a team debug why their request was being rejected. The only way the team could know that rate limiting is the reason for an error is by either:

  1. Always encoding some sort of rate limiting text into the error message field of the Twirp error. For example, having the server middleware call twirp.NewError(twirp.ResourceExhausted, "rate limiting"). This is not only brittle to any changes in naming, but also prevents service owners from filling in their own custom error message that is more informative (but may not contain the exact term "rate limiting"). Even though we use WithMeta(), we still have many of the disadvantages of just encoding the rate limiting text into the message itself.

  2. Have loggers always include the value of err.MetaMap() in any logging. This also has challenges. Not all Twirp services may implement rate limting, which leads to confusing logging statements with just empty MetaMap. Even for services which all implement rate limiting, not all error types will have non-empty MetaMap values, leading to error code that has to specifically check for this or just always log it. In addition, wrapping error messages is considered a best practice, with messages like error: failed to fetch channel status: could not determine user id: failed to lookup user: context timeout being common. The level where an error is logged in the client could be very far from where the original Twirp error was originally returned. At this point, the error has been wrapped many times and the code no longer even has the ability to call .MetaMap() (as the type is builtin.error, not twirp.Error)

Compare this with the output to the following code:

err := twirp.NewError(twirp.TooManyRequests, "request rejected")
fmt.Println(err)
fmt.Printf("%v\n", err)
fmt.Println(err.Error())
fmt.Println(err.Msg())
fmt.Println(err.Code())
err.MetaMap()

err2 := errors.Wrap(err, "failed to call downstream api1")
fmt.Println(err2)

Output:

twirp error too_many_requests: request rejected
twirp error too_many_requests: request rejected
twirp error too_many_requests: request rejected
request rejected
too_many_requests
failed to call downstream api1: twirp error too_many_requests: request rejected

The output is clear and gives the developer an immediate clue to the problem, even with what is a largely unhelpful error message body of request rejected passed to NewError().

Given these criteria, I think a new error code for the purposes of rate limiting "makes the cut".

@marioizquierdo
Copy link
Contributor

marioizquierdo commented Aug 14, 2020

This looks good 👍

Here are my thoughts:

  • The HTTP status code mapping for resource_exhausted should be 429 instead of the current value 403. The error status code mapping was originally done as a convenience, but in practice it is important, is often used for debugging, and makes it easy to understand the error code intentions. The original mapping was done to match gRPC Gateway, which at the time was using 403 for resource_exhausted. Later they changed the code to 429 (ref: Changed to use more appropriate http status code for ResourceExhausted grpc-ecosystem/grpc-gateway#580), but Twirp is more strict about breaking backwards compatibility.
  • We should not add new error codes; different client implementations may break in different ways. But the issue with rate-limiting is very common and deserves first-class error support. Updating an existing code has more issues with compatibility. We should re-consider the impact of updating the existing resource_exhausted code to 429, and if we are still not willing to do the update, then adding the new code for too_many_requests is a good solution.
  • too_many_requests is a good name for rate-limiting errors. In the other hand resource_exhausted could be ambiguous.

Copy link
Contributor

@marioizquierdo marioizquierdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should handle 429 errors from intermediary with the right error code:

t.P(` case 429, 502, 503, 504: // Too Many Requests, Bad Gateway, Service Unavailable, Gateway Timeout`)

We also have to update the handling from intermediaries in the Spec docs: https://github.com/twitchtv/twirp/blob/d0e66eecce5a8ef83db96ad71802978942598b0a/docs/errors.md#http-errors-from-intermediary-proxies

Also in the spec, note that the list of error codes has all 5xx errors at the bottom of the list. Would you mind moving the new definition for TooManyRequests up? I think next to ResourceExhausted is best, that will help gRPC users to see the specific code that Twirp uses next to the one they already use.

@marioizquierdo
Copy link
Contributor

marioizquierdo commented Aug 14, 2020

Backwards compatibility notes about adding the new code too_many_requests:

  • ✅ : New clients can handle all codes from old services.
  • ✅ : All existing middleware will work as expected (resource_exhausted still maps to 403).
  • ⚠️ : Old clients in Go handle unknown error codes as internal errors. New services returning too_many_requests will result in internal errors on old clients.
  • ⚠️ : Old clients in Go handle 429 errors from intermediaries (middleware) as unavailable (see code). New clients will handle them as too_many_requests.
  • ⚠️ : Old clients in different languages have different ways to handle unknown error codes, but most clients return internal errors too. For example the Ruby client returns internal errors. We should file an issue on other implementations to warn them of the update.

@rhysh
Copy link
Contributor

rhysh commented Aug 14, 2020

The example @marioizquierdo shared from grpc-gateway makes it look like "resource exhausted" is synonymous with "too many requests"—that an API responding with error code codes.ResourceExhausted means the gateway converts it unconditionally to HTTP status 429. Does that match how you read it?

@thesilentg , can you share what Google (as a notable gRPC user) does to communicate that a caller is exceeding their rate limit, either for public APIs (maps?) or for GCP? Do they send codes.ResourceExhausted alone, do they send that code but with meta info that spells out the rest of the story, or do they do something else? What other examples can you find for how API providers that ship type-safe standardized/generated SDKs communicate that condition?

What examples are there of Twirp services communicating the idea of "too many requests" to their callers today? What status codes do they use, what additional information do they add to the response?

What examples are there of Twirp services returning the status code twirp.ResourceExhausted to callers today? What do they intend to communicate when they return that response?

Adding a new status code is a wire protocol change. Changing the HTTP code mapping is also a wire protocol change—and it's one of the details of HTTP that Twirp allows developers to ignore. If changes to the wire protocol are on the table, let's consider that whole set of options.

@thesilentg
Copy link
Contributor Author

thesilentg commented Aug 24, 2020

The example @marioizquierdo shared from grpc-gateway makes it look like "resource exhausted" is synonymous with "too many requests"—that an API responding with error code codes.ResourceExhausted means the gateway converts it unconditionally to HTTP status 429. Does that match how you read it?

That is how I read it.

@thesilentg , can you share what Google (as a notable gRPC user) does to communicate that a caller is exceeding their rate limit, either for public APIs (maps?) or for GCP? Do they send codes.ResourceExhausted alone, do they send that code but with meta info that spells out the rest of the story, or do they do something else? What other examples can you find for how API providers that ship type-safe standardized/generated SDKs communicate that condition?

I can do this is you prefer, but ultimately the link from @marioizquierdo was enough to shape my opinion here. Google is free to use resource exhausted to indicate rate limiting/too many requests, but only because they are explicity mapping it back to HTTP status 429. That is to say, the HTTP status code is what is making the expected user behvior clear, not necessarily the name of the error itself. Google could pass back an error called "You've upset Sundar", and so long as this was accompanied by the 429 HTTP status, it would be clear to users that the purpose was rate limiting.

The problem with using twirp.ResourceExhausted in its current from is that it is mapped to HTTP status 403, which indicates a different thing than HTTP status 429. For example, see this excerpt from the Mozilla documentation on the 403 error code (the first Google result for "403 status code"):

The HTTP 403 Forbidden client error status response code indicates that the server understood the request but refuses to authorize it. This status is similar to 401, but in this case, re-authenticating will make no difference. The access is permanently forbidden and tied to the application logic, such as insufficient rights to a resource.

Services using twirp.ResourceExhausted for rate limiting are certainly not meaning to indicate "The access is permanently forbidden".

What examples are there of Twirp services communicating the idea of "too many requests" to their callers today? What status codes do they use, what additional information do they add to the response?

The majority of services today use twirp.ResourceExhausted to communicate this. However, I think this is purely because twirp.ResourceExhausted is the closest applicable error code that existed at the time the code was written, not because this is actually the best representation. See code comments below:

This //TODO comment specifically mentions that Twirp lacks an error code which translates to http status code 429!

// ErrTooManyRequests is thrown if a store is under too much request pressure and requires a cooldown period
// TODO twirp doesn't have a 429
ErrTooManyRequests = apiutils.NewErrorResponse(http.StatusTooManyRequests, "Too many requests")

This codebase uses twirp.Unavailable, and encodes the rate limiting context into the error message.

return &twirp.ServerHooks{
    RequestReceived: func(ctx context.Context) (context.Context, error) {
        if !lim.Allow() {
            return ctx, twirp.NewError(twirp.Unavailable, "rate limit exceeded")
        }
        return ctx, nil
    },
}

Another usage of twirp.Unavailable

if state.loadShedding {
    return twirp.NewError(twirp.Unavailable, "too many requests")
}

This even uses twirp.InvalidArgument, even though I don't think that is representative at all of rate limiting.

if isRateLimited {
    return twirp.NewError(twirp.InvalidArgument, "too many requests")
}

Ultimately, this splintering accross even the twitch-internal codebase says to me that it is not clear to Twirp users what should be returned for rate limiting.

What examples are there of Twirp services returning the status code twirp.ResourceExhausted to callers today? What do they intend to communicate when they return that response?

Long story short, the plurality (but not majority) of use cases for twirp.ResourceExhausted within the Twitch code base is for rate limiting. However, most usages of twirp.ResourceExhausted today are actually not for rate limiting purposes. Here's the breakdown:

10 Usages: Rate Limiting purely on request volume
9 Usages: User has reached some sort of Global account-level limit 
3 Usages: User has recently performed an action and must wait some time before doing it again 
5 other Usages: Specific to the given service

There seem to be some confusion about what specifically twirp.ResourceExhausted is meant for. Here are few comments from different pieces of the codebase:

In regards to the desired http code of twirp.ResourceExhausted:

// twirp.ServerHTTPStatusFromErrorCode maps twirp.ResourceExhausted to 403.
// We want 429

In regards to if the user is able to perform an action:

case 403: // Forbidden
    return twirp.PermissionDenied // could also be twirp.ResourceExhausted

In a different service:

case 403:
    return twirp.NewError(twirp.ResourceExhausted, "forbidden")

Adding a new status code is a wire protocol change. Changing the HTTP code mapping is also a wire protocol change—and it's one of the details of HTTP that Twirp allows developers to ignore. If changes to the wire protocol are on the table, let's consider that whole set of options.

Stepping back, here are my goals:

  1. Users of Twirp have a clear, consistent way of indicating rate limiting is occurring
  2. The Twirp implementation that indicates when rate limiting is occurring is generally consistent with other web/http standards
  3. There aren't significant usability/interpretability issues for old clients

One way I thought of meeting these goals was by introducing a new erorr code which would get mapped to the http 429 error code. Another potential solution could be mapping the existing twirp.ResourceExhausted error code to http status code 429 and in doing so bless it as the expected solution to be used to indicate rate limiting. There are likely other solutions as well. However, I feel that simply keeping the existing behavior for twirp.ResourceExhausted will continue the to propagate confusion among users as to what that they should to do to indicate rate limiting. I'd rather create a clear standard now as compared to continuing to let the number of use cases for twirp.ResourceExhausted grow.

@marioizquierdo
Copy link
Contributor

marioizquierdo commented Aug 24, 2020

@thesilentg Excellent analysis, nice 👏 👏

Is clear that, in an ideal world, both gRPC and Twirp would have benefit from having a specific error TooManyRequests, but only if it was added from the beginning. In gRPC the issue was somewhat fixed by explicitly mapping ResourceExhausted into 429, which besides being an HTML code, is also the well known industry standard for ratelimiting (most API developers know about this code). So, we are left this this dilema on Twirp:

  • If we do nothing then Twirp users will have to keep being inconsistent about signaling ratelimit errors.
  • If we add the new code TooManyRequests, Twirp will differ slightly from gRPC codes. Not a big deal, but it is nice to being able to transfer knowledge from one framework to another.
  • If we change the HTTP mapping on ResourceExhausted into 429, it will break backwards compatibility in Twirp.

To complement the backwards compatibility analysis on comment #258 (comment), let's check what problems would be introduced if the code was updated.

Backwards compatibility notes about changing the HTTP mapping code of resource_exhausted from 403 to 429:

  • ✅ : New clients can handle all codes from old services (HTTP codes are not used to deserialize errors).
  • ⚠️ : Old clients in Go handle 429 errors from intermediaries (non-twirp services or middleware) as unavailable (see code). New clients will handle them as resource_exhausted.
  • ⚠️ : Old clients in different languages may have different ways to handle errors from intermediaries. We should file an issue on other implementations to warn them of the update.
  • ⚠️ : Old clients receiving errors from new services will still map resource_exhausted as 403, even if in the server it is mapped to 429 (the mapping is done in code).
  • ❌ : Middleware and proxies that depend on HTTP codes need to be reviewed and updated to handle 429. This update is not easy, and in some cases may be impossible (maybe the middleware is managed by a different team). During the update, the middleware needs to handle both 429 and 403 during codes properly.
  • ❌ : Monitoring based on HTTP codes needs to be reviewed.

(✅ = it just works, ⚠️ = corner cases may need attention, ❌ = code depending on previous behavior needs to be updated)

@thesilentg
Copy link
Contributor Author

thesilentg commented Aug 24, 2020

The compatibility notes seem correct to me

@thesilentg
Copy link
Contributor Author

@marioizquierdo @rhysh What are your thought here in regards to:

  1. Adding a new Error code that gets mapped to 429
  2. Mapping the existing twirp.ResourceExhausted to 429
  3. Some other third option

With respect to 1 versus 2, compatibility seems far better for 1. I don't see a great third option, as issuecomment-673921997 shows the problems with a previously suggested option of using WithMeta. Maybe you have other ideas here.

@rhysh
Copy link
Contributor

rhysh commented Aug 25, 2020

Thank you for the survey, @thesilentg .

My read of the results are that having ResourceExhausted map to something other than 429 is a persistent source of confusion.

· Some users specifically want a 429 status on the wire, and go out of their way to build their own (using ResourceExhausted as the trigger).
· Some users don't care how they get a 403 status on the wire, and are confused that there are two options.
· Some users don't see a way to communicate the idea of a 429, so make something else up instead.

What would a new Twirp status code for "please send fewer requests" say about uses of the old one? MDN is pretty clear on 429 being "the user has sent too many requests in a given amount of time" and 403 being "access is permanently forbidden", but the RFC for 403 is less clear (it doesn't say "permanent", and does say "a request might be forbidden for reasons unrelated to the credentials"). Twirp's description of ResourceExhausted doesn't mention time one way or another, saying "some resource has been exhausted".

I think that @marioizquierdo 's compatibility analysis is technically correct (the best kind), but I don't agree with the implications of how important various points are. For example, do we know of any code that would ignore a Twirp status code in the message body and instead calculate its own Twirp status code based on the HTTP status (as if the response were from an intermediary)? If that code were to trigger today, per the protocol spec it would convert ResourceExhausted into PermissionDenied every time, even with no change here. It would convert InvalidArgument, Malformed, and OutOfRange into Internal. It would convert Internal into Unknown. I don't think earning a "❌ " from breaking that code further (by changing the remapping of ResourceExhausted from PermissionDenied to Unavailable) says much about the proposed change.

How much of the problem could be solved by clarifying the meaning of ResourceExhausted in documentation? That would cover goals 1 and 3 very well. It doesn't cover goal 2, to be consistent with other web/http standards. Can you say more about why that goal is important @thesilentg ?

@marioizquierdo
Copy link
Contributor

marioizquierdo commented Aug 25, 2020

I am conflicted myself with the options. Option 1 (add new error code) is the easy way out keeping backwards compatibility, but I agree that will create a point of confusion between gRPC and Twirp codes moving forward. Option 2 (changing existing error mapping) seems like a better end state, keeping things similar to the gRPC ecosystem, and breaking ambiguity on the mapping for 403. Twirp traditionally favors backwards compatibility, but this has been a constant source of friction.

Heh, this is a hard decision, but at this moment, if I have to vote I would go with option 2: change the mapping of resource_exhausted from 403 to 429.

We could draft "version 6" with some minor breaking changes like this one. A few users may have a hard time upgrading, but the vast majority of users will benefit from this code-mapping adjustment or not be affected at all. I'm also working in an update to routing (optional twirp prefix: #264 ) that could benefit from a major upgrade release.

@marioizquierdo
Copy link
Contributor

Let me add some more info into the mix.

ResourceExhausted is not the only error code that has drifted from gRPC canonical error codes mappings to HTTP codes since Twirp was created. There has been more updates to gRPC Gateway. Specifically, here are all the codes that we should change if we wanted to "match" the mapping on gRPC again:

  • Canceled: 408 Request Timeout => 499 Client Closed Request
  • DeadlineExceeded: 408 RequestTimeout => 504 Gateway Timeout
  • ResourceExhausted: 403 Forbidden => 429 Too Many Requests
  • FailedPrecondition: 412 Precondition Failed => 400 Bad Request

This is not a suggestion to include all those changes in the same release, each of those updates is non-backwards compatible and should be taken with care. But maybe we can see an easy win in one of those and take advantage of the breaking update for "v6" if we end up updating the code for ResourceExhausted?

In the other side of the discussion, Twirp has additional codes that do not exist in gRPC:

  • Malformed: 400 BadRequest
  • BadRoute: 404 Not Found
  • (potential) TooManyRequests: 429 Too Many Requests

The first two, Malformed and BadRoute can only happen when receiving bad requests when doing manual json requests or using bogus clients, they are useful for debugging. That would not be the same with the new code for ratelimiting if added.

@thesilentg
Copy link
Contributor Author

How much of the problem could be solved by clarifying the meaning of ResourceExhausted in documentation? That would cover goals 1 and 3 very well. It doesn't cover goal 2, to be consistent with other web/http standards. Can you say more about why that goal is important @thesilentg ?

In general, the more a framework (Twirp is a framework, just a very lightweight one) keeps consistent with established web standards, the easier it is for new users to onboard and understand what is going on.

I think just updating the documentation introduces other challenges. If this was updated to say "ResourceExhausted is used for rate limiting", what we would recommend to people that are currently using twirp.ResourceExhausted for purposes other than rate limiting? Should they move to some other error code? If so, this may actually add even more confusion, as twirp.ResourceExhausted seems like a pretty good description for things like global concurrent limits for a user. In addition, I think the inevitable question will be "if this is for rate limiting, why does it return 403 when the established error code is 429"?

@thesilentg
Copy link
Contributor Author

what we would recommend to people that are currently using twirp.ResourceExhausted for purposes other than rate limiting? Should they move to some other error code?

Actually, this is actually true even if we move twirp.resourceExhausted to return 429. How would we choose to communicate things like "you can't follow any more users" or "you can't have any more stream keys" if twirp.resourceExhausted is moved to communicate purely rate limiting

@marioizquierdo
Copy link
Contributor

marioizquierdo commented Sep 10, 2020

It's time to make a decision. Thanks for all the great information and analysis.

Let's re-map the existing ResourceExhausted to 429. Here is a summary of the reasons:

  • Clarifies what error should be returned for rate limiting.
  • ResourceExhausted can still be used for other purposes like "you can't follow any more users" or "you can't have any more stream keys", but the 429 code is still reasonable for those cases. It doesn't have to be purely used for rate limiting, just like in gRPC.
  • Developers used to work with gRPC will find more intuitive to have the same HTTP code mapping on the same error code.
  • Changing ResourceExhausted mapping from 403 to 429 is not entirely backwards compatible, but based on usage analysis, this update will be minor or even desired in some cases. This only affects middleware and monitoring; services and clients don't use the HTTP code mapping to decode errors.
  • Fixes the ambiguity between PermissionDenied and ResourceExhausted that were both mapping to 403.
  • Does not add a new error code.

We can close this PR and start working on a new PR with changes in code and docs, to be released as part of v7, as mentioned in #264 (comment)

@marioizquierdo
Copy link
Contributor

Closing in favor of #270

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants