Skip to content

Commit

Permalink
docs: add FAQ/troubleshooting and operation timing guide
Browse files Browse the repository at this point in the history
  • Loading branch information
lucix-aws committed Dec 21, 2023
1 parent 54f1c1e commit aa70a99
Show file tree
Hide file tree
Showing 3 changed files with 346 additions and 6 deletions.
86 changes: 86 additions & 0 deletions content/en/docs/faq/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
---
title: "Frequently Asked Questions"
linkTitle: "FAQ / Troubleshooting"
description: "Answers to some commonly-asked questions about the {{% alias sdk-go %}}"
weight: 9
---

### How do I configure my SDK's HTTP client? Are there any guidelines or best practices?

We are unable to provide guidance to customers on how to configure their HTTP
workflow in a manner that is most effective for their particular workload. The
answer to this is the product of a multivariate equation, with input factors
including but not limited to:

* the network footprint of the application (TPS, throughput, etc.)
* the services being used
* the compute characteristics of the deployment
* the geographical nature of the deployment
* the desired application behavior or needs of the application itself (SLAs,
timings, etc.)

### How should I configure operation timeouts?

Much like the previous question, it depends. **The answer to this question,
however, should almost NEVER be based on pure empirical observation** - e.g. "I
made 1000 calls to this operation, it took at most 5 seconds so I will set the
timeout based on that with a safety factor of 2x to 10 seconds". Environment
conditions can change and these types of assumptions can become wrong without
warning, leaving you open to the possibility of self-induced production pain.

### Requests made by the SDK are timing out or taking too long, how do I fix this?

We are unable to assist with extended or timed-out operation calls due to
extended HTTP round-trip times. An HTTP round-trip is defined as the time
between the first byte of the request being sent and the last byte of the
response being received. Note that in certain SDKs this can differ from "HTTP
client call time", which is generally the sum of round-trip time and other
bootstrapping operations such as TCP connection management.

If you are experiencing issues due to operation latency or timeouts, your first
course of action should be to obtain telemetry of the SDK operation lifecycle
to determine the timing breakdown between the HTTP round-trip and the
surrounding overhead of the operation. See the guide on
[timing SDK operations]({{< ref "/docs/faq/timing-operations.md" >}}),
which contains a reusable code snippet that can achieve this.

### How do I fix a `read: connection reset` error?

The SDK retries any errors matching the `connection reset` pattern by default.
However, this error can still occur in a context **outside** of the retry loop:
certain service operations directly forward the API's HTTP response body to the
caller to be consumed from the wire directly via `io.ReadCloser` (e.g.
`GetObject`'s object payload). You may encounter this error when performing a
`Read` on the response body.

This error indicates that your host, the service or any intermediary party
(e.g. NAT gateways, proxies, load balancers) closed the connection while
attempting to read the response.

This can occur for several reasons:
* You did not consume the response body for time after the response itself was
received (after the service operation was called). **We recommend you consume
the HTTP response body as soon as possible for these types of operations.**
* You did not close a previously-received response body. This can cause
connection resets on certain platforms. **You MUST close any `io.ReadCloser`
instances provided in an operation's response, regardless of whether you
consume its contents.**

Beyond that, try running a tcpdump for an affected connection at the edge of
your network (e.g. after any proxies that you control). If you see that the AWS
endpoint seems to be sending a TCP RST, you should use the AWS support console
to open a case against the offending service. Be prepared to provide request
IDs and specific timestamps of when the issue occured.

### Why am I getting "invalid signature" errors when using an HTTP proxy with the SDK?

The signature algorithm for AWS services (generally sigv4) is tied to the
serialized request's headers, more specifically most headers prefixed with
`X-`. Proxies are prone to modifying the outgoing request by adding additional
forwarding information (often via an `X-Forwarded-For` header) which
effectively breaks the signature that the SDK calculated.

If you're using an HTTP proxy and experiencing signature errors, you should
work to capture the request **as it appears outgoing from the proxy** and
determine whether it is different.

247 changes: 247 additions & 0 deletions content/en/docs/faq/timing-operations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,247 @@
---
title: "Timing SDK operations"
linkTitle: "Timing Operations"
description: "How to perform basic instrumentation in the {{% alias sdk-go %}} to time SDK operations"
weight: 1
---

When debugging timeout / latency issues in the SDK, it is critical to identify
the components of the operation lifecycle which are taking more time to execute
than expected. As a starting point, you will generally need to inspect the
timing breakdown between the overall operation call and the HTTP call itself.

The following sample program implements a basic instrumentation probe in terms
of `smithy-go` middleware for SQS clients and demonstrates how it is used. The
probe emits the following information for each operation call:

* AWS request ID
* service ID
* operation name
* operation invocation time
* http call time

Each emitted message is prefixed with a unique (to a single operation)
"invocation ID" which is set at the beginning of the handler stack.

The entry point for instrumentation is exposed as `WithOperationTiming`, which
is parameterized to accept a message handling function which will receive
instrumentation "events" as formatted strings. `PrintfMSGHandler` is provided
as a convenience which will simply dump messages to stdout.

The service used here is interchangeable - ALL service client options accept
`APIOptions` and an `HTTPClient` as configuration. For example,
`WithOperationTiming` could instead be declared as:

```go
func WithOperationTiming(msgHandler func(string)) func(*s3.Options)
func WithOperationTiming(msgHandler func(string)) func(*dynamodb.Options)
// etc.
```

If you change it, be sure to change the signature of the function it returns as
well.

```
import (
"context"
"fmt"
"log"
"net/http"
"sync"
"time"
awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
"github.com/aws/aws-sdk-go-v2/config"
"github.com/aws/aws-sdk-go-v2/service/sqs"
"github.com/aws/smithy-go/middleware"
smithyrand "github.com/aws/smithy-go/rand"
)
// WithOperationTiming instruments an SQS client to dump timing information for
// the following spans:
// - overall operation time
// - HTTPClient call time
//
// This instrumentation will also emit the request ID, service name, and
// operation name for each invocation.
//
// Accepts a message "handler" which is invoked with formatted messages to be
// handled externally, you can use the declared PrintfMSGHandler to simply dump
// these values to stdout.
func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) {
return func(o *sqs.Options) {
o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler))
o.HTTPClient = &timedHTTPClient{
client: awshttp.NewBuildableClient(),
msgHandler: msgHandler,
}
}
}
// PrintfMSGHandler writes messages to stdout.
func PrintfMSGHandler(msg string) {
fmt.Printf("%s\n", msg)
}
type invokeIDKey struct{}
func setInvokeID(ctx context.Context, id string) context.Context {
return middleware.WithStackValue(ctx, invokeIDKey{}, id)
}
func getInvokeID(ctx context.Context) string {
id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string)
return id
}
func timeSpan(ctx context.Context, name string, consumer func(string)) func() {
start := time.Now()
return func() {
elapsed := time.Now().Sub(start)
consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed))
}
}
type timedHTTPClient struct {
client *awshttp.BuildableClient
msgHandler func(string)
}
func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) {
defer timeSpan(r.Context(), "http", c.msgHandler)()
resp, err := c.client.Do(r)
if err != nil {
return nil, fmt.Errorf("inner client do: %v", err)
}
return resp, nil
}
type addInvokeIDMiddleware struct {
msgHandler func(string)
}
func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" }
func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
out middleware.InitializeOutput, md middleware.Metadata, err error,
) {
id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID()
if err != nil {
return out, md, fmt.Errorf("new uuid: %v", err)
}
return next.HandleInitialize(setInvokeID(ctx, id), in)
}
type timeOperationMiddleware struct {
msgHandler func(string)
}
func (*timeOperationMiddleware) ID() string { return "timeOperation" }
func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
middleware.InitializeOutput, middleware.Metadata, error,
) {
defer timeSpan(ctx, "operation", m.msgHandler)()
return next.HandleInitialize(ctx, in)
}
type emitMetadataMiddleware struct {
msgHandler func(string)
}
func (*emitMetadataMiddleware) ID() string { return "emitMetadata" }
func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
middleware.InitializeOutput, middleware.Metadata, error,
) {
out, md, err := next.HandleInitialize(ctx, in)
invokeID := getInvokeID(ctx)
requestID, _ := awsmiddleware.GetRequestIDMetadata(md)
service := awsmiddleware.GetServiceID(ctx)
operation := awsmiddleware.GetOperationName(ctx)
m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID))
m.msgHandler(fmt.Sprintf(`[%s] service = "%s"`, invokeID, service))
m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation))
return out, md, err
}
func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error {
return func(s *middleware.Stack) error {
if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil {
return fmt.Errorf("add time operation middleware: %v", err)
}
if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil {
return fmt.Errorf("add invoke id middleware: %v", err)
}
if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil {
return fmt.Errorf("add emit metadata middleware: %v", err)
}
return nil
}
}
func main() {
cfg, err := config.LoadDefaultConfig(context.Background())
if err != nil {
log.Fatal(fmt.Errorf("load default config: %v", err))
}
svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler))
var wg sync.WaitGroup
for i := 0; i < 6; i++ {
wg.Add(1)
go func() {
defer wg.Done()
_, err = svc.ListQueues(context.Background(), nil)
if err != nil {
fmt.Println(fmt.Errorf("list queues: %v", err))
}
}()
}
wg.Wait()
}
```

A sample output of this program:

```
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] http: 192.24067ms
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] requestID = "dbee3082-96a3-5b23-adca-6d005696fa94"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] service = "SQS"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation = "ListQueues"
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation: 193.098393ms
[0740f0e0-953e-4328-94fc-830a5052e763] http: 195.185732ms
[0740f0e0-953e-4328-94fc-830a5052e763] requestID = "48b301fa-fc9f-5f1f-9007-5c783caa9322"
[0740f0e0-953e-4328-94fc-830a5052e763] service = "SQS"
[0740f0e0-953e-4328-94fc-830a5052e763] operation = "ListQueues"
[0740f0e0-953e-4328-94fc-830a5052e763] operation: 195.725491ms
[c0589832-f351-4cc7-84f1-c656eb79dbd7] http: 200.52383ms
[444030d0-6743-4de5-bd91-bc40b2b94c55] http: 200.525919ms
[c0589832-f351-4cc7-84f1-c656eb79dbd7] requestID = "4a73cc82-b47b-56e1-b327-9100744e1b1f"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] service = "SQS"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation = "ListQueues"
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation: 201.214365ms
[444030d0-6743-4de5-bd91-bc40b2b94c55] requestID = "ca1523ed-1879-5610-bf5d-7e6fd84cabee"
[444030d0-6743-4de5-bd91-bc40b2b94c55] service = "SQS"
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation = "ListQueues"
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation: 201.197071ms
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] http: 206.449568ms
[12b2b39d-df86-4648-a436-ff0482d13340] http: 206.526603ms
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] requestID = "64229710-b552-56ed-8f96-ca927567ec7b"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] service = "SQS"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation = "ListQueues"
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation: 207.252357ms
[12b2b39d-df86-4648-a436-ff0482d13340] requestID = "76d9cbc0-07aa-58aa-98b7-9642c79f9851"
[12b2b39d-df86-4648-a436-ff0482d13340] service = "SQS"
[12b2b39d-df86-4648-a436-ff0482d13340] operation = "ListQueues"
[12b2b39d-df86-4648-a436-ff0482d13340] operation: 207.360621ms
```
19 changes: 13 additions & 6 deletions content/en/docs/making-requests.md
Original file line number Diff line number Diff line change
Expand Up @@ -233,14 +233,21 @@ if err != nil {
For more information on error handling, including how to inspect for specific error types, see the
[Handling Errors]({{% ref "handling-errors.md" %}}) documentation.

#### Responses with io.ReadCloser
#### Responses with `io.ReadCloser`

Some API operations return a response struct that contain an output member that is an `io.ReadCloser`. If you're making
requests with these operations, always be sure to call `io.ReadCloser` member's `Close` method after you've completed
reading the content.
Some API operations return a response struct that contain an output member that
is an `io.ReadCloser`. This will be the case for API operations that expose
some element of their output in the body of the HTTP response itself.

For example {{% alias service=S3 %}} `GetObject` operation returns a response
whose `Body` member is an `io.ReadCloser`:
For example, {{% alias service=S3 %}} `GetObject` operation returns a response
whose `Body` member is an `io.ReadCloser` for accessing the object payload.

{{% pageinfo color="warning" %}}
**You MUST ALWAYS `Close()` any `io.ReadCloser` output members, regardless of
whether you've consumed its content. Failure to do so can leak resources and
potentially create issues with reading response bodies for operations called in
the future.**
{{% /pageinfo %}}

```go
resp, err := s3svc.GetObject(context.TODO(), &s3.GetObjectInput{...})
Expand Down

0 comments on commit aa70a99

Please sign in to comment.