From aa70a99eac48cad92dfe6a0904f40e831b0f9d50 Mon Sep 17 00:00:00 2001 From: Luc Talatinian Date: Thu, 21 Dec 2023 11:24:35 -0500 Subject: [PATCH] docs: add FAQ/troubleshooting and operation timing guide --- content/en/docs/faq/_index.md | 86 ++++++++ content/en/docs/faq/timing-operations.md | 247 +++++++++++++++++++++++ content/en/docs/making-requests.md | 19 +- 3 files changed, 346 insertions(+), 6 deletions(-) create mode 100644 content/en/docs/faq/_index.md create mode 100644 content/en/docs/faq/timing-operations.md diff --git a/content/en/docs/faq/_index.md b/content/en/docs/faq/_index.md new file mode 100644 index 00000000000..e208e4a1c83 --- /dev/null +++ b/content/en/docs/faq/_index.md @@ -0,0 +1,86 @@ +--- +title: "Frequently Asked Questions" +linkTitle: "FAQ / Troubleshooting" +description: "Answers to some commonly-asked questions about the {{% alias sdk-go %}}" +weight: 9 +--- + +### How do I configure my SDK's HTTP client? Are there any guidelines or best practices? + +We are unable to provide guidance to customers on how to configure their HTTP +workflow in a manner that is most effective for their particular workload. The +answer to this is the product of a multivariate equation, with input factors +including but not limited to: + +* the network footprint of the application (TPS, throughput, etc.) +* the services being used +* the compute characteristics of the deployment +* the geographical nature of the deployment +* the desired application behavior or needs of the application itself (SLAs, + timings, etc.) + +### How should I configure operation timeouts? + +Much like the previous question, it depends. **The answer to this question, +however, should almost NEVER be based on pure empirical observation** - e.g. "I +made 1000 calls to this operation, it took at most 5 seconds so I will set the +timeout based on that with a safety factor of 2x to 10 seconds". Environment +conditions can change and these types of assumptions can become wrong without +warning, leaving you open to the possibility of self-induced production pain. + +### Requests made by the SDK are timing out or taking too long, how do I fix this? + +We are unable to assist with extended or timed-out operation calls due to +extended HTTP round-trip times. An HTTP round-trip is defined as the time +between the first byte of the request being sent and the last byte of the +response being received. Note that in certain SDKs this can differ from "HTTP +client call time", which is generally the sum of round-trip time and other +bootstrapping operations such as TCP connection management. + +If you are experiencing issues due to operation latency or timeouts, your first +course of action should be to obtain telemetry of the SDK operation lifecycle +to determine the timing breakdown between the HTTP round-trip and the +surrounding overhead of the operation. See the guide on +[timing SDK operations]({{< ref "/docs/faq/timing-operations.md" >}}), +which contains a reusable code snippet that can achieve this. + +### How do I fix a `read: connection reset` error? + +The SDK retries any errors matching the `connection reset` pattern by default. +However, this error can still occur in a context **outside** of the retry loop: +certain service operations directly forward the API's HTTP response body to the +caller to be consumed from the wire directly via `io.ReadCloser` (e.g. +`GetObject`'s object payload). You may encounter this error when performing a +`Read` on the response body. + +This error indicates that your host, the service or any intermediary party +(e.g. NAT gateways, proxies, load balancers) closed the connection while +attempting to read the response. + +This can occur for several reasons: +* You did not consume the response body for time after the response itself was + received (after the service operation was called). **We recommend you consume + the HTTP response body as soon as possible for these types of operations.** +* You did not close a previously-received response body. This can cause + connection resets on certain platforms. **You MUST close any `io.ReadCloser` + instances provided in an operation's response, regardless of whether you + consume its contents.** + +Beyond that, try running a tcpdump for an affected connection at the edge of +your network (e.g. after any proxies that you control). If you see that the AWS +endpoint seems to be sending a TCP RST, you should use the AWS support console +to open a case against the offending service. Be prepared to provide request +IDs and specific timestamps of when the issue occured. + +### Why am I getting "invalid signature" errors when using an HTTP proxy with the SDK? + +The signature algorithm for AWS services (generally sigv4) is tied to the +serialized request's headers, more specifically most headers prefixed with +`X-`. Proxies are prone to modifying the outgoing request by adding additional +forwarding information (often via an `X-Forwarded-For` header) which +effectively breaks the signature that the SDK calculated. + +If you're using an HTTP proxy and experiencing signature errors, you should +work to capture the request **as it appears outgoing from the proxy** and +determine whether it is different. + diff --git a/content/en/docs/faq/timing-operations.md b/content/en/docs/faq/timing-operations.md new file mode 100644 index 00000000000..ff8fc33893a --- /dev/null +++ b/content/en/docs/faq/timing-operations.md @@ -0,0 +1,247 @@ +--- +title: "Timing SDK operations" +linkTitle: "Timing Operations" +description: "How to perform basic instrumentation in the {{% alias sdk-go %}} to time SDK operations" +weight: 1 +--- + +When debugging timeout / latency issues in the SDK, it is critical to identify +the components of the operation lifecycle which are taking more time to execute +than expected. As a starting point, you will generally need to inspect the +timing breakdown between the overall operation call and the HTTP call itself. + +The following sample program implements a basic instrumentation probe in terms +of `smithy-go` middleware for SQS clients and demonstrates how it is used. The +probe emits the following information for each operation call: + +* AWS request ID +* service ID +* operation name +* operation invocation time +* http call time + +Each emitted message is prefixed with a unique (to a single operation) +"invocation ID" which is set at the beginning of the handler stack. + +The entry point for instrumentation is exposed as `WithOperationTiming`, which +is parameterized to accept a message handling function which will receive +instrumentation "events" as formatted strings. `PrintfMSGHandler` is provided +as a convenience which will simply dump messages to stdout. + +The service used here is interchangeable - ALL service client options accept +`APIOptions` and an `HTTPClient` as configuration. For example, +`WithOperationTiming` could instead be declared as: + +```go +func WithOperationTiming(msgHandler func(string)) func(*s3.Options) +func WithOperationTiming(msgHandler func(string)) func(*dynamodb.Options) +// etc. +``` + +If you change it, be sure to change the signature of the function it returns as +well. + +``` +import ( + "context" + "fmt" + "log" + "net/http" + "sync" + "time" + + awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware" + awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http" + "github.com/aws/aws-sdk-go-v2/config" + "github.com/aws/aws-sdk-go-v2/service/sqs" + "github.com/aws/smithy-go/middleware" + smithyrand "github.com/aws/smithy-go/rand" +) + +// WithOperationTiming instruments an SQS client to dump timing information for +// the following spans: +// - overall operation time +// - HTTPClient call time +// +// This instrumentation will also emit the request ID, service name, and +// operation name for each invocation. +// +// Accepts a message "handler" which is invoked with formatted messages to be +// handled externally, you can use the declared PrintfMSGHandler to simply dump +// these values to stdout. +func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) { + return func(o *sqs.Options) { + o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler)) + o.HTTPClient = &timedHTTPClient{ + client: awshttp.NewBuildableClient(), + msgHandler: msgHandler, + } + } +} + +// PrintfMSGHandler writes messages to stdout. +func PrintfMSGHandler(msg string) { + fmt.Printf("%s\n", msg) +} + +type invokeIDKey struct{} + +func setInvokeID(ctx context.Context, id string) context.Context { + return middleware.WithStackValue(ctx, invokeIDKey{}, id) +} + +func getInvokeID(ctx context.Context) string { + id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string) + return id +} + +func timeSpan(ctx context.Context, name string, consumer func(string)) func() { + start := time.Now() + return func() { + elapsed := time.Now().Sub(start) + consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed)) + } +} + +type timedHTTPClient struct { + client *awshttp.BuildableClient + msgHandler func(string) +} + +func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) { + defer timeSpan(r.Context(), "http", c.msgHandler)() + + resp, err := c.client.Do(r) + if err != nil { + return nil, fmt.Errorf("inner client do: %v", err) + } + + return resp, nil +} + +type addInvokeIDMiddleware struct { + msgHandler func(string) +} + +func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" } + +func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( + out middleware.InitializeOutput, md middleware.Metadata, err error, +) { + id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID() + if err != nil { + return out, md, fmt.Errorf("new uuid: %v", err) + } + + return next.HandleInitialize(setInvokeID(ctx, id), in) +} + +type timeOperationMiddleware struct { + msgHandler func(string) +} + +func (*timeOperationMiddleware) ID() string { return "timeOperation" } + +func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( + middleware.InitializeOutput, middleware.Metadata, error, +) { + defer timeSpan(ctx, "operation", m.msgHandler)() + return next.HandleInitialize(ctx, in) +} + +type emitMetadataMiddleware struct { + msgHandler func(string) +} + +func (*emitMetadataMiddleware) ID() string { return "emitMetadata" } + +func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( + middleware.InitializeOutput, middleware.Metadata, error, +) { + out, md, err := next.HandleInitialize(ctx, in) + + invokeID := getInvokeID(ctx) + requestID, _ := awsmiddleware.GetRequestIDMetadata(md) + service := awsmiddleware.GetServiceID(ctx) + operation := awsmiddleware.GetOperationName(ctx) + m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID)) + m.msgHandler(fmt.Sprintf(`[%s] service = "%s"`, invokeID, service)) + m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation)) + + return out, md, err +} + +func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error { + return func(s *middleware.Stack) error { + if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil { + return fmt.Errorf("add time operation middleware: %v", err) + } + if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil { + return fmt.Errorf("add invoke id middleware: %v", err) + } + if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil { + return fmt.Errorf("add emit metadata middleware: %v", err) + } + return nil + } +} + +func main() { + cfg, err := config.LoadDefaultConfig(context.Background()) + if err != nil { + log.Fatal(fmt.Errorf("load default config: %v", err)) + } + + svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler)) + + var wg sync.WaitGroup + + for i := 0; i < 6; i++ { + wg.Add(1) + go func() { + defer wg.Done() + + _, err = svc.ListQueues(context.Background(), nil) + if err != nil { + fmt.Println(fmt.Errorf("list queues: %v", err)) + } + }() + } + wg.Wait() +} +``` + +A sample output of this program: + +``` +[e9a801bb-c51d-45c8-8e9f-a202e263fde8] http: 192.24067ms +[e9a801bb-c51d-45c8-8e9f-a202e263fde8] requestID = "dbee3082-96a3-5b23-adca-6d005696fa94" +[e9a801bb-c51d-45c8-8e9f-a202e263fde8] service = "SQS" +[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation = "ListQueues" +[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation: 193.098393ms +[0740f0e0-953e-4328-94fc-830a5052e763] http: 195.185732ms +[0740f0e0-953e-4328-94fc-830a5052e763] requestID = "48b301fa-fc9f-5f1f-9007-5c783caa9322" +[0740f0e0-953e-4328-94fc-830a5052e763] service = "SQS" +[0740f0e0-953e-4328-94fc-830a5052e763] operation = "ListQueues" +[0740f0e0-953e-4328-94fc-830a5052e763] operation: 195.725491ms +[c0589832-f351-4cc7-84f1-c656eb79dbd7] http: 200.52383ms +[444030d0-6743-4de5-bd91-bc40b2b94c55] http: 200.525919ms +[c0589832-f351-4cc7-84f1-c656eb79dbd7] requestID = "4a73cc82-b47b-56e1-b327-9100744e1b1f" +[c0589832-f351-4cc7-84f1-c656eb79dbd7] service = "SQS" +[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation = "ListQueues" +[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation: 201.214365ms +[444030d0-6743-4de5-bd91-bc40b2b94c55] requestID = "ca1523ed-1879-5610-bf5d-7e6fd84cabee" +[444030d0-6743-4de5-bd91-bc40b2b94c55] service = "SQS" +[444030d0-6743-4de5-bd91-bc40b2b94c55] operation = "ListQueues" +[444030d0-6743-4de5-bd91-bc40b2b94c55] operation: 201.197071ms +[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] http: 206.449568ms +[12b2b39d-df86-4648-a436-ff0482d13340] http: 206.526603ms +[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] requestID = "64229710-b552-56ed-8f96-ca927567ec7b" +[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] service = "SQS" +[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation = "ListQueues" +[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation: 207.252357ms +[12b2b39d-df86-4648-a436-ff0482d13340] requestID = "76d9cbc0-07aa-58aa-98b7-9642c79f9851" +[12b2b39d-df86-4648-a436-ff0482d13340] service = "SQS" +[12b2b39d-df86-4648-a436-ff0482d13340] operation = "ListQueues" +[12b2b39d-df86-4648-a436-ff0482d13340] operation: 207.360621ms +``` diff --git a/content/en/docs/making-requests.md b/content/en/docs/making-requests.md index 9fd244d95f8..93fe1245f91 100644 --- a/content/en/docs/making-requests.md +++ b/content/en/docs/making-requests.md @@ -233,14 +233,21 @@ if err != nil { For more information on error handling, including how to inspect for specific error types, see the [Handling Errors]({{% ref "handling-errors.md" %}}) documentation. -#### Responses with io.ReadCloser +#### Responses with `io.ReadCloser` -Some API operations return a response struct that contain an output member that is an `io.ReadCloser`. If you're making -requests with these operations, always be sure to call `io.ReadCloser` member's `Close` method after you've completed -reading the content. +Some API operations return a response struct that contain an output member that +is an `io.ReadCloser`. This will be the case for API operations that expose +some element of their output in the body of the HTTP response itself. -For example {{% alias service=S3 %}} `GetObject` operation returns a response -whose `Body` member is an `io.ReadCloser`: +For example, {{% alias service=S3 %}} `GetObject` operation returns a response +whose `Body` member is an `io.ReadCloser` for accessing the object payload. + +{{% pageinfo color="warning" %}} +**You MUST ALWAYS `Close()` any `io.ReadCloser` output members, regardless of +whether you've consumed its content. Failure to do so can leak resources and +potentially create issues with reading response bodies for operations called in +the future.** +{{% /pageinfo %}} ```go resp, err := s3svc.GetObject(context.TODO(), &s3.GetObjectInput{...})