From aa70a99eac48cad92dfe6a0904f40e831b0f9d50 Mon Sep 17 00:00:00 2001
From: Luc Talatinian <lucix@amazon.com>
Date: Thu, 21 Dec 2023 11:24:35 -0500
Subject: [PATCH] docs: add FAQ/troubleshooting and operation timing guide

---
 content/en/docs/faq/_index.md            |  86 ++++++++
 content/en/docs/faq/timing-operations.md | 247 +++++++++++++++++++++++
 content/en/docs/making-requests.md       |  19 +-
 3 files changed, 346 insertions(+), 6 deletions(-)
 create mode 100644 content/en/docs/faq/_index.md
 create mode 100644 content/en/docs/faq/timing-operations.md

diff --git a/content/en/docs/faq/_index.md b/content/en/docs/faq/_index.md
new file mode 100644
index 00000000000..e208e4a1c83
--- /dev/null
+++ b/content/en/docs/faq/_index.md
@@ -0,0 +1,86 @@
+---
+title: "Frequently Asked Questions"
+linkTitle: "FAQ / Troubleshooting"
+description: "Answers to some commonly-asked questions about the {{% alias sdk-go %}}"
+weight: 9
+---
+
+### How do I configure my SDK's HTTP client? Are there any guidelines or best practices?
+
+We are unable to provide guidance to customers on how to configure their HTTP
+workflow in a manner that is most effective for their particular workload. The
+answer to this is the product of a multivariate equation, with input factors
+including but not limited to:
+
+* the network footprint of the application (TPS, throughput, etc.)
+* the services being used
+* the compute characteristics of the deployment
+* the geographical nature of the deployment
+* the desired application behavior or needs of the application itself (SLAs,
+  timings, etc.)
+
+### How should I configure operation timeouts?
+
+Much like the previous question, it depends. **The answer to this question,
+however, should almost NEVER be based on pure empirical observation** - e.g. "I
+made 1000 calls to this operation, it took at most 5 seconds so I will set the
+timeout based on that with a safety factor of 2x to 10 seconds". Environment
+conditions can change and these types of assumptions can become wrong without
+warning, leaving you open to the possibility of self-induced production pain.
+
+### Requests made by the SDK are timing out or taking too long, how do I fix this?
+
+We are unable to assist with extended or timed-out operation calls due to
+extended HTTP round-trip times. An HTTP round-trip is defined as the time
+between the first byte of the request being sent and the last byte of the
+response being received. Note that in certain SDKs this can differ from "HTTP
+client call time", which is generally the sum of round-trip time and other
+bootstrapping operations such as TCP connection management.
+
+If you are experiencing issues due to operation latency or timeouts, your first
+course of action should be to obtain telemetry of the SDK operation lifecycle
+to determine the timing breakdown between the HTTP round-trip and the
+surrounding overhead of the operation. See the guide on
+[timing SDK operations]({{< ref "/docs/faq/timing-operations.md" >}}),
+which contains a reusable code snippet that can achieve this.
+
+### How do I fix a `read: connection reset` error?
+
+The SDK retries any errors matching the `connection reset` pattern by default.
+However, this error can still occur in a context **outside** of the retry loop:
+certain service operations directly forward the API's HTTP response body to the
+caller to be consumed from the wire directly via `io.ReadCloser` (e.g.
+`GetObject`'s object payload). You may encounter this error when performing a
+`Read` on the response body.
+
+This error indicates that your host, the service or any intermediary party
+(e.g. NAT gateways, proxies, load balancers) closed the connection while
+attempting to read the response.
+
+This can occur for several reasons:
+* You did not consume the response body for time after the response itself was
+  received (after the service operation was called). **We recommend you consume
+  the HTTP response body as soon as possible for these types of operations.**
+* You did not close a previously-received response body. This can cause
+  connection resets on certain platforms. **You MUST close any `io.ReadCloser`
+  instances provided in an operation's response, regardless of whether you
+  consume its contents.**
+
+Beyond that, try running a tcpdump for an affected connection at the edge of
+your network (e.g. after any proxies that you control). If you see that the AWS
+endpoint seems to be sending a TCP RST, you should use the AWS support console
+to open a case against the offending service. Be prepared to provide request
+IDs and specific timestamps of when the issue occured.
+
+### Why am I getting "invalid signature" errors when using an HTTP proxy with the SDK?
+
+The signature algorithm for AWS services (generally sigv4) is tied to the
+serialized request's headers, more specifically most headers prefixed with
+`X-`. Proxies are prone to modifying the outgoing request by adding additional
+forwarding information (often via an `X-Forwarded-For` header) which
+effectively breaks the signature that the SDK calculated.
+
+If you're using an HTTP proxy and experiencing signature errors, you should
+work to capture the request **as it appears outgoing from the proxy** and
+determine whether it is different.
+
diff --git a/content/en/docs/faq/timing-operations.md b/content/en/docs/faq/timing-operations.md
new file mode 100644
index 00000000000..ff8fc33893a
--- /dev/null
+++ b/content/en/docs/faq/timing-operations.md
@@ -0,0 +1,247 @@
+---
+title: "Timing SDK operations"
+linkTitle: "Timing Operations"
+description: "How to perform basic instrumentation in the {{% alias sdk-go %}} to time SDK operations"
+weight: 1
+---
+
+When debugging timeout / latency issues in the SDK, it is critical to identify
+the components of the operation lifecycle which are taking more time to execute
+than expected. As a starting point, you will generally need to inspect the
+timing breakdown between the overall operation call and the HTTP call itself.
+
+The following sample program implements a basic instrumentation probe in terms
+of `smithy-go` middleware for SQS clients and demonstrates how it is used. The
+probe emits the following information for each operation call:
+
+* AWS request ID
+* service ID
+* operation name
+* operation invocation time
+* http call time
+
+Each emitted message is prefixed with a unique (to a single operation)
+"invocation ID" which is set at the beginning of the handler stack.
+
+The entry point for instrumentation is exposed as `WithOperationTiming`, which
+is parameterized to accept a message handling function which will receive
+instrumentation "events" as formatted strings. `PrintfMSGHandler` is provided
+as a convenience which will simply dump messages to stdout.
+
+The service used here is interchangeable - ALL service client options accept
+`APIOptions` and an `HTTPClient` as configuration. For example,
+`WithOperationTiming` could instead be declared as:
+
+```go
+func WithOperationTiming(msgHandler func(string)) func(*s3.Options)
+func WithOperationTiming(msgHandler func(string)) func(*dynamodb.Options)
+// etc.
+```
+
+If you change it, be sure to change the signature of the function it returns as
+well.
+
+```
+import (
+    "context"
+    "fmt"
+    "log"
+    "net/http"
+    "sync"
+    "time"
+
+    awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
+    awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http"
+    "github.com/aws/aws-sdk-go-v2/config"
+    "github.com/aws/aws-sdk-go-v2/service/sqs"
+    "github.com/aws/smithy-go/middleware"
+    smithyrand "github.com/aws/smithy-go/rand"
+)
+
+// WithOperationTiming instruments an SQS client to dump timing information for
+// the following spans:
+//   - overall operation time
+//   - HTTPClient call time
+//
+// This instrumentation will also emit the request ID, service name, and
+// operation name for each invocation.
+//
+// Accepts a message "handler" which is invoked with formatted messages to be
+// handled externally, you can use the declared PrintfMSGHandler to simply dump
+// these values to stdout.
+func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) {
+    return func(o *sqs.Options) {
+        o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler))
+        o.HTTPClient = &timedHTTPClient{
+            client:     awshttp.NewBuildableClient(),
+            msgHandler: msgHandler,
+        }
+    }
+}
+
+// PrintfMSGHandler writes messages to stdout.
+func PrintfMSGHandler(msg string) {
+    fmt.Printf("%s\n", msg)
+}
+
+type invokeIDKey struct{}
+
+func setInvokeID(ctx context.Context, id string) context.Context {
+    return middleware.WithStackValue(ctx, invokeIDKey{}, id)
+}
+
+func getInvokeID(ctx context.Context) string {
+    id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string)
+    return id
+}
+
+func timeSpan(ctx context.Context, name string, consumer func(string)) func() {
+    start := time.Now()
+    return func() {
+        elapsed := time.Now().Sub(start)
+        consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed))
+    }
+}
+
+type timedHTTPClient struct {
+    client     *awshttp.BuildableClient
+    msgHandler func(string)
+}
+
+func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) {
+    defer timeSpan(r.Context(), "http", c.msgHandler)()
+
+    resp, err := c.client.Do(r)
+    if err != nil {
+        return nil, fmt.Errorf("inner client do: %v", err)
+    }
+
+    return resp, nil
+}
+
+type addInvokeIDMiddleware struct {
+    msgHandler func(string)
+}
+
+func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" }
+
+func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
+    out middleware.InitializeOutput, md middleware.Metadata, err error,
+) {
+    id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID()
+    if err != nil {
+        return out, md, fmt.Errorf("new uuid: %v", err)
+    }
+
+    return next.HandleInitialize(setInvokeID(ctx, id), in)
+}
+
+type timeOperationMiddleware struct {
+    msgHandler func(string)
+}
+
+func (*timeOperationMiddleware) ID() string { return "timeOperation" }
+
+func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
+    middleware.InitializeOutput, middleware.Metadata, error,
+) {
+    defer timeSpan(ctx, "operation", m.msgHandler)()
+    return next.HandleInitialize(ctx, in)
+}
+
+type emitMetadataMiddleware struct {
+    msgHandler func(string)
+}
+
+func (*emitMetadataMiddleware) ID() string { return "emitMetadata" }
+
+func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
+    middleware.InitializeOutput, middleware.Metadata, error,
+) {
+    out, md, err := next.HandleInitialize(ctx, in)
+
+    invokeID := getInvokeID(ctx)
+    requestID, _ := awsmiddleware.GetRequestIDMetadata(md)
+    service := awsmiddleware.GetServiceID(ctx)
+    operation := awsmiddleware.GetOperationName(ctx)
+    m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID))
+    m.msgHandler(fmt.Sprintf(`[%s] service   = "%s"`, invokeID, service))
+    m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation))
+
+    return out, md, err
+}
+
+func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error {
+    return func(s *middleware.Stack) error {
+        if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil {
+            return fmt.Errorf("add time operation middleware: %v", err)
+        }
+        if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil {
+            return fmt.Errorf("add invoke id middleware: %v", err)
+        }
+        if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil {
+            return fmt.Errorf("add emit metadata middleware: %v", err)
+        }
+        return nil
+    }
+}
+
+func main() {
+    cfg, err := config.LoadDefaultConfig(context.Background())
+    if err != nil {
+        log.Fatal(fmt.Errorf("load default config: %v", err))
+    }
+
+    svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler))
+
+    var wg sync.WaitGroup
+
+    for i := 0; i < 6; i++ {
+        wg.Add(1)
+        go func() {
+            defer wg.Done()
+
+            _, err = svc.ListQueues(context.Background(), nil)
+            if err != nil {
+                fmt.Println(fmt.Errorf("list queues: %v", err))
+            }
+        }()
+    }
+    wg.Wait()
+}
+```
+
+A sample output of this program:
+
+```
+[e9a801bb-c51d-45c8-8e9f-a202e263fde8] http: 192.24067ms
+[e9a801bb-c51d-45c8-8e9f-a202e263fde8] requestID = "dbee3082-96a3-5b23-adca-6d005696fa94"
+[e9a801bb-c51d-45c8-8e9f-a202e263fde8] service   = "SQS"
+[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation = "ListQueues"
+[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation: 193.098393ms
+[0740f0e0-953e-4328-94fc-830a5052e763] http: 195.185732ms
+[0740f0e0-953e-4328-94fc-830a5052e763] requestID = "48b301fa-fc9f-5f1f-9007-5c783caa9322"
+[0740f0e0-953e-4328-94fc-830a5052e763] service   = "SQS"
+[0740f0e0-953e-4328-94fc-830a5052e763] operation = "ListQueues"
+[0740f0e0-953e-4328-94fc-830a5052e763] operation: 195.725491ms
+[c0589832-f351-4cc7-84f1-c656eb79dbd7] http: 200.52383ms
+[444030d0-6743-4de5-bd91-bc40b2b94c55] http: 200.525919ms
+[c0589832-f351-4cc7-84f1-c656eb79dbd7] requestID = "4a73cc82-b47b-56e1-b327-9100744e1b1f"
+[c0589832-f351-4cc7-84f1-c656eb79dbd7] service   = "SQS"
+[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation = "ListQueues"
+[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation: 201.214365ms
+[444030d0-6743-4de5-bd91-bc40b2b94c55] requestID = "ca1523ed-1879-5610-bf5d-7e6fd84cabee"
+[444030d0-6743-4de5-bd91-bc40b2b94c55] service   = "SQS"
+[444030d0-6743-4de5-bd91-bc40b2b94c55] operation = "ListQueues"
+[444030d0-6743-4de5-bd91-bc40b2b94c55] operation: 201.197071ms
+[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] http: 206.449568ms
+[12b2b39d-df86-4648-a436-ff0482d13340] http: 206.526603ms
+[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] requestID = "64229710-b552-56ed-8f96-ca927567ec7b"
+[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] service   = "SQS"
+[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation = "ListQueues"
+[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation: 207.252357ms
+[12b2b39d-df86-4648-a436-ff0482d13340] requestID = "76d9cbc0-07aa-58aa-98b7-9642c79f9851"
+[12b2b39d-df86-4648-a436-ff0482d13340] service   = "SQS"
+[12b2b39d-df86-4648-a436-ff0482d13340] operation = "ListQueues"
+[12b2b39d-df86-4648-a436-ff0482d13340] operation: 207.360621ms
+```
diff --git a/content/en/docs/making-requests.md b/content/en/docs/making-requests.md
index 9fd244d95f8..93fe1245f91 100644
--- a/content/en/docs/making-requests.md
+++ b/content/en/docs/making-requests.md
@@ -233,14 +233,21 @@ if err != nil {
 For more information on error handling, including how to inspect for specific error types, see the
 [Handling Errors]({{% ref "handling-errors.md" %}}) documentation.
 
-#### Responses with io.ReadCloser
+#### Responses with `io.ReadCloser`
 
-Some API operations return a response struct that contain an output member that is an `io.ReadCloser`. If you're making
-requests with these operations, always be sure to call `io.ReadCloser` member's `Close` method after you've completed
-reading the content.
+Some API operations return a response struct that contain an output member that
+is an `io.ReadCloser`. This will be the case for API operations that expose
+some element of their output in the body of the HTTP response itself.
 
-For example {{% alias service=S3 %}} `GetObject` operation returns a response
-whose `Body` member is an `io.ReadCloser`:
+For example, {{% alias service=S3 %}} `GetObject` operation returns a response
+whose `Body` member is an `io.ReadCloser` for accessing the object payload.
+
+{{% pageinfo color="warning" %}}
+**You MUST ALWAYS `Close()` any `io.ReadCloser` output members, regardless of
+whether you've consumed its content. Failure to do so can leak resources and
+potentially create issues with reading response bodies for operations called in
+the future.**
+{{% /pageinfo %}}
 
 ```go
 resp, err := s3svc.GetObject(context.TODO(), &s3.GetObjectInput{...})