Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misalignment between custom botocore.config retry parameters and the time it takes to raise ConnectTimeoutError #3298

Open
1 task
mgilar opened this issue Nov 12, 2024 · 2 comments
Assignees
Labels
bug This issue is a confirmed bug. cloudwatch p3 This is a minor priority issue

Comments

@mgilar
Copy link

mgilar commented Nov 12, 2024

Describe the bug

When connection cannot be stablished using a CloudWatch boto3 client, the time it takes to raise the ConnectTimeoutError does not match the time it should take given the custom parameters defined using the botocore.config Config class.

The time to raise the connection error consistently takes three times the max. number of retrials or max timeout time.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Given the following custom configuration to be applied in a CloudWatch boto3 client:

import boto3
from botocore.config import Config

custom_config = Config(
    connect_timeout=1,  # Time to establish connection
    read_timeout=1,  # Time to wait for a response after connection
    retries={
        "total_max_attempts": 1,
        "mode": "standard"
    },
)

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1", config=custom_config)

After testing the retrials behaviour when the connection cannot be stablished, with different parameter configurations I obtain the following execution times (time difference between cloudwatch.PutMetricData is logged and the ConnectTimeoutError is raised):

connect_timeout = 1
read_timeout = 1
total_max_attempts = 1
Execution time: ~1s

connect_timeout = 1
read_timeout = 1
total_max_attempts = 2
Execution time: ~2s

connect_timeout = 1
read_timeout = 1
total_max_attempts = 3
Execution time: ~3s

connect_timeout = 2
read_timeout = 1
total_max_attempts = 1
Execution time: ~2s

Current Behavior

After testing the retrials behaviour when the connection cannot be stablished, with different parameter configurations I obtain the following execution times (time difference between cloudwatch.PutMetricData is logged and the ConnectTimeoutError is raised):

connect_timeout = 1
read_timeout = 1
total_max_attempts = 1
Execution time: ~3s

connect_timeout = 1
read_timeout = 1
total_max_attempts = 2
Execution time: ~7s

connect_timeout = 1
read_timeout = 1
total_max_attempts = 3
Execution time: ~11s

connect_timeout = 2
read_timeout = 1
total_max_attempts = 1
Execution time: ~6s

Reproduction Steps

Create the following custom configuration to be applied in a CloudWatch boto3 client:

import boto3
from botocore.config import Config

custom_config = Config(
    connect_timeout=1,  # Time to establish connection
    read_timeout=1,  # Time to wait for a response after connection
    retries={
        "total_max_attempts": 1,
        "mode": "standard"
    },
)

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1", config=custom_config)

Execute the code from an AWS Service that does not have permissions to access/write Cloudwatch metrics (in my case it was a Sagemaker endpoint).

Package versions

botocore = 1.35.58
boto3 = 1.35.58
urllib3 = 2.2.3

Possible Solution

It looks like the source of the problem (see additional info/context to check full error logs) is related to urllib connection session parameters not being overwritten by custom retry parameters. Therefore urllib connection is executed as many times as its default parameter: 3.

Additional Information/Context

The full error obtained when connection cannot be established is:

2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - conn.connect()
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 693, in connect
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.sock = sock = self._new_conn()
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 208, in _new_conn
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ConnectTimeoutError(
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f058020aad0>, 'Connection to monitoring.eu-west-1.amazonaws.com timed out. (connect timeout=2)')
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/endpoint.py", line 279, in _do_get_response
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - http_response = self._send(request)
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^^^^
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/endpoint.py", line 383, in _send
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self.http_session.send(request)
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/httpsession.py", line 499, in send
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ConnectTimeoutError(endpoint_url=request.url, error=e)
2024-11-12T12:28:47,744 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://monitoring.eu-west-1.amazonaws.com/"
2024-11-12T12:28:47,745 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Event needs-retry.cloudwatch.PutMetricData: calling handler <bound method RetryHandler.needs_retry of <botocore.retries.standard.RetryHandler object at 0x7f0580287910>>
2024-11-12T12:28:47,745 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Max attempts of 1 reached.
2024-11-12T12:28:47,745 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Not retrying request.
2024-11-12T12:28:47,745 [INFO ] W-04fd8f971b1bfc2c8d11abb1b-1-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Failed to save custom metric with error: Connect timeout on endpoint URL: "https://monitoring.eu-west-1.amazonaws.com/"

SDK version used

1.35.58

Environment details (OS name and version, etc.)

Linux

@mgilar mgilar added bug This issue is a confirmed bug. needs-triage This issue or PR still needs to be triaged. labels Nov 12, 2024
@adev-code adev-code self-assigned this Nov 12, 2024
@adev-code adev-code added investigating This issue is being investigated and/or work is in progress to resolve the issue. p3 This is a minor priority issue cloudwatch and removed p3 This is a minor priority issue needs-triage This issue or PR still needs to be triaged. labels Nov 12, 2024
@adev-code
Copy link

Hi @mgilar, thank you for reaching out. I was able to get the same error ConnectTimeoutError using your sample code and set up (CloudWatch and SageMaker) but in order to get the error I had to set my "connect_timeout" and "read_timeout" very low as possible (for example 0.000001). When it is at 1, it works fine. For a further look, could you please provide your used case for the test and also to provide the full debug logs by adding the line: boto3.set_stream_logger('') and redacting any sensitive information. And also if possible, to provide the full code that you may have and any further related logs. Thank you.

@adev-code adev-code added response-requested Waiting on additional info and feedback. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. labels Nov 13, 2024
@mgilar
Copy link
Author

mgilar commented Nov 20, 2024

Hi @adev-code, let me provide further details on the above mentioned issue:

  • I have deployed a MultiModel Sagemaker Endpoint with custom inference code. I created a boto3 client of cloudwatch within the model handler, as depicted in the bug description, and then use it to log model usage information such as:
import boto3
from botocore.config import Config

boto3.set_stream_logger("", logging.DEBUG)

custom_config = Config(
            connect_timeout=event.get("connect_timeout", 1),  # Time to establish connection
            read_timeout=event.get(
                "read_timeout", 1
            ),  # Time to wait for a response after connection
            retries={
                "max_attempts": 0,
                "total_max_attempts": 1,
                "mode": "standard",
            },
        )
REGION = os.environ.get("AWS_REGION", "eu-west-2")
cloudwatch = boto3.client("cloudwatch", region_name=REGION, config=custom_config)

dimensions = [
            {"Name": "ModelName", "Value": model_name},
            {"Name": "TriggeringEntity", "Value": triggering_entity},
            {"Name": "StatusCode", "Value": str(status_code)},
            {"Name": "Method", "Value": method},
        ]
        response = cloudwatch.put_metric_data(
            Namespace="test",
            MetricData=[
                {
                    "MetricName": "ModelContext",
                    "Dimensions": dimensions,
                    "Timestamp": datetime.utcnow(),
                    "Value": 1,
                    "Unit": "Count",
                    "StorageResolution": 60,
                }
            ],
        )
        # Check response for success
        if response["ResponseMetadata"]["HTTPStatusCode"] == 200:
            log.info("Metric data successfully put to CloudWatch.")
        else:
            log.info("Failed to put metric data to CloudWatch.")

The obtained logs are:

2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen

2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - response = self._make_request(
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^^^^
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 490, in _make_request
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise new_e
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self._validate_conn(conn)
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - conn.connect()
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 693, in connect
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - self.sock = sock = self._new_conn()
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/urllib3/connection.py", line 208, in _new_conn
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ConnectTimeoutError(
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - urllib3.exceptions.ConnectTimeoutError: (<botocore.awsrequest.AWSHTTPSConnection object at 0x7f3bcbd31ad0>, 'Connection to monitoring.eu-west-1.amazonaws.com timed out. (connect timeout=1)')
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - During handling of the above exception, another exception occurred:
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Traceback (most recent call last):
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/endpoint.py", line 279, in _do_get_response
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - http_response = self._send(request)
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^^^^
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/endpoint.py", line 383, in _send
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - return self.http_session.send(request)
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - File "/venv/lib64/python3.11/site-packages/botocore/httpsession.py", line 499, in send
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - raise ConnectTimeoutError(endpoint_url=request.url, error=e)
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://monitoring.eu-west-1.amazonaws.com/"
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2024-11-20 09:45:47,485 botocore.hooks [DEBUG] Event needs-retry.cloudwatch.PutMetricData: calling handler <bound method RetryHandler.needs_retry of <botocore.retries.standard.RetryHandler object at 0x7f3bc838e790>>
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2024-11-20 09:45:47,485 botocore.retries.standard [DEBUG] Max attempts of 1 reached.
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2024-11-20 09:45:47,485 botocore.retries.standard [DEBUG] Not retrying request.
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - { "component": "GetAIModelPrediction", "request_id": null, "timestamp": 1732095947000, "level": "ERROR", "type": "app", "msg": "Failed to save custom metric with error: Connect timeout on endpoint URL: "https://monitoring.eu-west-1.amazonaws.com/\"" }
2024-11-20T09:45:47,487 [WARN ] W-82500c51c9d78af3b3ebaff28-8-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - 2024-11-20 09:45:47,486 libawsutils.ailogger [ERROR] Failed to save custom metric with error: Connect timeout on endpoint URL: "https://monitoring.eu-west-1.amazonaws.com/"

Let me know if I can provide further information

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. label Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a confirmed bug. cloudwatch p3 This is a minor priority issue
Projects
None yet
Development

No branches or pull requests

2 participants