Dial i/o timeouts while connecting #109

adamconnelly · 2022-04-22T13:39:40Z

I'm periodically seeing connection failures when trying to connect to an Aurora Serverless instance. Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems. The error messages look something like this:

failed to connect to `host=xyz user=xyz database=xyz`: dial error (dial tcp x.x.x.x:5432: i/o timeout)

We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

The connection attempt times out after 60 seconds, which makes sense because of this line, and the error message is coming from here.

While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm happy to experiment with this, but I just wanted to ask first since I'm not mega familiar with Go SQL drivers.

Thanks in advance!


jackc · 2022-04-22T20:59:27Z

> Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems.

That error message does seem to indicate a network or server issue.

> We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

That is pretty old, but off the top of my head I don't recall any changes that would affect this.

> While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm not absolutely sure but I think the ErrBadConn logic only applies when you have an existing connection. I don't see how that would be effective in the dialing process.

adamconnelly · 2022-04-26T08:18:11Z

Thanks for the reply - I'll post an update if I find out anything interesting through testing.

adamconnelly · 2022-05-13T16:11:45Z

@jackc I did some more investigation, and I'm pretty certain that the ErrBadConn approach would work. It certainly resulted in retries for dial failures.

However, in the end I've actually taken the approach of replacing the DialFunc using something like this:

wrappedDial := config.DialFunc
config.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
	var conn net.Conn
	var err error
	for i := 0; i < pgMaxDialAttempts; i++ {
		ok := func() bool {
			// We're manually enforcing a dial timeout here rather than relying on connect_timeout
			// in the connection string because the connect_timeout applies to the full connection
			// process, meaning that any dial retries would fail because the context has already expired.
			ctx, cancel := context.WithTimeout(ctx, time.Second*5)
			defer cancel()
			conn, err = wrappedDial(ctx, network, addr)

			return err == nil
		}()

		if ok {
			break
		}
	}

	return conn, err
}

That seems to have worked (in that an initial dial times out after 5 seconds, but a subsequent dial succeeds), although unfortunately since implementing it I've only seen one example of the failure, so it's difficult to be certain.

I'm happy to close the issue if you want since I've got a workaround now, but just figured the info could be useful.
