Dial i/o timeouts while connecting #109

Open
adamconnelly opened this issue Apr 22, 2022 · 3 comments

@adamconnelly

I'm periodically seeing connection failures when trying to connect to an Aurora Serverless instance. Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems. The error messages look something like this:

failed to connect to `host=xyz user=xyz database=xyz`: dial error (dial tcp x.x.x.x:5432: i/o timeout)

We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

The connection attempt times out after 60 seconds, which makes sense because of this line, and the error message is coming from here.

While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm happy to experiment with this, but I just wanted to ask first since I'm not mega familiar with Go SQL drivers.

Thanks in advance!

@jackc
Owner

jackc commented Apr 22, 2022

> Most connection attempts are successful, but occasionally we get errors, making me think that there's some underlying network / database issue causing the problems.

That error message does seem to indicate a network or server issue.

> We're using v1.7.0 of pgconn and v4.9.0 of pgx. I know these aren't the latest versions, so we can definitely look at updating if there's anything that's likely to help with this issue.

That is pretty old, but off the top of my head I don't recall any changes that would affect this.

> While investigating this, I noticed there's connection retry logic in the Go sql package, for example here. It automatically retries connecting if driver.ErrBadConn is returned. I guess what I'm wondering is would it make sense to return ErrBadConn when a dial timeout happens? Obviously this doesn't solve the underlying issue, but it might mitigate the problem assuming it's transient.

I'm not absolutely sure, but I think the ErrBadConn logic only applies when you have an existing connection. I don't see how it would be effective during the dialing process.

@adamconnelly
Author

Thanks for the reply - I'll post an update if I find out anything interesting through testing.

@adamconnelly
Author

@jackc I did some more investigation, and I'm pretty certain that the ErrBadConn approach would work. It certainly resulted in retries for dial failures.

However, in the end I've actually taken the approach of replacing the DialFunc using something like this:

wrappedDial := config.DialFunc
config.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
	var conn net.Conn
	var err error
	for i := 0; i < pgMaxDialAttempts; i++ {
		ok := func() bool {
			// We're manually enforcing a dial timeout here rather than relying on connect_timeout
			// in the connection string because the connect_timeout applies to the full connection
			// process, meaning that any dial retries would fail because the context has already expired.
			ctx, cancel := context.WithTimeout(ctx, time.Second*5)
			defer cancel()
			conn, err = wrappedDial(ctx, network, addr)

			return err == nil
		}()

		if ok {
			break
		}
	}

	return conn, err
}

That seems to have worked (in that an initial dial times out after 5 seconds, but a subsequent dial succeeds), although unfortunately since implementing it I've only seen one example of the failure, so it's difficult to be certain.

I'm happy to close the issue if you want since I've got a workaround now, but just figured the info could be useful.
