Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timeout error not show which fallback was used #1942

Closed
laskoviymishka opened this issue Mar 13, 2024 · 14 comments
Closed

Timeout error not show which fallback was used #1942

laskoviymishka opened this issue Mar 13, 2024 · 14 comments

Comments

@laskoviymishka
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Same as here: jackc/pgconn#139
Given: Cluster with ipv4 and ipv6 address.
when there is no network connection we should see what exact address we failed to connect

Describe the solution you'd like

Show IP-address that failed as part of error message.

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: dial tcp [2a05:d01c:d05:6500:3318:e752:f216:3d9b]:5432: i/o timeout)

Describe alternatives you've considered

Leave it as is

failed to connect to `host=chinook.chiyepnb01it.eu-west-2.rds.amazonaws.com user=tutorial database=chinook`: dial error (timeout: context Deadline exceeded)

Additional context

This error also fixed at v4 driver - jackc/pgconn#140

@laskoviymishka laskoviymishka changed the title Timeout error not show wjich fallback was used Timeout error not show which fallback was used Mar 13, 2024
laskoviymishka added a commit to laskoviymishka/pgx that referenced this issue Mar 13, 2024
Fixes: jackc#1942

Once we face a timeout we should expose address, it can be for example ipv6 or ipv4 and knowing what exact address we took for resolver extreamly helpful for triaging a networking issue.

This issue is also fixed in pgx/v4, see: jackc/pgconn#139
@jackc
Copy link
Owner

jackc commented Mar 16, 2024

I think this is fundamentally the same issue as #1929. And I think it should be solved by the same approach. Connect should return some sort of multi-error that includes the errors from all the connection attempts.

@laskoviymishka
Copy link
Contributor Author

This error is an exception, so solving same way as #1929 - is incorrect, since we try host sequential, once we reach first timeout all other host will receive same errTimeout automatically.

@laskoviymishka
Copy link
Contributor Author

laskoviymishka commented Mar 16, 2024

In the meantime using parallel conn checker would break dual-stack logic.

I tried to implement parallel connect. In this approach we will use the ipv4 address and start working with it, so in theory it's okay, but the psql does not work like this, so I would say that the current approach with a sequential check is right, but the error message can be improved.

@jackc
Copy link
Owner

jackc commented Mar 17, 2024

Yeah, I don't think a parallel connection attempt is a good idea. And I don't think the other PR that is proposed for #1929 is right either. But if the error message was a multi error that included all the attempts, I think that would solve this and #1929.

@laskoviymishka
Copy link
Contributor Author

Nah, multiple errors would be equal to no error, you will see 2 hosts that errored but in fact only one is not reachable.
Maybe it's a good idea to stop iterations after nor errTimeout but after errTimeout with context deadline exceeded. Will try to debug such impl.

@jackc
Copy link
Owner

jackc commented May 11, 2024

FWIW, multi errors are implemented in 8db9716.

This will produce errors like:

    failed to connect to `user=postgres database=pgx_test`:
            lookup foo.invalid: no such host
            [::1]:1 (localhost): dial error: dial tcp [::1]:1: connect: connection refused
            127.0.0.1:1 (localhost): dial error: dial tcp 127.0.0.1:1: connect: connection refused
            127.0.0.1:2 (127.0.0.1): dial error: dial tcp 127.0.0.1:2: connect: connection refused

@laskoviymishka
Copy link
Contributor Author

laskoviymishka commented May 11, 2024

This is a bit wrong, per what I see in commit, what happens when ipv4 connectivity is ok, but ipv6 is blocked is an issue.
We shall return ipv6 conn refused, but not ipv4.

@jackc
Copy link
Owner

jackc commented May 11, 2024

If ipv4 worked then no error would be returned. The error list is only returned if no attempt succeeds.

@laskoviymishka
Copy link
Contributor Author

Then this violate original behavior of ip versions priority, for example psql cli (and any known to me driver) will ignore ipv4 if ipv6 available on host, so connecting here would be a strange behavior.

@jackc
Copy link
Owner

jackc commented May 12, 2024

This is matching the behavior of libpq. See https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-MULTIPLE-HOSTS.

In either format, a single host name can translate to multiple network addresses. A common example of this is a host that has both an IPv4 and an IPv6 address.

When multiple hosts are specified, or when a single host name is translated to multiple addresses, all the hosts and addresses will be tried in order, until one succeeds. If none of the hosts can be reached, the connection fails. If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.

This is exactly what pgx is trying to match.


Furthermore, there is no change in connection behavior. The only change 8db9716 made was to report the results of all attempts when none of them succeeded.

@laskoviymishka
Copy link
Contributor Author

laskoviymishka commented May 12, 2024

Then looks like this closes the original issue.

Enviroment:

host chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has address 10.0.101.236
chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com has IPv6 address 2a05:d018:471:2f50:9396:217d:8dcc:c73

ubuntu@ip-10-0-1-121:~$ telnet chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com 5432
Trying 2a05:d018:471:2f50:9396:217d:8dcc:c73...
^C
ubuntu@ip-10-0-1-121:~$ telnet 10.0.101.236 5432
Trying 10.0.101.236...
Connected to 10.0.101.236.
Escape character is '^]'.

With incorrect db-name error is not network related:

ubuntu@ip-10-0-1-121:~$  DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password connect_timeout=2" && ./todo 

Unable to connection to database: failed to connect to `user=cdcdb_admin database=`:
        [2a05:d018:471:2f50:9396:217d:8dcc:c73]:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): dial error: timeout: context deadline exceeded
        10.0.101.236:5432 (chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com): server error: FATAL: database "cdcdb_admin" does not exist (SQLSTATE 3D000)

With correct DB URL:

ubuntu@ip-10-0-1-121:~$ DATABASE_URL="host=chinook.cygmty5fevxe.eu-west-1.rds.amazonaws.com user=cdcdb_admin password=Password database=chinook connect_timeout=2" && ./todo 

Todo pgx demo

Usage:

  todo list
  todo add task
  todo update task_num item
  todo remove task_num

Example:

  todo add 'Learn Go'
  todo list

@laskoviymishka
Copy link
Contributor Author

laskoviymishka commented May 12, 2024

@jackc will you backport this to old v4 driver? (some people still use it, for example me :D)

@jackc
Copy link
Owner

jackc commented May 15, 2024

@laskoviymishka

Personally, I don't plan on porting back to v4. I'm not sure how far that part of the code has diverged between v4 and v5 and I'm typically only doing bug fixes on v4. But if someone else wants to do the work I don't mind merging it.

@laskoviymishka
Copy link
Contributor Author

Okay, I'll try to backport and if it goes well - will open PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants