Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust listener close in riverpgxv5 #246

Merged
merged 1 commit into from
Mar 2, 2024

Conversation

brandur
Copy link
Contributor

@brandur brandur commented Mar 1, 2024

As part of #239, we're observing the possibility of the notifier going
into a hot loop where it's trying to reopen a listener connection, but
the listener won't let it because it thinks the connection is already
open.

A suspect is the listener's Close implementation, which in the event
of an error, returns the error and fails to release an underlying
connection, putting it into a state where it's never reusable.

Here, modify Close so that it always releases and unsets an underlying
connection regardless of the error state returned.

I tried to add a test case for this, but reading through pgx and net
source code, I couldn't find any way to simulate an error from Close
(I thought a cancelled context would do it, but it does not), so I had
to leave it.

Thanks @mfrister for pointing this one out!

Fixes #248.

@brandur brandur mentioned this pull request Mar 1, 2024
@brandur brandur requested a review from bgentry March 1, 2024 14:51
@brandur
Copy link
Contributor Author

brandur commented Mar 1, 2024

@bgentry Seriously tried to add a test case for this, but couldn't find a way to make Close on the PG conn return an error. The only way to do it I think would be to add additional test-only interface functions to inject state, which would be kind of gross. Thoughts?

@mfrister
Copy link

mfrister commented Mar 1, 2024

Thanks for the fix!

I've deployed a version of my worker with riverpgxv5 from this branch and will report on Monday whether the issue appeared again. So far it looks good.

Enjoy your weekend!

@Rshep3087
Copy link

There is this package from pgx https://github.com/jackc/pgx/blob/master/pgxtest/pgxtest.go. You can overwrite the CloseFunc on the test runner I believe.

@brandur brandur force-pushed the brandur-robust-listener-close branch from 8ba4e4c to f566e2c Compare March 1, 2024 15:31
@brandur
Copy link
Contributor Author

brandur commented Mar 1, 2024

@mfrister Excellent. Thanks!

@Rshep3087 Hm, just reading through this module, I believe ConnTestRunner is more of a helper for controlling the flow of test runs rather than something that can act as a mock for a pgx.Conn.

@brandur brandur force-pushed the brandur-robust-listener-close branch 2 times, most recently from cc2c92e to 45d7f15 Compare March 1, 2024 15:51
As part of #239, we're observing the possibility of the notifier going
into a hot loop where it's trying to reopen a listener connection, but
the listener won't let it because it thinks the connection is already
open.

A suspect is the listener's `Close` implementation, which in the event
of an error, returns the error and fails to release an underlying
connection, putting it into a state where it's never reusable.

Here, modify `Close` so that it always releases and unsets an underlying
connection regardless of the error state returned.

I tried to add a test case for this, but reading through pgx and net
source code, I couldn't find any way to simulate an error from `Close`
(I thought a cancelled context would do it, but it does not), so I had
to leave it.
@brandur brandur force-pushed the brandur-robust-listener-close branch from 45d7f15 to 02267da Compare March 1, 2024 15:52
Copy link
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only ways I can think of to test this are (a) use an interface over the conn here, which feels excessive, or (b) somehow force the server to uncleanly terminate the conn or reply with an error upon close.

I'm fine with not going forward with one of those approaches at this time, they seem to heavy for the benefit at this stage. Thanks for the quick fix :shipit:

Comment on lines -478 to -480
if err := l.conn.Conn().Close(ctx); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, I noticed this change when I reviewed the mega-PR, should have said something 😞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh. Well, I was going to say "next time", but hopefully nothing like that PR ever happens again :)

@brandur
Copy link
Contributor Author

brandur commented Mar 2, 2024

The only ways I can think of to test this are (a) use an interface over the conn here, which feels excessive, or (b) somehow force the server to uncleanly terminate the conn or reply with an error upon close.

Yeah, I had the same reactions.

I think (a) is plausible, but has the downside that it makes the code a little more painful to use. Rather than being able to easily jump easily into concrete implementations, you now have to jump through the interface, which is always a bit annoying.

And RE: (b), I read through a lot of pgx source around Close, and man, as far as I can tell, it is quite difficult to ever get it to return an error. The underlying net.Conn under the pgx connection which is under the pgx pool connection will under some cases, but I couldn't find an obvious way to trigger it, especially from this level.

I'm fine with not going forward with one of those approaches at this time, they seem to heavy for the benefit at this stage. Thanks for the quick fix :shipit:

Sweet. Yeah, that was my conclusion as well. Possible maybe to add some form of testing, but maybe more trouble than it's worth, and getting the fix out the door is the best compromise for the time being.

Thanks!

@brandur brandur merged commit cf178f7 into master Mar 2, 2024
8 of 10 checks passed
@brandur brandur deleted the brandur-robust-listener-close branch March 2, 2024 04:20
brandur added a commit that referenced this pull request Mar 2, 2024
Contains an important fix from #246 which resolves a problem wherein a
listener from the `riverpgxv5` driver couldn't be reused in cases where
`Close` returned an error.
@brandur brandur mentioned this pull request Mar 2, 2024
brandur added a commit that referenced this pull request Mar 2, 2024
Contains an important fix from #246 which resolves a problem wherein a
listener from the `riverpgxv5` driver couldn't be reused in cases where
`Close` returned an error.
brandur added a commit that referenced this pull request Mar 2, 2024
Contains an important fix from #246 which resolves a problem wherein a
listener from the `riverpgxv5` driver couldn't be reused in cases where
`Close` returned an error.
@dhermes
Copy link
Contributor

dhermes commented Mar 2, 2024

Regarding unit testing, I have a little bit of experience here. I hunted down a bug in lib/pq a few years back and ended up replicating the bug locally while crafting a fix by using a TCP proxy that would send an RST packet after some signal had been received (but before the actual PostgreSQL connection was closed).

From what I understand, your unit tests hit a REAL PostgreSQL instance, so you could run a proxy within a Go unit test and put that proxy in front of the PostgreSQL instance available to the unit tests.

🤞 I'd hope a plain old TCP error over a non-TLS connection would do the trick, but maybe not. In my log capture around the issue that this fixed (#248), it was actually a TLS protocol issue that showed up in the error:

{
    "level": "error",
    "time": "2024-03-01T23:56:08.679126682Z",
    "notifier": {
        "err": {
            "error": "tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.122.48.181:44034->10.122.30.240:5432: i/o timeout",
            "kind": "*fmt.wrapError",
            "stack": null
        }
    },
    "subsystem": "river",
    "message": "error closing listener"
}

However, if I had to guess I'd imagine the TCP connection is what had already been closed (e.g. via RST).

brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
@brandur
Copy link
Contributor Author

brandur commented Mar 2, 2024

@dhermes Thanks for the added detail there! I'm sort of hoping to avoid a full-blown proxy for the time being if we can possibly do it (looked at yours — looks good, but definitely a substantial amount of new code). You gave me the idea though of getting something more like an "in code" proxy going, and I found a fairly low touch way of simulating a Close error using a combination of pgxpool's DialFunc and a lightweight net.Conn mock. See #250.

type connStub struct {
	net.Conn

	closeFunc func() error
}

func newConnStub(conn net.Conn) *connStub {
	return &connStub{
		Conn: conn,

		closeFunc: conn.Close,
	}
}

func (c *connStub) Close() error {
	return c.closeFunc()
}
		var connStub *connStub

		config := testPoolConfig()
		config.ConnConfig.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
			// Dialer settings come from pgx's default internal one (not exported unfortunately).
			conn, err := (&net.Dialer{KeepAlive: 5 * time.Minute}).Dial(network, addr)
			if err != nil {
				return nil, err
			}

			connStub = newConnStub(conn)
			return connStub, nil
		}

brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
brandur added a commit that referenced this pull request Mar 2, 2024
Here, follow up #246 by adding a few more tests that verify a listener's
state after `Close` has been invoked, including if it returned an error,
which we're able to simulate by overriding pgx's `DialFunc` and
returning a stand-in stub for an underlying `net.Conn`.

Also, remove the explicit `Close` call on an underlying connection in
favor of just invoking the pool's `Release` function. In case of an
error condition, `Release` will detect that and do the right thing, and
pgx is better tested/vetted to make sure that right thing happens.
@mfrister
Copy link

mfrister commented Mar 4, 2024

As promised above:

I've deployed a version of my worker with riverpgxv5 from this branch and will report on Monday whether the issue appeared again. So far it looks good.

The problem hasn't reappeared on my worker over the weekend, as expected.

@brandur
Copy link
Contributor Author

brandur commented Mar 4, 2024

@mfrister That's great to hear!! Thanks for checking in about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possible connection re-use issue in 0.0.24
5 participants