Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No backoff if PoolEntryCreator is interrupted #2265

Open
PaoloVoronini opened this issue Nov 20, 2024 · 5 comments
Open

No backoff if PoolEntryCreator is interrupted #2265

PaoloVoronini opened this issue Nov 20, 2024 · 5 comments

Comments

@PaoloVoronini
Copy link

PaoloVoronini commented Nov 20, 2024

https://github.com/brettwooldridge/HikariCP/blob/ffdb8106bccb285d1775af1e8b040bfce4588dc1/src/main/java/com/zaxxer/hikari/pool/HikariPool.java#L758C19-L758C31

Method UtilityElf.quietlySleep() retains the interrupted flag
As a result PoolEntryCreator does not sleep at all - i.e. no backoff.
Possibly the interrupted flag also prevents creating new pool entry.

while (shouldContinueCreating()) {
   final var poolEntry = createPoolEntry();
   if (poolEntry != null) {
      added = true;
      connectionBag.add(poolEntry);
      logger.debug("{} - Added connection {}", poolName, poolEntry.connection);
      quietlySleep(30L);
      break;
   } else {  // failed to get connection from db, sleep and retry
      if (loggingPrefix != null && backoffMs % 50 == 0)
         logger.debug("{} - Connection add failed, sleeping with backoff: {}ms", poolName, backoffMs);

      //THE PROBLEM IS HERE
      //the thread does not sleep because it is interrupted
      quietlySleep(backoffMs); 

      backoffMs = Math.min(SECONDS.toMillis(5), backoffMs * 2);
   }
}

@brettwooldridge
my suggestion would be to clear the interrupted flag in the snippet above.
i.e.

quietlySleep(backoffMs);
//clear interrupted state
Thread.interrupted();

PS: this might be related to #2208 and other "database outage" issues mentioned in it

@brettwooldridge
Copy link
Owner

What would cause the adder thread to be interrupted besides shutdown?

@PaoloVoronini
Copy link
Author

PaoloVoronini commented Nov 21, 2024

thanks Brett for the quick response!

unfortunately, i was not able to figure out the root cause reliably.
similar to other issues i mentioned, we observe this upon database outages
this is the last piece of the stacktrace (please find below)
so far i can only speculate that possibly the ojdbc sets the interrupted status of the thread for some reason

PS: just in case, i should mention that i wrote a little method to support the statement "thread does not sleep"

public static void main(String[] args) {
    Thread t = new Thread(() -> {
        while (true) {
            try {
                System.out.println("sleeping");
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                System.out.println("interrupted");
                Thread.currentThread().interrupt();
            }
        }
    });
    t.start();
    t.interrupt();
}

End of the stacktrace:

Caused by: java.io.InterruptedIOException: Socket read interrupted
    at oracle.net.nt.TimeoutSocketChannel.initializeSocketChannel(TimeoutSocketChannel.java:278) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.connect(TimeoutSocketChannel.java:235) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:202) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:335) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more 

@brettwooldridge
Copy link
Owner

brettwooldridge commented Nov 21, 2024

Thanks. Still trying to figure this out. java.io.InterruptedIOException extends java.io.IOException which extends java.lang.Exception. None of them extends java.lang.InterruptedException. The above exception should be caught and handled here. Still not a thread interrupt...

Is this running in Google GCD or something? Are you supplying an externally managed ScheduledExecutorService to the pool?

@PaoloVoronini
Copy link
Author

PaoloVoronini commented Nov 22, 2024

  1. we have observed this both with a component running in Google Cloud and on-premise
  2. no, we do not supply ScheduledExecutorService
  3. why have i not shared the full stacktrace? sorry! please find it below now

PS: yes, it does not make sense to me, neither.
first, the exception is not InterruptedException, but i could explain it (just a possibility) with original exception being something like ClosedByInterruptException, being caught somewhere (in ojbc or java.nio) and suppressed without clearing the interrupted state of the thread - this would explain the interrupt status without InterruptedException.
second, still someone should call "interrupt" on our thread in the first place. for this i see no explanation at all.

2024-10-24 09:34:52,386 INFO ? [UAT connection adder] throwing
java.sql.SQLRecoverableException: ORA-17002: I/O error: , Authentication lapse 0 ms.
https://docs.oracle.com/error-help/db/ora-17002/
    at oracle.jdbc.driver.T4CConnection.handleLogonIOException(T4CConnection.java:1658) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.handleLogonInterruptedIOException(T4CConnection.java:1519) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1132) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:1178) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:105) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:886) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:693) [ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:138) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:359) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:201) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:470) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:733) [HikariCP-5.0.1.jar:?]
    at com.zaxxer.hikari.pool.HikariPool$PoolEntryCreator.call(HikariPool.java:712) [HikariCP-5.0.1.jar:?]
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.io.IOException: , Authentication lapse 0 ms.
    at oracle.jdbc.driver.T4CConnection.handleLogonIOException(T4CConnection.java:1653) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 16 more
Caused by: java.io.InterruptedIOException:
    at oracle.net.nt.TcpNTAdapter.handleEstablishSocketException(TcpNTAdapter.java:401) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:346) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more
Caused by: java.io.InterruptedIOException: Socket read interrupted
    at oracle.net.nt.TimeoutSocketChannel.initializeSocketChannel(TimeoutSocketChannel.java:278) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.connect(TimeoutSocketChannel.java:235) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TimeoutSocketChannel.<init>(TimeoutSocketChannel.java:202) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.establishSocket(TcpNTAdapter.java:335) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.establishSocket(TcpsNTAdapter.java:248) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:224) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.TcpsNTAdapter.connect(TcpsNTAdapter.java:227) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnOption.connect(ConnOption.java:332) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.executeConnOption(ConnStrategy.java:1199) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:742) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:708) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:959) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.net.ns.NSProtocol.connect(NSProtocol.java:328) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.connectNetworkSessionProtocol(T4CConnection.java:3411) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:1016) ~[ojdbc11-23.4.0.24.05.jar:23.4.0.24.05]
    ... 14 more 

@brettwooldridge
Copy link
Owner

brettwooldridge commented Nov 22, 2024

java.sql.SQLRecoverableException: ORA-17002: I/O error: , Authentication lapse 0 ms.

This is a very specific and unusual error. Many threads point to an Oracle listener process falling over. But that is not conclusive.

I suggest looking at these Google results and this ChatGPT diagnosis.

It seems to me handling InterruptedException is a separate issue. My general reaction is that it is bad behavior to simply "clear and ignore" an InterruptedException. If anything, the entire pool should shutdown rather than continuing. I suspect addressing the underlying connection issue will also resolve the InterruptedException issue.

Also, you should ensure that you are running in HikariCP v6.2.1. Some threads point to improperly closed connections, and v6.0.0 contained a fix related to connection closing (double-close).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants