Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock on connection loss during runtime #154

Open
dfrese opened this issue Oct 11, 2023 · 27 comments
Open

Deadlock on connection loss during runtime #154

dfrese opened this issue Oct 11, 2023 · 27 comments

Comments

@dfrese
Copy link

dfrese commented Oct 11, 2023

Hi there!

If the connection to the Oracle server is lost during the runtime of an application (Ecto/Phoenix), resp the server is down for a short time, the Ecto adapter seems to deadlock and also does not recover when the Oracle server is available again.

I only see these log messages, once per worker in the pool (the second one actually twice per worker),

[error] Jamdb.Oracle (#PID<0.818.0>) disconnected: ** (DBConnection.ConnectionError) :closed
[error] Jamdb.Oracle (#PID<0.1055.0>) failed to connect: ** (DBConnection.ConnectionError) {:local, [host: 'localhost', ...]}

With the Ecto Postgresql-Adapater for example, I get a DBConnection.ConnectionError exception raised from the Repo functions instead (after a few retry attempts). But if the Postgres server is back, the Repo works again.

@vstavskyi
Copy link
Member

Ping doesn't work during transaction

  def ping(%{mode: :idle} = s) do
    case query(s, 'PING') do
      {:ok, _} -> {:ok, s}
      {:error, err} -> {:disconnect, error!(err), s}
      {:disconnect, err} -> {:disconnect, error!(err), s}
    end
  end
  def ping(%{mode: :transaction} = s) do
    {:ok, s}
  end

Try this

  def ping(s) do
    case query(s, 'PING') do
      {:ok, _} -> {:ok, s}
      {:error, err} -> {:disconnect, error!(err), s}
      {:disconnect, err} -> {:disconnect, error!(err), s}
    end
  end

@vstavskyi
Copy link
Member

Disconnect is not working properly in ecto, disconnect seems never calling.

  def disconnect(_err, %{pid: pid}) do
    :jamdb_oracle.stop(pid) 
  end

Repo.query("CLOSE") works.

@dfrese
Copy link
Author

dfrese commented Oct 12, 2023

Try this

  def ping(s) do
    case query(s, 'PING') do
      {:ok, _} -> {:ok, s}
      {:error, err} -> {:disconnect, error!(err), s}
      {:disconnect, err} -> {:disconnect, error!(err), s}
    end
  end

It seems this changed the situation:

I see this exception now

%DBConnection.ConnectionError{
  message: "'ORA-01089: immediate shutdown or close in progress - no operations are permitted\\n'",
  severity: :error,
  reason: :error
}

but then also this one

 %DBConnection.ConnectionError{
  message: "{:badmatch, {:error, :socket, :closed, {:oraclient, #Port<0.12>, :disconnected, 1, :select, '15', 871, 15, 8192, 19, {15000, 500}, #PID<0.992.0>, {1094931114, {0, 0, []}}, [{:format, \"\", :in, 2, 22, 0, 0}], [host: 'localhost', port: 1521, ...], #PID<0.833.0>, [], #PID<0.983.0>}}}",
  severity: :error,
  reason: :error
}

which seems to be caused here:

** (MatchError) no match of right hand side value: {:error, :socket, :closed, {:oraclient, #Port<0.11>, :disconnected, 1, :select, '393', 871, 15, 8192, 19, {15000, 500}, #PID<0.991.0>, ..., [{:format, "", :in, 2, 22, 0, 0}], [host: 'localhost', port: 1521, ...], #PID<0.835.0>, [], #PID<0.987.0>}}
    (jamdb_oracle 0.5.6) src/jamdb_oracle_conn.erl:205: :jamdb_oracle_conn.handle_req/3
    (jamdb_oracle 0.5.6) src/jamdb_oracle_conn.erl:237: :jamdb_oracle_conn.send_req/2
    (jamdb_oracle 0.5.6) src/jamdb_oracle_conn.erl:76: :jamdb_oracle_conn.disconnect/2
    (jamdb_oracle 0.5.6) src/jamdb_oracle.erl:59: :jamdb_oracle.handle_call/3

That looks related to 'disconnect' that you mentioned.
(I removed some query and connection details).

@vstavskyi
Copy link
Member

Maybe it's fixed

@dfrese
Copy link
Author

dfrese commented Oct 19, 2023

Looks better, yes. Thank you!

I also see this now

%DBConnection.ConnectionError{
  message: ":closed",
  severity: :error,
  reason: :error
}

and then later a simple

(RuntimeError) could not lookup Ecto repo MyRepo because it was not started or it does not exist

It doesn't recover by itself however; but I now see that it doesn't anymore with postgresql either (or maybe never did)... that has probably something to do with my application and supervision.

It behaves the same now as with the postgresql adapter 👍

@dfrese
Copy link
Author

dfrese commented Oct 20, 2023

Or maybe not, sorry. Tested again today (after fixing my App). And weirdly I now see this:

[error] Jamdb.Oracle (#PID<0.987.0>) failed to connect: ** (DBConnection.ConnectionError) {:badarg, [{:erlang, :exit, [:undefined, :ok], [error_info: %{module: :erl_erts_errors}]}, {:jamdb_oracle_conn, :disconnect, 2, [file: 'src/jamdb_oracle_conn.erl', line: 73]}, {:jamdb_oracle_conn, :handle_error, 3, [file: 'src/jamdb_oracle_conn.erl', line: 183]}, {:jamdb_oracle, :init, 1, [file: 'src/jamdb_oracle.erl', line: 38]}, {:gen_server, :init_it, 2, [file: 'gen_server.erl', line: 851]}, {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 814]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}

Passwd seems to be :undefined in disconnect. Does that tell you something?

@vstavskyi
Copy link
Member

fixed

@dfrese
Copy link
Author

dfrese commented Oct 23, 2023

Yes, that's gone.

What (I think) I found out, though:

  • With the Postgresql adapter, the Ecto.Repo process does not exit. Accessing the db will just continue to raise a DBConnection.ConnectionError, until the Database is available again. (even when it is not available on startup)
  • With the Oracle adapater, the Ecto.Repo seems to exit with reason :normal, which is unfortunate, because a Supervisor will not even try to restart it then.

I would suggest to do it like the Postgresql adapter, and not exit the process. But exiting with some other reason than :normal sound reasonable, too. Ecto should define a standard for adapters, imho.

@vstavskyi
Copy link
Member

vstavskyi commented Oct 23, 2023

I've looked at postgrex/type_server.ex
Try a few changes in stage

handle_call(stop, _From, #oraclient{conn_state=disconnected} = State) ->
    jamdb_oracle_conn:disconnect(State, 0),
    {noreply, State};
handle_call(stop, _From, State) ->
    jamdb_oracle_conn:disconnect(State, 1),
    {stop, normal, ok, State};
handle_info(timeout, State) ->
    {stop, normal, State};
handle_info(_Info, State) ->
    {noreply, State}.

@dfrese
Copy link
Author

dfrese commented Oct 23, 2023

The repo process still exits; and I think still :normal.
Also, it seems it exits the process calling the Repo function now, if already in 'disconnected' state.

Postgres is: Repo process starts and keeps running regardless of connectivity; then the Repo functions raise an exception (after trying to connect/checkout a worker).
The exception is this with the postgres adapter:

** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1679ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

@vstavskyi
Copy link
Member

Maybe :ok at the end of disconnect is needed.
Like in postgrex/protocol.ex
Try stage
jamdb_oracle.ex
jamdb_oracle_conn.erl
jamdb_oracle.erl

Also try different reasons in jamdb_oracle.erl
{stop, shutdown, ok, State}
instead of
{stop, normal, ok, State}

@dfrese
Copy link
Author

dfrese commented Oct 24, 2023

That didn't change much.

But I just dug my head into a bit, and noticed that a major difference to the Postgres adapter is, that it doesn't start an extra process for a DBConnection. See Postgres.Protocol for example - https://github.com/elixir-ecto/postgrex/blob/v0.17.3/lib/postgrex/protocol.ex#L70

So I thought the equivalent would be not to use a :jamdb_oracle genserver in Jamdb.Oracle, but to use :jamdb_oracle_conn directly. Replacing pid in the Jamdb.Oracle struct with a conn, and thread changes to the conn though.

I tried that a bit quick&dirty, and that works really fine! Behaves like the Ecto Postgres adapter now - i.e. The Repo process keeps running, but using it always throws the "request was dropped from queue" exception... until Oracle is accessible (again).

This should also be more efficient, corresponding to the documentation of DBConnection (https://github.com/elixir-ecto/db_connection#design)

One important design detail in DBConnection is that it avoids copying data. Other database libraries would send a request to the connection process, perform the query in the connection process, and then send it back to the client. This means a lot of data copying in Elixir. DBConnection keeps the socket in the holder and works on it directly.

If you like, and agree, I can try to work a bit more on this and make a proper pull request; although it's actually relatively simple: query looks like this

  @spec query(conn :: %Jamdb.Oracle{}, sql :: any(), params :: any()) ::
    {:ok | :cont, any(), %Jamdb.Oracle{}} | {:error | :disconnect, any(), %Jamdb.Oracle{}}
  def query(conn, sql, params \\ [])
  def query(%{conn: conn, timeout: timeout} = s, sql, params) do
    case :jamdb_oracle_conn.sql_query(conn, stmt(sql, params), timeout) do
      {:ok, [{:result_set, columns, _, rows}], conn} ->
        {:ok, %{num_rows: length(rows), rows: rows, columns: columns}, %{s | conn: conn}}
      {:ok, [{:fetched_rows, _, _, _} = result], conn} -> {:cont, result, %{s | conn: conn}}
      {:ok, [{:proc_result, 0, rows}], conn} -> {:ok, %{num_rows: length(rows), rows: rows}, %{s | conn: conn}}
      {:ok, [{:proc_result, _, msg}], conn} -> {:error, msg, %{s | conn: conn}}
      {:ok, [{:affected_rows, num_rows}], conn} -> {:ok, %{num_rows: num_rows, rows: nil}, %{s | conn: conn}}
      {:ok, result, conn} -> {:ok, result, %{s | conn: conn}}
      {:error, _type, reason, conn} -> {:disconnect, reason, %{s | conn: conn}}
    end
  end

And then basically just threading the updated s in the handle_* functions. I only wasn't sure what the return values of :jamdb_oracle_conn.connect mean.

What do you think?

@vstavskyi
Copy link
Member

vstavskyi commented Oct 25, 2023

Yes, I like it.

type conn -> DBConnection.conn() - > GenServer.server() -> pid() | name() | {atom(), node()}
So, conn must be pid of genserver also?

In :jamdb_oracle_conn.sql_query first param is record #oraclient{}
How it works with pid? I need to see full source code.

@vstavskyi
Copy link
Member

vstavskyi commented Oct 25, 2023

erlang:process_flag(trap_exit, true), was added
Like in postgrex/type_server.ex

pid #PID<0.336.0 is linked to pool #PID<0.333.0>
It seems erlang:link not needed

One more #PID<0.331.0>
This is conn from jamdb_oracle_ecto.ex

erlang:process_info output

{{#PID<0.333.0>,
[
current_function: {Jamdb.Oracle, :connect, 1},
initial_call: {:proc_lib, :init_p, 5},
status: :running,
message_queue_len: 0,
links: [#PID<0.332.0>],
dictionary: [
"$initial_call": {DBConnection.Connection, :init, 1},
"$ancestors": [#PID<0.332.0>, DBConnection.ConnectionPool.Supervisor,
DBConnection.App, #PID<0.312.0>]
],
trap_exit: false,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.311.0>,
total_heap_size: 752,
heap_size: 376,
stack_size: 33,
reductions: 139,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 1
],
suspending: []
]}

{#PID<0.336.0>,
[
current_function: {:gen_server, :loop, 7},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 0,
links: [#Port<0.4>, #PID<0.333.0>],
dictionary: [
"$initial_call": {:jamdb_oracle, :init, 1},
"$ancestors": [#PID<0.333.0>, #PID<0.332.0>,
DBConnection.ConnectionPool.Supervisor, DBConnection.App, #PID<0.312.0>]
],
trap_exit: true,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.311.0>,
total_heap_size: 6771,
heap_size: 4185,
stack_size: 12,
reductions: 123597,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 261
],
suspending: []
]}

{#PID<0.331.0>,
[
current_function: {:gen_server, :loop, 7},
initial_call: {:proc_lib, :init_p, 5},
status: :waiting,
message_queue_len: 0,
links: [#PID<0.330.0>, #PID<0.317.0>],
dictionary: [
connection_module: Jamdb.Oracle,
"$initial_call": {DBConnection.ConnectionPool, :init, 1},
"$ancestors": [Jamdb.Repo, Jamdb.Supervisor, #PID<0.328.0>]
],
trap_exit: false,
error_handler: :error_handler,
priority: :normal,
group_leader: #PID<0.327.0>,
total_heap_size: 376,
heap_size: 376,
stack_size: 12,
reductions: 188,
garbage_collection: [
max_heap_size: %{error_logger: true, kill: true, size: 0},
min_bin_vheap_size: 46422,
min_heap_size: 233,
fullsweep_after: 65535,
minor_gcs: 0
],
suspending: []
]}

@dfrese
Copy link
Author

dfrese commented Oct 25, 2023

Yes, I like it.

type conn -> DBConnection.conn() - > GenServer.server() -> pid() | name() | {atom(), node()} So, conn must be pid of genserver also?

In :jamdb_oracle_conn.sql_query first param is record #oraclient{} How it works with pid? I need to see full source code.

No, I didn't use :jamdb_oracle (the genserver) at all now. So no pid anymore, just :jamdb_oracle_conn directly.
You can look at it here: https://github.com/active-group/jamdb_oracle/tree/dbconnection_direct
This works as expected when cutting the route to the oracle server, and reconnecting it again.
(I now have a (new?) problem with tests and Ecto.Adapters.SQL.Sandbox though... but that might not be directly related; I'm not sure)

There are two TODOs left on that branch:

  1. In which situations can :jamdb_oracle_conn.sql_query throw exceptions? Didn't catch them for now. They were catched in jamdb_oracle.el :sql_query though
  2. What does the {:ok, result, conn} return of :jamdb_oracle_conn.connect mean? jamdb_oracle.el stopped the server in that case.

I also had to add

handle_error(socket, Reason, State) ->
    disconnect(State),
    {error, socket, Reason, State#oraclient{conn_state=disconnected}};

to jamdb_oracle_conn - as without a socket connection disconnect(State, 1) seemed to hang.

@vstavskyi
Copy link
Member

Thank you, I'll check.

I created branch dbconnection_direct

when throw exceptions

For example, if connection was closed and pid doesn't exists.

{ok, pid} = jamdb_oracle:start_link(?DEF_OPTS),
jamdb_oracle:sql_query(pid, "CLOSE"),
Result = jamdb_oracle:sql_query(pid, "select * from dual"),

{ok, result}

In some rare cases, but not sure

@vstavskyi
Copy link
Member

Please check and test master branch and stage branch.

:pid is used in master, but better error handling.
:conn is used in stage.

@dfrese
Copy link
Author

dfrese commented Oct 31, 2023

I wrote a test case, adding it to jamdb_oracle_test.exs. As the behaviour when the db is initially not available is the same as when the connection is lost on the way (or assuming that..):

  test "DBConnection behaviour on connection errors" do
    some_query = %Jamdb.Oracle.Query{statement: ["select * from dual"]}
    # with an unreachable oracle server,

    # connection should start anyway,
    assert {:ok, conn} = Jamdb.Oracle.start_link(hostname: "localhost", port: 7777, username: "bla", password: "foo", database: "does_not_exist", pool_size: 1)

    # just using it should throw an DBConnection.Error
    assert_raise DBConnection.ConnectionError, fn -> DBConnection.prepare_execute!(conn, some_query, []) end
  end

It fails in master and succeeds in stage.

Feel free to copy and commit that code if you like.

@vstavskyi
Copy link
Member

I got the identical results.

(DBConnection.ConnectionError) connection not available and request was dropped from queue after 2000ms.

@dfrese
Copy link
Author

dfrese commented Nov 2, 2023

Identical to me, or identical between the two branches?

On the current master (501f45c) I get

...........................................................................

  1) test DBConnection behaviour on connection errors (Jamdb.OracleTest)
     test/jamdb_oracle_test.exs:1231
     ** (EXIT from #PID<0.386.0>) killed

...............
Finished in 0.3 seconds (0.3s async, 0.00s sync)
91 tests, 1 failure
elixir --version
Erlang/OTP 25 [erts-13.1.2] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit]

Elixir 1.14.5 (compiled with Erlang/OTP 25)

@vstavskyi
Copy link
Member

I got error on master too :(

Can't understand why connect is calling multiple times, if :jamdb_oracle.start_link fails.

@vstavskyi
Copy link
Member

fixed !?

@dfrese
Copy link
Author

dfrese commented Nov 3, 2023

Can't understand why connect is calling multiple times, if :jamdb_oracle.start_link fails.

I would say it is not supposed to fail. Postgrex fails when essential config options are missing, but not when the server is currently not answering:

iex(1)> Postgrex.start_link(database: "foo", hostname: "invalid.invalid")
{:ok, #PID<0.638.0>}

(only writes log messages; process does not die)

[error] Postgrex.Protocol (#PID<0.640.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (invalid.invalid:5432): non-existing domain - :nxdomain
[error] Postgrex.Protocol (#PID<0.640.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (invalid.invalid:5432): non-existing domain - :nxdomain
iex(3)> Postgrex.start_link([])
{:ok, #PID<0.644.0>}
** (EXIT from #PID<0.636.0>) shell process exited with reason: killed

@dfrese
Copy link
Author

dfrese commented Nov 3, 2023

fixed !?

Uhh, I would expect a lot of other problems following from that.

I think the solution on the stage branch is what DBConnection expects from implementations. Also with respect to their "no data has to be copied" statement.

@dfrese
Copy link
Author

dfrese commented Nov 13, 2023

Found a little bug on the stable branch: In query the last case should be

      {:error, err, conn} -> {:disconnect, err, %{s | conn: conn}}

instead of

      {:error, err, conn} -> {:disconnect, err, conn}

@dfrese
Copy link
Author

dfrese commented Nov 13, 2023

I also had some weird occurrences of errors, where a jamdb_oracle functions tried something on values that seemed to represent messages from other processes (like a Phoenix.Socket message).

I looked into jamdb_oracle_conn.erl, and the little 'tricks' with Passwd, Cursor and Task seem suspicious.
Could it be that these states would have to be piped through instead now, i.e. as part of the #oraclient{} struct?

@dfrese
Copy link
Author

dfrese commented Nov 16, 2023

I also had some weird occurrences of errors, where a jamdb_oracle functions tried something on values that seemed to represent messages from other processes (like a Phoenix.Socket message).

I looked into jamdb_oracle_conn.erl, and the little 'tricks' with Passwd, Cursor and Task seem suspicious. Could it be that these states would have to be piped through instead now, i.e. as part of the #oraclient{} struct?

I did that on this branch: https://github.com/active-group/jamdb_oracle/tree/conn_no_mvars btw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants