
Fixes #4349 #4350

Open · wants to merge 2 commits into master
Conversation

weiznich (Member) commented:
This commit fixes an issue that allowed triggering a runtime error by
passing an array of arrays to `.eq_any()`. We rewrite queries containing
`IN` expressions to `= ANY()` on PostgreSQL as that is more
efficient (it allows caching and allows binding all values at once). That's
not possible for arrays of arrays, as we do not support nested arrays
yet.

The fix introduces an associated constant for the `SqlType` trait that
tracks whether a SQL type is an `Array<T>` or not. This information is then
used to conditionally generate the "right" SQL. By defaulting to `false`
while adding this constant we do not break existing code. We use the
derive to set the right value based on the type name.

cc @Ten0

(I would suggest including this PR in the patch release as well.)
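For readers unfamiliar with the pattern, the associated-constant approach described above can be sketched roughly as follows. This is a minimal, self-contained model with made-up names, not Diesel's actual `SqlType` trait or SQL generation code:

```rust
// Hypothetical, simplified model of the approach: an associated const
// with a default value lets existing trait impls keep compiling, while
// the derive can opt array types in.
use std::marker::PhantomData;

trait SqlType {
    // Defaults to false, so adding this constant is non-breaking.
    const IS_ARRAY: bool = false;
}

struct Integer;
struct Array<T>(PhantomData<T>);

impl SqlType for Integer {}
impl<T: SqlType> SqlType for Array<T> {
    const IS_ARRAY: bool = true;
}

// SQL generation can then branch on the constant.
fn in_clause_sql<ST: SqlType>(value_count: usize) -> String {
    if ST::IS_ARRAY {
        // Nested arrays are unsupported, so fall back to a plain `IN` list
        // with one placeholder per value.
        let binds = vec!["?"; value_count].join(", ");
        format!("IN ({binds})")
    } else {
        // Single bind containing the whole array of values.
        "= ANY (?)".into()
    }
}

fn main() {
    assert_eq!(in_clause_sql::<Integer>(3), "= ANY (?)");
    assert_eq!(in_clause_sql::<Array<Integer>>(3), "IN (?, ?, ?)");
    println!("ok");
}
```

Because `IS_ARRAY` is a compile-time constant, the branch costs nothing at runtime and the default keeps downstream impls source-compatible.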

weiznich requested a review from a team on November 19, 2024.
Ten0 (Member) commented Nov 19, 2024

Thanks a lot for working on this! ⚡️
It seems ok to go with the hidden associated const. 🙂
I had not considered making this not part of the public API, but it looks like that solves my concern: we might otherwise publicly release something that is less "powerful" (or less fundamentally appropriate, given that we don't have associated-const equality) than a type alias would be, and then want to change it later on.
I did mostly convince myself that additional array nesting levels can't compile for other reasons anyway, so I think I'm happy with not being able to do that. :)

However, regarding the approach of `= ANY (VALUES (...), (...))` vs `IN`: `IN` seems to be mostly syntax sugar for expanding into multiple `OR` clauses, whereas `= ANY (VALUES (...), (...))` generates the following query plan:

```
Nested Loop  (cost=0.19..16.41 rows=2 width=40) (actual time=0.021..0.023 rows=2 loops=1)
  ->  Unique  (cost=0.04..0.04 rows=2 width=32) (actual time=0.015..0.016 rows=2 loops=1)
        ->  Sort  (cost=0.04..0.04 rows=2 width=32) (actual time=0.014..0.015 rows=2 loops=1)
              Sort Key: "*VALUES*".column1
              Sort Method: quicksort  Memory: 25kB
              ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32) (actual time=0.001..0.001 rows=2 loops=1)
  ->  Index Scan using value_index_key on table  (cost=0.15..8.17 rows=1 width=40) (actual time=0.002..0.002 rows=1 loops=2)
        Index Cond: (value = "*VALUES*".column1)
Planning Time: 0.231 ms
Execution Time: 0.033 ms
```

which seems more appropriate when there are many values.
That being said, I've also seen occurrences where Postgres would switch to a seq scan/bitmap heap scan with an O(n²) recheck on the array when doing `= ANY (ARRAY)`, which, given this query plan, leaves me wondering whether doing `= ANY (VALUES (...), (...))` wouldn't solve that issue even in the non-array case 😅

So overall it looks like we want the variant that behaves closest to what `= ANY (ARRAY)` does, and, if relevant for other scenarios, we might want to open up a `VALUES` subquery API for smarter query plans on large arrays.

I'll investigate further on what query plans get generated for all 3 cases tomorrow. 🙂

> I would suggest to include that PR in the patch release as well

This seems edge-case enough (considering that unless I missed an existing issue nobody else has reported it since Diesel 2.0) that I think I would tend to not consider it a release blocker 🙂

Ten0 (Member) commented Nov 19, 2024

So I have investigated query plans a bit more, and it looks like doing `IN` is probably the most consistent approach, because `IN` gets rewritten as `= ANY` in a lot of cases, and `VALUES` generates different query plans from `ANY`/`IN` even in the non-array case.

Ten0 (Member) commented Nov 20, 2024

TL;DR: an argument for using `= ANY(subquery)`, but discarded in the next post.

So I found this comment:
https://dba.stackexchange.com/a/335502/196014

Which suggests that `= ANY` may actually be not-that-dumb:

> When the length of the array is very large, the other queries simply do a parallel seq scan. This is very fast, because the "Filter where id=ANY(...)" isn't dumb; it uses some kind of fast search like hashing or bisection, and doesn't compare every row with every value of the array.

So it seems that `= ANY (...)` may sometimes be more performant than `IN`.

But my experience was contrary to this (it was, in fact, dumb, and I would see n² CPU-consumption profiles).

So I've performed new tests (see the code below), and I find:

  • Sometimes a bitmap heap scan with a seemingly O(n) recheck on each line, for a total of a clearly O(n²) query
  • Scenarios where it seems not to go n² despite a very similar bitmap heap scan, specifically when no hash index exists

Maybe I should send a message to the PG mailing list flagging this as a bug, because the two recheck-on-array scenarios are very inconsistent with regard to their algorithmic complexity...

So I don't know what to think... If `= ANY` is supposed to be smart when we pass in an array but `IN` isn't, maybe we want the "good worst case" of `VALUES`, to avoid a surprising perf loss for users who don't expect any significant difference in performance between `eq_any(array_of_values)` and `eq_any(array_of_values_that_happen_to_also_be_arrays)`.

But if `ANY` is known to sometimes be dumb, and from the PG perspective it's the user's responsibility to use `VALUES` instead of `= ANY(ARRAY)` where relevant, then, as stated above, the more consistent approach of using `IN` seems to win.

There is also a case for doing `IN` anyway because that's closest to the SQL. But if `= ANY(VALUES)` is really always the safest, maybe we want to expose that and recommend its usage over `eq_any`.

Tests:

```sql
BEGIN;
CREATE TABLE tmp_benchmark(
    id Int8 NOT NULL GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    value text NOT NULL
);
-- Insert 1M values
INSERT INTO tmp_benchmark(value) SELECT DISTINCT (floor(random() * 1000000000000)::int8)::text FROM generate_series(1,1000000);

ALTER TABLE tmp_benchmark ADD CONSTRAINT tmp_benchmark_unique_value_and_corresponding_index UNIQUE(value);

CREATE INDEX tmp_benchmark_benchmark_value_hash ON tmp_benchmark USING HASH (value);

ANALYZE tmp_benchmark;

-- Randomly sample 50k of those values
CREATE TEMPORARY TABLE lookup_test ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 50000;

EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test);
/* Gives:
Hash Semi Join  (cost=1445.00..20996.25 rows=50000 width=20) (actual time=4.595..107.748 rows=50000 loops=1)
  Hash Cond: (tmp_benchmark.value = lookup_test.value)
  ->  Seq Scan on tmp_benchmark  (cost=0.00..16370.00 rows=1000000 width=20) (actual time=0.008..39.863 rows=1000000 loops=1)
  ->  Hash  (cost=820.00..820.00 rows=50000 width=12) (actual time=4.566..4.567 rows=50000 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2704kB
        ->  Seq Scan on lookup_test  (cost=0.00..820.00 rows=50000 width=12) (actual time=0.006..1.836 rows=50000 loops=1)
Planning Time: 0.247 ms
Execution Time: 108.723 ms
 */
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=985.09..1024.13 rows=10 width=20) (actual time=26.626..14440.230 rows=50000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 7
  Heap Blocks: exact=6369
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=945.00..945.01 rows=1 width=32) (actual time=4.606..4.607 rows=1 loops=1)
          ->  Seq Scan on lookup_test  (cost=0.00..820.00 rows=50000 width=12) (actual time=0.007..1.998 rows=50000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=25.848..25.848 rows=50007 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.098 ms
Execution Time: 14441.520 ms
 */

-- Randomly sample 100k of those values

CREATE TEMPORARY TABLE lookup_test_2 ON COMMIT DROP AS
WITH sub AS (SELECT value FROM tmp_benchmark LIMIT 10000000)
SELECT value FROM sub ORDER BY RANDOM() LIMIT 100000;

EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (SELECT value FROM lookup_test_2);
/* Gives:
Nested Loop  (cost=1555.20..3025.00 rows=78336 width=20) (actual time=15.002..101.907 rows=100000 loops=1)
  ->  HashAggregate  (cost=1555.20..1557.20 rows=200 width=32) (actual time=14.995..26.162 rows=100000 loops=1)
        Group Key: lookup_test_2.value
        Batches: 5  Memory Usage: 6977kB  Disk Usage: 232kB
        ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.006..4.192 rows=100000 loops=1)
  ->  Index Scan using tmp_benchmark_benchmark_value_hash on tmp_benchmark  (cost=0.00..7.88 rows=1 width=20) (actual time=0.001..0.001 rows=1 loops=100000)
        Index Cond: (value = lookup_test_2.value)
        Rows Removed by Index Recheck: 0
Planning Time: 0.148 ms
Execution Time: 103.924 ms
 */
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1595.29..1634.33 rows=10 width=20) (actual time=44.862..58360.269 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Rows Removed by Index Recheck: 16
  Heap Blocks: exact=6370
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.109..9.110 rows=1 loops=1)
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.007..4.002 rows=100000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_benchmark_value_hash  (cost=0.00..40.08 rows=10 width=0) (actual time=44.242..44.243 rows=100018 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.089 ms
Execution Time: 58364.590 ms
 */

-- Note how with 2x the number of values, execution time is x4 the above. Monitoring CPU while this query runs shows 100% CPU, unlike the subselect-based variant.

DROP INDEX tmp_benchmark_benchmark_value_hash;
-- Once we remove the hash index, PG is not stupid anymore: even with the same 100k sample and bitmap heap scan plan it doesn't do n² somehow

-- On the same 100k sample:
EXPLAIN ANALYZE SELECT * FROM tmp_benchmark WHERE value = ANY (CAST((SELECT array_agg(value) FROM lookup_test_2) AS text[]));
/* Gives:
Bitmap Heap Scan on tmp_benchmark  (cost=1599.54..1638.58 rows=10 width=20) (actual time=236.604..252.188 rows=100000 loops=1)
  Recheck Cond: (value = ANY ($0))
  Heap Blocks: exact=6370
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1555.20..1555.21 rows=1 width=32) (actual time=9.091..9.093 rows=1 loops=1)
          ->  Seq Scan on lookup_test_2  (cost=0.00..1359.36 rows=78336 width=32) (actual time=0.009..3.946 rows=100000 loops=1)
  ->  Bitmap Index Scan on tmp_benchmark_unique_value_and_corresponding_index  (cost=0.00..44.33 rows=10 width=0) (actual time=236.169..236.169 rows=100000 loops=1)
        Index Cond: (value = ANY ($0))
Planning Time: 0.161 ms
Execution Time: 254.145 ms
*/

-- So despite doing a bitmap heap scan on the same 100k-element array as above, this one seemingly doesn't go n².

ROLLBACK;
```

Ten0 (Member) left a comment
I realized I also have examples in mind where `= ANY(array)` performs better than `= ANY(subquery)`, because it manages to push a filter further into the query plan. That ends up convincing me that we just want to map to SQL as closely as possible, i.e. we should do `IN ([1], [2], ...)`.

The only thing is we should probably document it, considering that `= ANY(...)` sometimes manages to not go n², and I suspect `IN ([1], [2], ...)` may not be as efficient in similar scenarios, since it seems to expand to just many `OR` clauses...
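As a toy illustration of the complexity concern (this is not Postgres internals, just a sketch of the two membership-test shapes): an `OR`-chain recheck scans the value list once per row, O(rows × values), while a hashed membership test builds a lookup structure once and probes it per row:

```rust
// Toy model of the two strategies discussed above.
use std::collections::HashSet;

// The OR-expansion behaviour: one linear pass over `values` per row.
fn matches_linear(rows: &[u64], values: &[u64]) -> usize {
    rows.iter().filter(|r| values.contains(r)).count()
}

// The hashed behaviour: build the lookup set once, then probe per row.
fn matches_hashed(rows: &[u64], values: &[u64]) -> usize {
    let set: HashSet<&u64> = values.iter().collect();
    rows.iter().filter(|r| set.contains(r)).count()
}

fn main() {
    let rows: Vec<u64> = (0..10_000).collect();
    let values: Vec<u64> = (0..10_000).step_by(2).collect();
    // Same answer, very different asymptotics as both sides grow.
    assert_eq!(matches_linear(&rows, &values), 5_000);
    assert_eq!(matches_hashed(&rows, &values), 5_000);
    println!("ok");
}
```

Both return the same result; the difference only shows up as the CPU profile when row and value counts grow together, which matches the 4× execution time for 2× the values observed in the benchmarks.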

```rust
let connection: &mut PgConnection = &mut connection();
let tag_combinations_to_look_for: &[&[&str]] = &[&["foo"], &["foo", "bar"], &["baz"]];
let result: Vec<i32> = posts
    .filter(tags.eq_any(tag_combinations_to_look_for))
```
Ten0 (Member):

we probably want to document this behavior in these places:

```rust
/// The postgres backend provided a specialized implementation
/// by using `left = ANY(values)` as optimized variant instead.

/// The postgres backend provided a specialized implementation
/// by using `left = ALL(values)` as optimized variant instead.

/// On PostgreSQL, this method automatically performs a `= ANY()`
/// query.
```

```rust
if ST::IS_ARRAY {
    self.walk_ansi_ast(out)
} else {
    out.push_bind_param::<Array<ST>, Vec<I>>(&self.values)
```
Ten0 (Member) commented Nov 20, 2024

this got me thinking (long-term thoughts): maybe `QueryId` should be generic over the backend as well, so that we can properly express that this variant could in fact use a static query id. If we added the option of `HAS_STATIC_QUERY_ID = false` to the `QueryId` derive, maybe it wouldn't be that verbose in general.
But maybe that's not worth it. In any case, that would be a separate change.
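The backend-generic idea could look roughly like the following. This is a hypothetical, self-contained sketch with made-up types (`Pg`, `Sqlite`, `EqAny`), not Diesel's current `QueryId` API:

```rust
// Hypothetical sketch: the same AST node reports a static query id on
// one backend and a dynamic one on another.
use std::marker::PhantomData;

struct Pg;
struct Sqlite;

// A QueryId trait parameterized by the backend.
trait QueryId<DB> {
    const HAS_STATIC_QUERY_ID: bool;
}

// Imaginary node standing in for the `IN` / `= ANY` expression.
struct EqAny<ST>(PhantomData<ST>);

// On PostgreSQL the values collapse into a single `= ANY($1)` bind,
// so the SQL shape (and thus the query id) is static.
impl<ST> QueryId<Pg> for EqAny<ST> {
    const HAS_STATIC_QUERY_ID: bool = true;
}

// On SQLite the number of placeholders depends on the value count,
// so there is no static query id.
impl<ST> QueryId<Sqlite> for EqAny<ST> {
    const HAS_STATIC_QUERY_ID: bool = false;
}

fn main() {
    assert!(<EqAny<i32> as QueryId<Pg>>::HAS_STATIC_QUERY_ID);
    assert!(!<EqAny<i32> as QueryId<Sqlite>>::HAS_STATIC_QUERY_ID);
    println!("ok");
}
```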

weiznich (Member, Author) replied:

That seems like a reasonable idea for the long run.

Although I remember that the benchmarks showed at some point that the cost of query construction for looking up statements in the prepared statement cache is only relevant for SQLite. It's not relevant for PostgreSQL due to the network round-trips.

Ten0 (Member) commented Nov 22, 2024

For completeness: while the "reproducer" above is indeed n² with expressions like `= ANY(ARRAY(subselect))`, I can't seem to reproduce the same n² with an actual `= ANY($1)` anymore, so that in fact probably behaves fine (postgres/postgres@50e17ad). This in turn means that the array case, where `array_value IN (arr1, arr2, arr3)` is expanded as `array_value = arr1 OR array_value = arr2 ...`, may be:

  1. Way slower than people might expect, because the `OR`s are rechecked linearly (n²)
  2. Quicker to crash, because Postgres will return a "stack depth limit exceeded" error when there are more than 10k values.
