Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug preventing orderedAggregate and distinctAggregator from being used together (alternative approach) #578

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

shane-circuithub
Copy link
Contributor

(This is an alternative fix to the bug described in #577 that doesn't require adding Eq and Ord constraints to PrimExpr.)

This PR contains two commits.

The first commit adds a (failing) test case demonstrating the inability to combine orderedAggregate and distinctAggregator together, even in the limited cases allowed by PostgreSQL. The problem is that PostgreSQL only allows this if the expressions given in the ORDER BY clause match the expressions given as arguments to the aggregation function.

The second commit fixes this bug. It does so by adding Eq and Ord constraints to Symbol (but not PrimExpr) and using them to detect identical references anywhere within the PrimExprs contained with the Aggregate. Instead of renaming the entire expressions, we only rename references to fields (which could possibly be lateral references, though this isn't checked). We maintain a Map of already-renamed symbols such that identical references will be renamed to the same symbol, fulfilling PostgreSQL's restriction.

@shane-circuithub
Copy link
Contributor Author

This is based on the refactoring in #575.

@tomjaguarpaw
Copy link
Owner

I think this is my preferred approach but I won't be able to get to it immediately.

@tomjaguarpaw
Copy link
Owner

Hi @shane-circuithub, please ping me if this is is urgent for you. I'll get to it in due course but I'm short of spare mental cycles at the moment.

@shane-circuithub shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from b4d0804 to ad27062 Compare October 15, 2023 21:55
@shane-circuithub
Copy link
Contributor Author

shane-circuithub commented Oct 16, 2023

It's not necessarily urgent that you merge this, because I can just pin to this commit for now, but I am actually using this. I didn't realise when I first wrote this, but this is actually needed for ordered-set aggregation functions to work. Take the following query for example:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY column1)
FROM (
  VALUES
    (1),
    (2),
    (3),
    (4)
) _;
 percentile_cont 
-----------------
             2.5
(1 row)

If you try to recreate that with Opaleye (without this commit), it will produce the following (paraphrased):

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    0.5 AS inner_1,
    column1 AS inner_2
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;
ERROR:  column "_.inner_1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

With this PR, this becomes:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY inner_1)
FROM (
  SELECT
    column1 AS inner_1
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;
 percentile_cont 
-----------------
             2.5
(1 row)

With this PR, extractAggregateFields only rewrites references (e.g., column1 to inner_1), it doesn't rewrite non-reference expressions (such as 0.5), which allows Opaleye produces valid SQL for ordered set aggregation functions. This is why I closed #577 in favour if this PR, because #577 wouldn't solve this problem.

The other "valid" way to construct ordered set aggregation functions (which admittedly I care much less about supporting) is with the direct argument in a GROUP BY clause, e.g.:

SELECT
  percentile_cont(column1) WITHIN GROUP (ORDER BY column2)
FROM (
  VALUES
    (0.5, 1),
    (0.5, 2),
    (0.5, 3),
    (0.5, 4),
    (0.8, 5),
    (0.8, 6),
    (0.8, 7),
    (0.8, 8)
) _
GROUP BY
  column1;

Without this PR, if you try to write this in Opaleye it will produce the following SQL (again paraphrased):

SELECT
  percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
FROM (
  SELECT
    column1 AS inner_1,
    column1 AS inner_2,
    column2 AS inner_3
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;
ERROR:  column "_.inner_2" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

Even though inner_1 and inner_2 refer to the same underlying column1, Postgres isn't smart enough to detect that, it wants the exact same expression in the GROUP BY clause and the direct argument. However, with this PR, it becomes:

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    column1 AS inner_1,
    column2 AS inner_2
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;
 percentile_cont 
-----------------
             2.5
             7.4

Because this PR rewrites only references, and because it keeps track of already-rewritten references and makes sure to only to rename each reference once, both the GROUP BY and the direct argument to percentile_cont end up with the same inner_1 expression.

This is not to put pressure on you to deal with this right now if you don't have the capacity (again I'm happy to just pin to this commit in the short-term), but it just explains an additional motivation that I wasn't aware of when I first wrote this.

@tomjaguarpaw
Copy link
Owner

tomjaguarpaw commented Oct 20, 2023

The first commit adds a (failing) test

Could you double check? That test passes here.

EDIT: No, it fails here (as expected)!

@tomjaguarpaw
Copy link
Owner

Hmm, and if fails in CI. Maybe a difference between Postgres versions? I'm using 13 locally and CI is using 11.

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587659165/job/17898358429#step:17:627

@tomjaguarpaw
Copy link
Owner

Hmm, no it fails in CI in 13 too. I'm really confused why it doesn't fail for me locally!

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587744501/job/17898614746#step:17:631

@tomjaguarpaw
Copy link
Owner

No, it does fail locally. Not sure what I was thinking. Never mind.

@tomjaguarpaw
Copy link
Owner

The error is "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list", so I think I have to understand that first.

@tomjaguarpaw
Copy link
Owner

I think we have a bigger problem, because if I slightly change your test case then your fix is no longer sufficient. I appreciate that your fix is enough to make it work with well-formed examples that should work but I'd much rather find a fix that doesn't allow generating incorrect SQL.


@ Test/Test.hs:626 @ testStringArrayAggregateOrderedDistinct :: Test
 testStringArrayAggregateOrderedDistinct = it "" $ q `selectShouldReturnSorted` expected
   where q =
           O.aggregateOrdered
-            (O.asc snd)
+            (O.asc (\x -> snd x O..++ snd x))
             (PP.p2 (O.arrayAgg, O.distinctAggregator . O.stringAgg . O.sqlString $ ","))
             table7Q
         expected = [ ( map fst sortedData
@ Test/Test.hs:1506 @ main = do

@tomjaguarpaw
Copy link
Owner

tomjaguarpaw commented Oct 27, 2023

I can now understand the aggregator DISTINCT issue (but not the WITHIN GROUP one). If you have an aggregation of the following form then the expressions p, q, r must each occur literally within the set {a,b,c,...}. That's a pretty annoying restriction! And it will be hard to enforce statically, but that's how I'd like to solve this particular issue. I'll have more of a think.

FROM (SELECT
     AGGREGATOR(DISTINCT a,b,c,... ORDER BY p,q,r, ...) as "result3_3"

@tomjaguarpaw
Copy link
Owner

I think the correct API for this is probably something like

makeAggregator ::
  AggregatorFunction w a b ->
  Order a ->
  Aggregator a b

makeDistinctAggregator ::
  AggregatorFunction w a b ->
  Order w ->
  Aggregator a b

where the AggregatorFunctions are like

stringAgg ::
  AggregatorFunction
    (Wrap (Field SqlText), Wrap (Field SqlText))
    (Field SqlText, Field SqlText)
    (Field SqlText)

and Wrap is an opaque wrapper that allows you to reorder Fields, but doesn't allow you to apply any Field functions to them. That is, the w parameter to the AggregatorFunction is the exact arguments supplied to the SQL aggregation function. So for example, we would have

wrap :: Field a -> Wrap (Field a)

ascWrap :: Order (Wrap (Field a))

asc :: Order (Field a)
asc = contramap wrap ascWrap

I think this would work. It's a bit heavyweight, but actually resolves my preexisting concerns about the semantics of orderAggregate, aggregateOrdered and distinctAggregator.

I've no idea whether the same approach would work for the WITHIN GROUP problem.

@shane-circuithub shane-circuithub force-pushed the aggregate-distinct-order-by-fix-2 branch from ad27062 to 56e733e Compare November 20, 2023 11:52
tomjaguarpaw added a commit that referenced this pull request Jan 21, 2024
It's redundant because we only need to rebind PrimExprs that came from
a lateral subquery.  As explained at [1], rather than carefully
analysing which PrimExprs came from a lateral subquery we can just
rebind everything.  A previous commit [2] changed things so that
everything mentioned in aggregator order expressions is rebound.
Instead we probably should have done what this commit does, that is,
added an Unpackspec constraint to aggregate.

Fixes #587

This is a simpler approach to resolving the issues discussed at

* #585
* #578

This still suffers from the problem described at:

#578 (comment)

i.e. if we find a way of duplicating field names, such as

    O.asc (\x -> snd x O..++ snd x)

then we can still create crashing queries.  The benefit of this
comment is that there is a way of generating non-crashing queries!

[1] https://github.com/tomjaguarpaw/haskell-opaleye/blob/52a4063188dd617ff91050dc0f2e27fc0570633c/src/Opaleye/Internal/Aggregate.hs#L111-L114

[2] d848317, part of
    #576
tomjaguarpaw added a commit that referenced this pull request Jan 26, 2024
It's redundant because we only need to rebind PrimExprs that came from
a lateral subquery.  As explained at [1], rather than carefully
analysing which PrimExprs came from a lateral subquery we can just
rebind everything.  A previous commit [2] changed things so that
everything mentioned in aggregator order expressions is rebound.
Instead we probably should have done what this commit does, that is,
added an Unpackspec constraint to aggregate.

Fixes #587

This is a simpler approach to resolving the issues discussed at

* #585
* #578

This still suffers from the problem described at:

#578 (comment)

i.e. if we find a way of duplicating field names, such as

    O.asc (\x -> snd x O..++ snd x)

then we can still create crashing queries.  The benefit of this
comment is that there is a way of generating non-crashing queries!

[1] https://github.com/tomjaguarpaw/haskell-opaleye/blob/52a4063188dd617ff91050dc0f2e27fc0570633c/src/Opaleye/Internal/Aggregate.hs#L111-L114

[2] d848317, part of
    #576
tomjaguarpaw added a commit that referenced this pull request Feb 6, 2024
It's redundant because we only need to rebind PrimExprs that came from
a lateral subquery.  As explained at [1], rather than carefully
analysing which PrimExprs came from a lateral subquery we can just
rebind everything.  A previous commit [2] changed things so that
everything mentioned in aggregator order expressions is rebound.
Instead we probably should have done what this commit does, that is,
added an Unpackspec constraint to aggregate.

Fixes #587

This is a simpler approach to resolving the issues discussed at

* #585
* #578

This still suffers from the problem described at:

#578 (comment)

i.e. if we find a way of duplicating field names, such as

    O.asc (\x -> snd x O..++ snd x)

then we can still create crashing queries.  The benefit of this
comment is that there is a way of generating non-crashing queries!

[1] https://github.com/tomjaguarpaw/haskell-opaleye/blob/52a4063188dd617ff91050dc0f2e27fc0570633c/src/Opaleye/Internal/Aggregate.hs#L111-L114

[2] d848317, part of
    #576
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants