Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Support Subquery pushdown #297

Open
smilingthax opened this issue Mar 15, 2024 · 0 comments
Open

[Feature request] Support Subquery pushdown #297

smilingthax opened this issue Mar 15, 2024 · 0 comments

Comments

@smilingthax
Copy link

Motivation

Examples:

  1. Simple case:
SELECT 
  id,
  (SELECT COUNT(*) FROM tbl2 WHERE tbl2.parent_id = tbl1.id) 
FROM tbl1;  -- + WHERE, ...

The query plain currently looks somewhat like this:

 Foreign Scan on tbl1
   Output: tbl1.id, (SubPlan 1)
   Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
   SubPlan 1
     ->  Foreign Scan
           Output: (count(tbl2.parent_id))
           Relations: Aggregate on (s.tbl2)
           Remote query: SELECT count(`id`) FROM `s`.`tbl2` WHERE ((`parent_id` = ?))

I.e. for each row of tbl1 a separate query is send to the remote database.

  1. Similar case (worse performance):
SELECT 
  id,
  EXISTS (SELECT FROM tbl2 WHERE tbl2.parent_id = tbl1.id) 
FROM tbl1;

Query Plan:

 Foreign Scan on tbl1
   Output: tbl1.id, (hashed SubPlan 2)
   Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
   SubPlan 2
     ->  Foreign Scan on s.tbl2
           Output: tbl2.parent_id
           Remote query: SELECT `parent_id` FROM `s`.`tbl2`

This unfortunately tries to download the complete tbl2!

  1. Another version (might be more difficult to implement?):
SELECT 
  id,
  r1.*
FROM tbl1,
LATERAL (SELECT COUNT(*) FROM tbl2 WHERE tbl2.parent_id = tbl1.id) r1;

Query Plan:

 Nested Loop
   Output: tbl1.id, (count(tbl2.parent_id))
   ->  Foreign Scan on s.tbl1
         Output: tbl1.id, ... more fields...
         Remote query: SELECT `id` FROM `s`.`tbl1`   -- + WHERE, ORDER BY, ...
   ->  Foreign Scan
         Output: (count(*))
         Relations: Aggregate on ("s".tbl2)
         Remote query: SELECT count(*) FROM `s`.`tbl2` WHERE ((`parent_id` = ?))

This is basically the same as 1.

Partial workaround:

SELECT
  tbl1.id,
  COUNT(tbl2.id)
FROM tbl1
LEFT JOIN tbl2 ON tbl2.parent_id = tbl1.id
GROUP BY 1

Query Plan: Single Foreign Scan, everything is pushed down

However, this no longer works when additional fields, e.g. tbl1.*, should be retrieved.

One might try this:

WITH t0 AS ( -- NOT MATERIALIZED does not help
  SELECT
    tbl1.id,
    COUNT(tbl2.id)
  FROM table
  LEFT JOIN tbl2 ON tbl2.parent_id = tbl1.id
  GROUP BY 1
  ORDER BY 1
)
SELECT
  tbl1.*,
  count
FROM t0
JOIN tbl1 ON tbl1.id = t0.id

But the resulting query plan (Merge Join between Foreign Scan of tbl1 and Materialize of Foreign Scan from t0) is not feasible for bigger tables, although SELECT * FROM t0 and SELECT * FROM tbl1 are quite fast by themselves...

Expected Result

The aggregation is evaluated on the remote mysql server without requiring a separate query for each row of tbl1.
Also, EXISTS should probably keep the WHERE, instead of downloading the whole table.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant