Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON predicate pushdown for Pinot #23966

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 22 additions & 5 deletions docs/src/main/sphinx/connector/pinot.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ The Pinot connector allows Trino to query data stored in

To connect to Pinot, you need:

- Pinot 1.1.0 or higher.
- Pinot 1.2.0 or higher.
- Network access from the Trino coordinator and workers to the Pinot controller
nodes. Port 8098 is the default port.

Expand All @@ -33,9 +33,9 @@ This can be the ip or the FDQN, the url scheme (`http://`) is optional.
### General configuration properties

| Property name | Required | Description |
|--------------------------------------------------------|----------| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|--------------------------------------------------------|----------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `pinot.controller-urls` | Yes | A comma separated list of controller hosts. If Pinot is deployed via [Kubernetes](https://kubernetes.io/) this needs to point to the controller service endpoint. The Pinot broker and server must be accessible via DNS as Pinot returns hostnames and not IP addresses. |
| `pinot.broker-url` | No | A host and port of broker. If broker URL exposed by Pinot controller API is not accessible, this property can be used to specify the broker endpoint. Enabling this property will disable broker discovery. |
| `pinot.broker-url` | No | A host and port of broker. If broker URL exposed by Pinot controller API is not accessible, this property can be used to specify the broker endpoint. Enabling this property will disable broker discovery. |
| `pinot.connection-timeout` | No | Pinot connection timeout, default is `15s`. |
| `pinot.metadata-expiry` | No | Pinot metadata expiration time, default is `2m`. |
| `pinot.controller.authentication.type` | No | Pinot authentication method for controller requests. Allowed values are `NONE` and `PASSWORD` - defaults to `NONE` which is no authentication. |
Expand All @@ -44,7 +44,7 @@ This can be the ip or the FDQN, the url scheme (`http://`) is optional.
| `pinot.broker.authentication.type` | No | Pinot authentication method for broker requests. Allowed values are `NONE` and `PASSWORD` - defaults to `NONE` which is no authentication. |
| `pinot.broker.authentication.user` | No | Broker username for basic authentication method. |
| `pinot.broker.authentication.password` | No | Broker password for basic authentication method. |
| `pinot.max-rows-per-split-for-segment-queries` | No | Fail query if Pinot server split returns more rows than configured, default to `2,147,483,647`. |
| `pinot.max-rows-per-split-for-segment-queries` | No | Fail query if Pinot server split returns more rows than configured, default to `2,147,483,647`. |
| `pinot.prefer-broker-queries` | No | Pinot query plan prefers to query Pinot broker, default is `true`. |
| `pinot.forbid-segment-queries` | No | Forbid parallel querying and force all querying to happen via the broker, default is `false`. |
| `pinot.segments-per-split` | No | The number of segments processed in a split. Setting this higher reduces the number of requests made to Pinot. This is useful for smaller Pinot clusters, default is `1`. |
Expand All @@ -53,6 +53,7 @@ This can be the ip or the FDQN, the url scheme (`http://`) is optional.
| `pinot.max-rows-for-broker-queries` | No | Max rows for a broker query can return, default is `50,000`. |
| `pinot.aggregation-pushdown.enabled` | No | Push down aggregation queries, default is `true`. |
| `pinot.count-distinct-pushdown.enabled` | No | Push down count distinct queries to Pinot, default is `true`. |
| `pinot.json-predicate-pushdown.enabled` | No | Push down JSON predicates to Pinot. The default is `false` because Pinot's `JSON_MATCH()` generates an error when a JSON index is missing. |
| `pinot.target-segment-page-size` | No | Max allowed page size for segment query, default is `1MB`. |
| `pinot.proxy.enabled` | No | Use Pinot Proxy for controller and broker requests, default is `false`. |

Expand Down Expand Up @@ -206,8 +207,11 @@ The connector provides {ref}`globally available <sql-globally-available>` and
{ref}`read operation <sql-read-operations>` statements to access data and
metadata in Pinot.


## Performance

(pinot-pushdown)=
## Pushdown
### Pushdown

The connector supports pushdown for a number of operations:

Expand All @@ -231,5 +235,18 @@ significant performance impact. If you encounter this problem, you can disable
it with the catalog property `pinot.count-distinct-pushdown.enabled` or the
catalog session property `count_distinct_pushdown_enabled`.

JSON predicate pushdown for the following expressions:

- `JSON_EXTRACT(json, '$.selector') IS NULL`
- `JSON_EXTRACT_SCALAR(json, '$.selector') IS NULL'`
- `CONTAINS(ARRAY['string'], JSON_EXTRACT_SCALAR(json, '$.selector'))`
- `CONTAINS(ARRAY[5], CAST(JSON_EXTRACT_SCALAR(json, '$.selector') AS INTEGER))"))`
- `JSON_ARRAY_CONTAINS(JSON_EXTRACT(json, '$.selector'), 'string')`
- `JSON_ARRAY_CONTAINS(JSON_EXTRACT(json, '$.selector'), 5)`
- `JSON_ARRAY_CONTAINS(JSON_EXTRACT(json, '$.selector'), 'string') =/!= true/false`
- `NOT(JSON_ARRAY_CONTAINS(JSON_EXTRACT(json, '$.selector'), 'string'))`
- `CONTAINS(ARRAY['string'], JSON_EXTRACT_SCALAR(json, '$.selector')) =/!= true/false`
- `NOT(CONTAINS(ARRAY['string'], JSON_EXTRACT_SCALAR(json, '$.selector')))`

```{include} pushdown-correctness-behavior.fragment
```
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ public class PinotConfig
private int maxRowsForBrokerQueries = 50_000;
private boolean aggregationPushdownEnabled = true;
private boolean countDistinctPushdownEnabled = true;
private boolean jsonPredicatePushdownEnabled;
private boolean proxyEnabled;
private DataSize targetSegmentPageSize = DataSize.of(1, MEGABYTE);

Expand Down Expand Up @@ -221,6 +222,11 @@ public boolean isCountDistinctPushdownEnabled()
return countDistinctPushdownEnabled;
}

public boolean isJsonPredicatePushdownEnabled()
{
return jsonPredicatePushdownEnabled;
}

@Config("pinot.count-distinct-pushdown.enabled")
@ConfigDescription("Controls whether distinct count is pushed down to Pinot. Distinct count pushdown can cause Pinot to do a full scan. Aggregation pushdown must also be enabled in addition to this parameter otherwise no pushdowns will be enabled.")
public PinotConfig setCountDistinctPushdownEnabled(boolean countDistinctPushdownEnabled)
Expand All @@ -229,6 +235,14 @@ public PinotConfig setCountDistinctPushdownEnabled(boolean countDistinctPushdown
return this;
}

@Config("pinot.json-predicate-pushdown.enabled")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this toggle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new session property (pinot.json_predicate_pushdown_enabled) defaults to false because JSON_MATCH can fail when a JSON index hasn't been configured in Pinot.

Is it possible to check the index config during analyze?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a way to determine if the JSON index has been completely applied. Retrieving the table config from the Pinot Controller API (/tables/{tableName}) and inspecting tableIndexConfig.jsonIndexColumns would only confirm that it has been configured and the query could still fail if the JSON index was recently added and the table's segments haven't been reloaded.

Copy link
Author

@robertzych robertzych Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr Assuming a missing JSON index rarely happens (or only during development), I propose we change the default value of this config to true. Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we definitely need this toogle. If we can switch it to true by default is a bit interesting. @robertzych mentioned in the Trino Contributor Call https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-contributor-call-13-dec-2024 that this could lead to a scenario where queries that used to work fail after an upgrade to a Trino version with this pushdown support.

So I think it depends a bit on what we did in similar situations for other connectors. Given that it can be breaking queries for users we seem to need the toggle.

@ConfigDescription("Controls whether JSON predicates are pushed down to Pinot. A JSON predicate is converted to a Pinot JSON_MATCH function call which generates an error when a JSON index is missing.")
public PinotConfig setJsonPredicatePushdownEnabled(boolean jsonPredicatePushdownEnabled)
{
this.jsonPredicatePushdownEnabled = jsonPredicatePushdownEnabled;
return this;
}

public boolean isTlsEnabled()
{
return "https".equalsIgnoreCase(getControllerUrls().get(0).getScheme());
Expand Down
Loading
Loading