Hey all,
After only a few minutes of playing around with the Snowflake Kafka Connector (both SNOWPIPE and SNOWPIPE_STREAMING), I'm seeing constant full table scans when querying the landing area directly. While this might still be performant enough for my current use cases, I can imagine it becoming an issue over time.
Use Cases
What are the newest x records ingested into Snowflake?
What is the lag on the Snowflake side compared to the high-water mark on Kafka?
Work exploratively on raw data.
Efficiently prune the Kafka tables after a period of time (days or months).
Current Approach
I would query something like this:
SELECT
    RECORD_METADATA:partition::varchar AS PARTITION,
    MAX(RECORD_METADATA:offset::number) AS MAX_OFFSET,
    MAX(RECORD_METADATA:CreateTime::varchar::timestamp) AS LATEST_RECORD
FROM LANDING.TOPIC_1
GROUP BY 1
ORDER BY 1;
If my understanding is correct (and my own experiments confirm it), this results in a full table scan every time.
The usual workaround is to use Streams and Tasks to manually schematize the record as well as the metadata; a sketch of that approach follows below. But due to the fixed schedule of the task, working on the schematized result increases the access delay significantly.
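A minimal sketch of that workaround, assuming placeholder names for the flattened table, warehouse and schedule:

-- Capture new rows from the landing table (all names below are placeholders).
CREATE OR REPLACE STREAM LANDING.TOPIC_1_STREAM ON TABLE LANDING.TOPIC_1;

-- Flattened target with the metadata written out as relational columns.
CREATE OR REPLACE TABLE LANDING.TOPIC_1_FLAT (
    KAFKA_PARTITION NUMBER,
    KAFKA_OFFSET    NUMBER,
    CREATE_TIME     TIMESTAMP_NTZ,
    RECORD_CONTENT  VARIANT
);

-- The fixed schedule here is exactly what introduces the extra access delay.
CREATE OR REPLACE TASK LANDING.TOPIC_1_FLATTEN_TASK
    WAREHOUSE = MY_WH
    SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('LANDING.TOPIC_1_STREAM')
AS
INSERT INTO LANDING.TOPIC_1_FLAT
SELECT
    RECORD_METADATA:partition::number,
    RECORD_METADATA:offset::number,
    RECORD_METADATA:CreateTime::varchar::timestamp,
    RECORD_CONTENT
FROM LANDING.TOPIC_1_STREAM;

ALTER TASK LANDING.TOPIC_1_FLATTEN_TASK RESUME;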
Further, in exploratory and monitoring use-cases where you're working on raw data and limiting the analysis interval, it would be incredibly helpful to make use of pruning on the CreateTime field.
Proposal
Based on these thoughts and observations, I propose writing out CreateTime, offset and partition as separate columns and filtering on these relational columns instead, as sketched below.
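A hedged sketch of what that could look like; the column names are illustrative and not what the connector produces today:

-- Hypothetical landing-table shape with the metadata written out relationally.
CREATE TABLE LANDING.TOPIC_1 (
    KAFKA_PARTITION NUMBER,
    KAFKA_OFFSET    NUMBER,
    CREATE_TIME     TIMESTAMP_NTZ,
    RECORD_METADATA VARIANT,
    RECORD_CONTENT  VARIANT
);

-- The monitoring query from above, now able to prune micro-partitions on CREATE_TIME.
SELECT
    KAFKA_PARTITION,
    MAX(KAFKA_OFFSET) AS MAX_OFFSET,
    MAX(CREATE_TIME)  AS LATEST_RECORD
FROM LANDING.TOPIC_1
WHERE CREATE_TIME >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 1;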
I know that schematization is on the roadmap, but according to the design document this will not include the RECORD_METADATA field.
I'm also happy to contribute this enhancement if you agree with these points.
It's stored as a VARCHAR because setting "transforms.convertTimestamp.target.type": "Timestamp" didn't seem to work, but partition pruning with this column still works.
Just wanted to share that I was able to get the offset, partition and create time for JSON without using the schema registry. The options you need to set are below, and Kafka Connect will do the rest to populate the values into the given column names. A schema registry is only required if you use a format other than JSON.
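The original options were not captured here, so the following is only a sketch of the kind of setup being described, assuming the standard Kafka Connect InsertField and TimestampConverter transforms; the field names are placeholders:

# Illustrative SMT configuration (properties form); field names are placeholders.
transforms=insertMetadata,convertTimestamp
transforms.insertMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertMetadata.partition.field=KAFKA_PARTITION
transforms.insertMetadata.offset.field=KAFKA_OFFSET
transforms.insertMetadata.timestamp.field=CREATE_TIME
transforms.convertTimestamp.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.convertTimestamp.field=CREATE_TIME
# target.type=Timestamp reportedly did not work (see the note above), hence the string/VARCHAR result.
transforms.convertTimestamp.target.type=string
transforms.convertTimestamp.format=yyyy-MM-dd HH:mm:ss.SSS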
Setting snowflake.enable.schematization to true for JSON helps create these values as separate columns in the target table, and hence also helps with partition pruning.
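For reference, a minimal sketch of that property in connector-config (properties) form:

# Enable schematization so record fields land as individual columns.
snowflake.enable.schematization=true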