Releases: snowplow/snowplow-rdb-loader
5.3.0
Attention
Snowflake Loader 5.3.0 contains a bug that prevents contexts and unstruct events from being copied to the events table. We've fixed this problem in version 5.3.1. The issue is explained here. If you have been affected by this bug, instructions to recover the missing data can be found here.
This release makes the Databricks loader resilient to invalid schema evolution.
Databricks invalid schema evolution recovery
What is schema evolution?
One of Snowplow’s key features is the ability to define custom schemas and validate events against them. Over time, users often evolve the schemas, e.g. by adding new fields or changing existing fields. To accommodate these changes, RDB loader automatically adjusts the database tables in the warehouse accordingly.
There are two main types of schema changes:
- Breaking: The schema version has to be changed in a major way (1-2-3 → 2-0-0). In Databricks, each major schema version has its own column (..._1, ..._2, etc., for example: contexts_com_snowplowanalytics_ad_click_1).
- Non-breaking: The schema version can be changed in a minor way (1-2-3 → 1-3-0 or 1-2-3 → 1-2-4). Data is stored in the same database column.
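For illustration, this is roughly how the two major versions would appear in the warehouse; a minimal sketch, assuming the events table is called atomic.events and using the ad_click entity above as an example:
-- Each major schema version lives in its own column of the events table
-- (table and column names here are illustrative).
SELECT
  contexts_com_snowplowanalytics_ad_click_1,  -- holds data for versions 1-*-*
  contexts_com_snowplowanalytics_ad_click_2   -- holds data for versions 2-*-*
FROM atomic.events
LIMIT 10;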
How it used to work
In the past, the transformer would format the data according to the latest version of the schema it saw (for a given major version, e.g. 1-*-*). For example, if a batch contained events with schema versions 1-0-0, 1-0-1 and 1-0-2, the transformer would derive the schema of the Parquet file based on version 1-0-2. Then the loader would instruct Databricks to adjust the database column and load the file.
This logic relied on two assumptions:
- Old events compatible with new schemas. Events with older schema versions, e.g. 1-0-0 and 1-0-1, had to be valid against the newer ones, e.g. 1-0-2. Those that were not valid would result in failed events.
- Old columns compatible with new schemas. The corresponding Databricks column type had to expand correctly from one version to another using mergeSchema = true (see the sketch below). In reality, this only works for adding and removing fields. Other changes, such as altering the type of a field from integer to string, would fail. Loading would break with SQL errors and the whole batch would be stuck and hard to recover.
These assumptions were not always clear to the users, making the transformer and loader error-prone.
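For context, the load step boils down to a Databricks COPY INTO with schema merging enabled, roughly like the sketch below (the table name and S3 path are placeholders, not the loader's exact statement):
-- Illustrative only: load the transformed Parquet output and let the target
-- table schema evolve via mergeSchema.
COPY INTO atomic.events
FROM 's3://transformed-bucket/run=2023-01-01-00-00-00/output=good/'
FILEFORMAT = PARQUET
COPY_OPTIONS ('mergeSchema' = 'true');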
What happens now?
Transformer and loader are now more robust, and the data is easy to recover if the schema was not evolved correctly.
First, we support schema evolution that’s not strictly backwards compatible (although we still recommend against it since it can confuse downstream consumers of the data). This is done by merging multiple schemas so that both old and new events can coexist. For example, suppose we have these two schemas:
{
// 1-0-0
"properties": {
"a": {"type": "integer"}
}
}
{
// 1-0-1
"properties": {
"b": {"type": "integer"}
}
}
These would be merged into the following:
{
// merged
"properties": {
"a": {"type": "integer"},
"b": {"type": "integer"}
}
}
Second, the loader does not fail when it can’t modify the database column to store both old and new events. (As a reminder, an example would be changing the type of a field from integer to string.) Instead, it creates a temporary column for the new data as an exception. The users can then run SQL statements to resolve this situation as they see fit. For instance, consider these two schemas:
{
// 1-0-0
"properties": {
"a": {"type": "integer"}
}
}
{
// 1-0-1
"properties": {
"a": {"type": "string"}
}
}
Because 1-0-1 events cannot be loaded into the same column with 1-0-0, the data would be put in a separate column, e.g. contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999, where:
- 1_0_1 is the version of the offending schema;
- 9999999 is a hash code unique to the schema (i.e. it will change if the schema is overwritten with a different one).
If you create a new schema 1-0-2 that reverts the offending changes and is again compatible with 1-0-0, the data for events with that schema will be written to the original column as expected.
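For example, you could start by finding the rows whose data landed in the recovery column and then decide how to backfill or discard them; a minimal sketch, assuming the standard atomic.events table and the column names above:
-- Illustrative only: list events whose data went into the recovery column.
SELECT
  event_id,
  collector_tstamp,
  contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999
FROM atomic.events
WHERE contexts_com_snowplowanalytics_ad_click_1_0_1_recovered_9999999 IS NOT NULL;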
Notes
- If no events with incorrectly evolved schemas arrive, the recovery column is not created.
- It is still possible to break loading by overwriting version 1-0-0 of the schema. Please avoid doing that.
Other improvements
- In the Databricks Loader, we've started to use the CREATE OR REPLACE TABLE statement instead of DROP and CREATE when creating the temporary folder monitoring table, since DROP and CREATE can sometimes cause problems. Thanks very much to @dkbrkjni for his contributions! Related Github issue
- In the Redshift Loader, folder monitoring wasn't working when the stage bucket's region and the Redshift cluster's region were different. This problem is fixed in this version.
Full Changelog
5.2.1
This patch release brings stability improvements and dependency updates to address security vulnerabilities.
Changelog
- Loader: trim the alert error message for loader shutting down (#1079)
- Loader: add timeout to rollback and commit statements (#1161)
- Transformer Batch: set datetimeRebaseModeInWrite Spark config property (#1154)
- Common: bump 'actions/setup-python' to v4 (#1153)
- Loader: Create the events table on startup (#1034)
- Loader: Destroy and create new SSH tunnel after errors (#1148)
- Bump Databricks JDBC to 2.6.29 (#1163)
- Bump Snowflake JDBC to 3.13.25 (#1163)
- Bump schema-ddl to 0.15.2 (#1134)
- Bump Iglu Scala Client to 1.3.0 (#1163)
- Bump protobuf-java to 3.21.7 (#1163)
- Bump netty-codec to 4.1.86.Final (#1163)
5.2.0
This release brings Parquet support to Transformer Pubsub. Also, it brings various new features and improvements to RDB Loader applications.
Scheduler for running 'optimize' command on Databricks Loader
Loader applications use a manifest table to keep track of the folders loaded so far. However, we've found that frequent manifest table updates result in a growing number of files backing this table in Databricks, which severely degrades loading performance. A similar problem affects the events table too.
The remedy for the issue is to run an OPTIMIZE command on the table, compacting all updates into a small number of files.
To make this process smoother for users, we've added a scheduler that runs the OPTIMIZE command regularly, according to the given CRON statement.
The OPTIMIZE scheduler can be configured like so:
"schedules": {
...
# Run the optimize command on the events table every day at 00:00 (JVM timezone)
"optimizeEvents": "0 0 0 ? * *",
# Run the optimize command on the manifest table every day at 05:00 (JVM timezone)
"optimizeManifest": "0 0 5 ? * *"
}
The Databricks Loader uses the above values by default. If you want to disable these schedulers completely, set them to null:
"schedules": {
...
"optimizeEvents": null,
"optimizeManifest": null
}
Note: This feature requires the collector_tstamp_date generated column in the events table. We recommend disabling this feature if you don't have this column in your events table. If the feature is enabled and the collector_tstamp_date column doesn't exist, you might see some errors in the application logs; however, those errors shouldn't interfere with the normal operation of the application.
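For reference, the scheduler effectively runs the Databricks OPTIMIZE statement against both tables. If you ever need to compact them manually, a minimal sketch (assuming the default atomic schema and table names used in your deployment) is:
-- Illustrative only: compact the files backing the events and manifest tables.
OPTIMIZE atomic.events;
OPTIMIZE atomic.manifest;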
Parquet support in Transformer Pubsub
In this release, Transformer Pubsub comes with the ability to output in Parquet format. You need to add the following section to the Transformer Pubsub config to enable Parquet output:
"formats": {
"fileFormat": "parquet"
}
You can find the configuration reference to prepare the configuration file and instructions to deploy the application in the docs.
New authorization method in Redshift Loader
In version 4.1.0, we introduced a new authorization method in the Snowflake Loader and Databricks Loader. In this release, we are adding the same method to the Redshift Loader.
This method generates temporary credentials using STS and passes them to Redshift, removing the need to pre-configure the warehouse with access permissions.
To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:
"storage": {
// other required fields go here
"loadAuthMethod": {
"type": "TempCreds"
"roleArn": "arn:aws:iam::123456789:role/example_role_name"
}
}
...where roleArn is a role with permission to read files from the S3 bucket. The loader must have permission to assume this role. More information about authorization options can be found in the docs.
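Under the hood, the loader assumes the configured role via STS and passes the resulting short-lived credentials to the COPY statement, along the lines of this sketch (the bucket, table and credential values are placeholders, and the real statement includes format options omitted here):
-- Illustrative only: COPY using temporary credentials obtained from STS.
COPY atomic.events
FROM 's3://transformed-bucket/run=2023-01-01-00-00-00/output=good/'
ACCESS_KEY_ID '<temporary-access-key-id>'
SECRET_ACCESS_KEY '<temporary-secret-access-key>'
SESSION_TOKEN '<temporary-session-token>';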
Full Changelog
- Redshift: Use STS tokens for copying from S3 (#998)
- Snowflake Loader: auto-configure staging location paths (#1059)
- Transformer Pubsub: support Parquet output option (#1124)
- Common Stream Transformer: add Sentry (#881)
- Common: add code format check as CI/CD step (#1089)
- Loader: Improve minimum_age_of_loaded_data metric (#1111)
- Transformer: cache result of flattening schema (#1086)
- Databricks loader: Performance degradation for long running application (#1029)
- Kinesis transformer: flaky wide row test (#1030)
5.1.2
This patch release brings a bug fix to Snowflake Loader. This bug can affect you if you have a JSON schema with a name that is styled like camelCase but uses several capital letters in a row, similar to the following: testTESTSchema. The bug doesn't affect schemas with the usual snake_case, camelCase, or PascalCase style names.
Changelog
- Snowflake Loader: create column name with same method in everywhere (#1128)
5.1.1
5.1.0
This release brings SSH tunnel connection recovery to Redshift Loader. Also, it makes disabling in-batch natural deduplication in Batch Transformer possible.
Option to disable in-batch natural deduplication in Batch Transformer
Previously, it wasn't possible to disable in-batch natural deduplication in Batch Transformer. We have found that in-batch natural deduplication affects performance, so we have made it possible to disable it. If duplicate events aren't a problem for you, we suggest disabling deduplication.
It can be disabled by adding the following section to the config:
"deduplication": {
# When natural deduplication is disabled, 'synthetic' deduplication needs to be disabled too.
"synthetic": {
"type": "NONE"
}
"natural": false
}
More information about deduplication in Batch Transformer can be found here.
SSH tunnel connection recovery in Redshift Loader
The Redshift Loader can connect to a private Redshift cluster through an SSH tunnel. Previously, if the SSH tunnel session was disconnected, the loader had no way to detect it. We have added retries around the SSH tunnel connection so that the loader can recover from this problem and is more robust.
Upgrading to 5.1.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.1.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:5.1.0
docker pull snowplow/rdb-loader-redshift:5.1.0
docker pull snowplow/rdb-loader-snowflake:5.1.0
docker pull snowplow/rdb-loader-databricks:5.1.0
The Snowplow docs site has a full guide to running the RDB Loader.
Changelog
5.0.0
This release brings the first GCP supported applications in the RDB Loader application family: Snowflake Loader and Transformer Pubsub.
Additionally, this release brings a few bug fixes on Databricks Loader and Transformer Kinesis.
GCP support on Snowflake Loader and Transformer Pubsub
From their inception, RDB Loader applications were developed to run on AWS. Making it possible to run them on GCP has been on our roadmap for a long time. In this release, we pave the way by integrating GCP services into the Snowflake Loader so that it can run entirely on GCP. A GCP counterpart of the transformer, Transformer Pubsub, has been created as well. With these additions, it is possible to load Snowplow data from a GCP pipeline into Snowflake.
At the moment, Transformer Pubsub can't output in Parquet format. Adding support for it is on our roadmap as well, and will in turn make the Databricks Loader possible on GCP.
How to start loading into Snowflake on GCP
First, you need to deploy Transformer Pubsub. A minimal configuration file for Transformer Pubsub looks like the following:
{
# Name of the Pubsub subscription with the enriched events
"input": {
"subscription": "projects/project-id/subscriptions/subscription-id"
}
# Path to transformed archive
"output": {
"path": "gs://bucket/transformed/"
}
# Name of the Pubsub topic used to communicate with Loader
"queue": {
"topic": "projects/project-id/topics/topic-id"
}
}
You can find the configuration reference to prepare the configuration file and instructions to deploy the application in the docs.
Then, for the Snowflake Loader part you'll need to:
- setup the necessary Snowflake resources
- prepare configuration files for the loader
- deploy the Snowflake Loader app
The important bit in the Snowflake Loader config is that Pubsub should be used as the message queue:
...
"messageQueue": {
"type": "pubsub"
"subscription": "projects/project-id/subscriptions/subscription-id"
}
...
Full documentation for Snowflake Loader can be found here.
Bug fixes on Databricks Loader and Transformer Kinesis
- It was reported that there was an issue in the Databricks Loader when loading a batch containing multiple Parquet files with different schemas, where an optional column exists only in some of the files. This issue is fixed in version 5.0.0. Thanks drphrozen for reporting the issue and submitting a PR!
- It was reported that Transformer Kinesis throws an exception when the Kinesis stream shard count is increased. This issue is fixed in version 5.0.0. Thanks sdbeans for reporting the issue!
Adding telemetry to loader apps and Transformer Pubsub
At Snowplow, we are trying to improve our products every day, and understanding which features are popular is an important part of focusing our development effort in the right place. Therefore, we are adding telemetry to the loader apps and Transformer Pubsub. Essentially, it sends heartbeats with some minimal meta-information about the application.
You can help us by providing userProvidedId in the config file:
"telemetry" {
"userProvidedId": "myCompany"
}
Telemetry can be deactivated by putting the following section in the configuration file:
"telemetry": {
"disable": true
}
More information about telemetry in RDB Loader project can be found here.
Upgrading to 5.0.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 5.0.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:5.0.0
docker pull snowplow/rdb-loader-redshift:5.0.0
docker pull snowplow/rdb-loader-snowflake:5.0.0
docker pull snowplow/rdb-loader-databricks:5.0.0
The Snowplow docs site has a full guide to running the RDB Loader.
4.3.0
This release brings some important bug fixes, especially around table migrations. Table migrations are an important feature of RDB Loader: if you update an Iglu schema (e.g. from version 1-0-0 to 1-0-1) then the loader automatically alters the target table to accommodate the newer events. However, we discovered a number of edge cases where migrations did not work as expected.
Redshift loader: Fix to alter max length of text fields
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 3.0.0 and does not affect older versions.
If you update an Iglu schema by raising the maxLength setting for a string field, then RDB Loader should respond by altering the table, e.g. from VARCHAR(10) to VARCHAR(20). Because of this bug, RDB Loader did not attempt to alter the column length; it would instead attempt to load the newer events into the table without running the migrations. You might be affected by this bug if you have recently updated an Iglu schema by raising the max length of a field. If you think you have been affected by this bug, we suggest you check your entity tables and manually alter the table if needed:
ALTER TABLE <SHREDDED_TABLE_NAME> ALTER COLUMN <EXTENDED_COLUMN> TYPE VARCHAR(<NEW_SIZE>);
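To see the current column definitions before deciding on the new size, one option is to query Redshift's pg_table_def catalog view; a minimal sketch, assuming your entity table's schema is on the search_path and using a placeholder table name:
-- Illustrative only: inspect the column types of an entity table.
SELECT "column", type
FROM pg_table_def
WHERE tablename = 'com_acme_my_entity_1';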
The bug is fixed in this new 4.3.0 release. Once you upgrade to 4.3.0, RDB Loader will be prepared to correctly migrate your table in response to future field length changes.
Redshift loader: Fix to recover from failed migrations
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.2.0.
If a table migration is immediately followed by a batch which cannot be loaded for any reason, then a table could be left in an inconsistent state where a migration was partially applied. If this ever happened, then RDB Loader could get stuck on successive loads with error messages like:
Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000
With this new 4.3.0 release, the inconsistent state is still reachable (due to Redshift limitations), but the loader is robust to recover from it.
Redshift loader: Fix migrations for batches with multiple versions of the same schema
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.3.0.
It is possible and completely allowed for a batch of events to contain multiple versions of the same schema, e.g. both 1-0-0 and 1-0-1. However, because of this bug, the loader was in danger of trying to perform table migrations twice. This could result in an error message like (same error as in the previous case):
Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000
or the following one, depending on the schema migration:
Invalid operation: cannot add column "CCCCCCC" of relation "XXXXXXX", already exists; - SqlState: 0A000
This is fixed in the 4.3.0 release, and now the loader will not enter this failure state if a batch contains multiple versions of the same schema.
Snowflake loader: Configure folder monitoring without a stage, while doing loading with a stage
This is a new feature you can benefit from if you load into Snowflake with RDB Loader. The Snowflake loader allows two alternative methods for authentication between the warehouse and the S3 bucket: either using Snowflake storage integration, or using temporary credentials generated with AWS STS. Previously, you were forced to pick the same method for loading events and for folder monitoring. With this change, it is possible to use the storage integration for loading events, but temporary credentials for folder monitoring. This is beneficial if you want the faster load times from using a storage integration, but do not want to go through the overhead of setting up a storage integration just for folder monitoring.
Take a look at the github issue for more details on the different ways to configure the loader to use the different authentication methods.
Snowflake and Databricks loaders: Fix inserting timestamps with wrong timezone to manifest table
This is a low-impact bug that is not expected to have any detrimental effect on loading. It could affect your pipeline if you load into Snowflake or Databricks, and if your warehouse is set to have a non-UTC timezone by default.
This bug affects the manifest table, which is the table the loader uses to track which batches have been loaded already. Because of this bug, timestamps in the manifest table were stored using the default timezone of the warehouse, not UTC. This bug could only affect you in the unlikely case you use the manifest table for some other purpose.
Starting from this version 4.3.0 release, we now take care to insert timestamps with the UTC timezone.
Upgrading to 4.3.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.3.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.3.0
docker pull snowplow/rdb-loader-redshift:4.3.0
docker pull snowplow/rdb-loader-snowflake:4.3.0
docker pull snowplow/rdb-loader-databricks:4.3.0
The Snowplow docs site has a full guide to running the RDB Loader.
4.2.2
This minor release brings stability improvements and better error messaging in the loaders.