Redshift Loader: Handle missing columns in CSV #1065

istreeter · 2022-09-14T17:07:45Z

This issue is about schema evolutions which add new columns. There is a problem that arises when the data is transformed using the older schema, but attempted to load using the newer schema.

There are a few situations where this problem could arise:

When loading historic data. It is a principle of RDB Loader that the transformed archive can be re-used at a future date to populate a new warehouse.
When re-processing a failed load. Imagine a batch fails to load on 10th January because of a connection failure. The schema is evolved on 11th January. And on 12th January you try re-loading the failed batch.
A simple race condition between the transformer and loader. Imagine the schema is evolved at approximately the same time that the transformer finishes a batch and writes the SQS message. (It's actually more subtle than this, because of how iglu clients cache schemas, but I think this situation is not uncommon).

In all those cases, the CSV file might have fewer columns than what is expected by the loader/warehouse.

When this happens, the loader logs contain this message:

Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_acme_myschema_1' failed.  Check 'stl_load_errors' system table for details.;

If you look up the error in the stl_load_errors table, then the error message is:

raw_field_value = \\N
error_code = 1214
error_reason = Delimiter not found

The solution is to always specify the columns in the COPY INTO statement. Currently we load a shredded table like this:

COPY com_acme_myschema_1
FROM `s3://mybucket/run=123`
DELIMITER '\t'

whereas we need it look like this:

COPY com_acme_myschema_1 (col1, col2, col3)
FROM `s3://mybucket/run=123`
DELIMITER '\t'

...where the column list refer to the columns that we know are present in the CSV file; not the columns that are present in the table.

The text was updated successfully, but these errors were encountered:

sdbeans · 2023-05-09T11:58:02Z

We've run into this issue a few times when upgrading schemas.
A minor version bump, adding new fields to a schema, results in this error from the RDB loader:

ERROR Loader: Loading of s3://path/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_snowplowanalytics_snowplow_mobile_context_1' failed.  Check 'stl_load_errors' system table for details.

And shows this error in stl_load_errors because the new fields do not exist in the CSV file.

Delimiter not found

From what I can tell, this is because the events were processed by the RDB transformer prior to the update, and the RDB loader is trying to process them against the new schema.

Currently running both the transformer and loader as docker images on EC2 instances.
snowplow/transformer-kinesis:5.0.0 and snowplow/rdb-loader-redshift:5.0.0

ek1m · 2023-05-17T11:13:33Z

We also have this problem.
CSV file might have fewer columns than what is expected by the loader/warehouse.
[2023-05-17 09:31:04.276+0000] [ERROR] Loader: Loading of s3://my-bucket/transformed/run=2023-05-09-05-34-31/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'mobile_idea_open_1' failed. Check 'stl_load_errors' system table for details.; - SqlState: XX000

spenes mentioned this issue Sep 15, 2022

Common: resovle a race condition between Shredder and Loader when ADDITION added #193

Closed

istreeter mentioned this issue Oct 24, 2022

Release R137 snowplow/iglu-central#1238

Merged

istreeter mentioned this issue Nov 1, 2022

Common: Upgrade yauaa to 7.x snowplow/enrich#678

Merged

2 tasks

istreeter mentioned this issue Nov 18, 2022

Add invalidation for the schema-list cache snowplow/iglu-scala-client#215

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Redshift Loader: Handle missing columns in CSV #1065

Redshift Loader: Handle missing columns in CSV #1065

istreeter commented Sep 14, 2022

sdbeans commented May 9, 2023

ek1m commented May 17, 2023

Redshift Loader: Handle missing columns in CSV #1065

Redshift Loader: Handle missing columns in CSV #1065

Comments

istreeter commented Sep 14, 2022

sdbeans commented May 9, 2023

ek1m commented May 17, 2023