Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redshift Loader: Handle missing columns in CSV #1065

Open
istreeter opened this issue Sep 14, 2022 · 2 comments
Open

Redshift Loader: Handle missing columns in CSV #1065

istreeter opened this issue Sep 14, 2022 · 2 comments

Comments

@istreeter
Copy link
Contributor

This issue is about schema evolutions which add new columns. There is a problem that arises when the data is transformed using the older schema, but attempted to load using the newer schema.

There are a few situations where this problem could arise:

  • When loading historic data. It is a principle of RDB Loader that the transformed archive can be re-used at a future date to populate a new warehouse.
  • When re-processing a failed load. Imagine a batch fails to load on 10th January because of a connection failure. The schema is evolved on 11th January. And on 12th January you try re-loading the failed batch.
  • A simple race condition between the transformer and loader. Imagine the schema is evolved at approximately the same time that the transformer finishes a batch and writes the SQS message. (It's actually more subtle than this, because of how iglu clients cache schemas, but I think this situation is not uncommon).

In all those cases, the CSV file might have fewer columns than what is expected by the loader/warehouse.

When this happens, the loader logs contain this message:

Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_acme_myschema_1' failed.  Check 'stl_load_errors' system table for details.;

If you look up the error in the stl_load_errors table, then the error message is:

raw_field_value = \\N
error_code = 1214
error_reason = Delimiter not found

The solution is to always specify the columns in the COPY INTO statement. Currently we load a shredded table like this:

COPY com_acme_myschema_1
FROM `s3://mybucket/run=123`
DELIMITER '\t'

whereas we need it look like this:

COPY com_acme_myschema_1 (col1, col2, col3)
FROM `s3://mybucket/run=123`
DELIMITER '\t'

...where the column list refer to the columns that we know are present in the CSV file; not the columns that are present in the table.

@sdbeans
Copy link

sdbeans commented May 9, 2023

We've run into this issue a few times when upgrading schemas.
A minor version bump, adding new fields to a schema, results in this error from the RDB loader:

ERROR Loader: Loading of s3://path/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_snowplowanalytics_snowplow_mobile_context_1' failed.  Check 'stl_load_errors' system table for details.

And shows this error in stl_load_errors because the new fields do not exist in the CSV file.

Delimiter not found                                                                                 

From what I can tell, this is because the events were processed by the RDB transformer prior to the update, and the RDB loader is trying to process them against the new schema.

Currently running both the transformer and loader as docker images on EC2 instances.
snowplow/transformer-kinesis:5.0.0 and snowplow/rdb-loader-redshift:5.0.0

@ek1m
Copy link

ek1m commented May 17, 2023

We also have this problem.
CSV file might have fewer columns than what is expected by the loader/warehouse.
[2023-05-17 09:31:04.276+0000] [ERROR] Loader: Loading of s3://my-bucket/transformed/run=2023-05-09-05-34-31/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'mobile_idea_open_1' failed. Check 'stl_load_errors' system table for details.; - SqlState: XX000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants