Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better compile time checks if the column exists or not #3322

Open
sungchun12 opened this issue Oct 31, 2024 · 1 comment
Open

Better compile time checks if the column exists or not #3322

sungchun12 opened this issue Oct 31, 2024 · 1 comment
Assignees
Labels
Feature Adds new functionality

Comments

@sungchun12
Copy link
Contributor

Make SQLMesh smarter when running compile time checks by validating a column exists or not BEFORE running queries in the query engine.

Scenario: I have a SQL model below where I intentionally add a fake column that does not exist. The upstream model tcloud_demo.seed_raw_payments has columns defined explicitly. When I run a new plan, it will fail as expected at the query engine level. I want it to fail BEFORE that happens. This makes developing in SQLMesh much faster as users don't have to wait for the query engine to fix obvious errors, and it incentivizes them to explicitly define their columns.

Note: this will not solve for models defined as select * which is a fair expectation

(.venv) ➜  tobiko-cloud-demo git:(demo-sung) ✗ sqlmesh plan dev
======================================================================
Successfully Ran 1 tests against duckdb
----------------------------------------------------------------------
Summary of differences against `dev`:
Models:
├── Directly Modified:
│   └── tcloud_demo__dev.stg_payments
└── Indirectly Modified:
    ├── tcloud_demo__dev.orders
    └── tcloud_demo__dev.customers
---                                                                                        
                                                                                           
+++                                                                                        
                                                                                           
@@ -16,5 +16,6 @@                                                                          
                                                                                           
   order_id,                                                                               
   payment_method,                                                                         
   amount / 100 AS amount,                                                                 
-  'example_column' AS example_column                                                      
+  'example_column' AS example_column,                                                     
+  fake_column                                                                             
 FROM tcloud_demo.seed_raw_payments                                                        
Directly Modified: tcloud_demo__dev.stg_payments (Breaking)
└── Indirectly Modified Children:
    ├── tcloud_demo__dev.customers (Indirect Breaking)
    └── tcloud_demo__dev.orders (Indirect Breaking)
Models needing backfill (missing dates):
├── tcloud_demo__dev.customers: 2024-10-30 - 2024-10-30
├── tcloud_demo__dev.orders: 2024-10-30 - 2024-10-30
└── tcloud_demo__dev.stg_payments: 2024-10-30 - 2024-10-30
Enter the backfill start date (eg. '1 year', '2020-01-01') or blank to backfill from the 
beginning of history: 
Enter the backfill end date (eg. '1 month ago', '2020-01-01') or blank to backfill up until
'2024-10-31 00:00:00': 
Apply - Backfill Tables [y/n]: y
Creating physical table ━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━ 62.5% • 5/8 • 0:00:042024-10-31 13:45:24,191 - ThreadPoolExecutor-2_0 - sqlmesh.core.renderer - WARNING - Column '"fake_column"' could not be resolved for model '"sqlmesh-public-demo"."tcloud_demo"."stg_payments"', the column may not exist or is ambiguous (renderer.py:517)
Creating physical table ━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━ 62.5% • 5/8 • 0:00:05
2024-10-31 13:45:25,427 - MainThread - sqlmesh.core.context - ERROR - Apply Failure: Traceback (most recent call last):
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/utils/concurrency.py", line 69, in _process_node
    self.fn(node)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/utils/concurrency.py", line 165, in <lambda>
    lambda s_id: fn(snapshots_by_id[s_id]),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/snapshot/evaluator.py", line 300, in <lambda>
    lambda s: self._create_snapshot(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/snapshot/evaluator.py", line 716, in _create_snapshot
    evaluation_strategy.create(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/snapshot/evaluator.py", line 1618, in create
    self.adapter.create_view(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/engine_adapter/shared.py", line 302, in internal_wrapper
    return func(*list_args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/engine_adapter/base.py", line 997, in create_view
    self.execute(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/engine_adapter/base.py", line 2019, in execute
    self._execute(sql, **kwargs)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh_enterprise/engine_adapter/bigquery.py", line 20, in _execute
    super()._execute(sql, **kwargs)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/engine_adapter/bigquery.py", line 845, in _execute
    results = self._db_call(
              ^^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/engine_adapter/bigquery.py", line 803, in _db_call
    return func(
           ^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/cloud/bigquery/job/query.py", line 1676, in result
    while not is_job_done():
              ^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/google/cloud/bigquery/job/query.py", line 1625, in is_job_done
    raise job_failed_exception
google.api_core.exceptions.BadRequest: 400 Unrecognized name: fake_column at [1:260]; reason: invalidQuery, location: query, message: Unrecognized name: fake_column at [1:260]

Location: US
Job ID: 425f534f-62ed-4617-8a66-f792dc44eade


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/context.py", line 1288, in apply
    self._apply(plan, circuit_breaker)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh_enterprise/context.py", line 98, in _apply
    raise e
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh_enterprise/context.py", line 95, in _apply
    super()._apply(plan, circuit_breaker)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/context.py", line 1866, in _apply
    self._scheduler.create_plan_evaluator(self).evaluate(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/plan/evaluator.py", line 115, in evaluate
    self._push(plan, snapshots, deployability_index_for_creation)
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/plan/evaluator.py", line 211, in _push
    self.snapshot_evaluator.create(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/core/snapshot/evaluator.py", line 298, in create
    concurrent_apply_to_snapshots(
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/utils/concurrency.py", line 163, in concurrent_apply_to_snapshots
    return concurrent_apply_to_dag(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/utils/concurrency.py", line 205, in concurrent_apply_to_dag
    ).run()
      ^^^^^
  File "/Users/sung/Desktop/git_repos/tobiko-cloud-demo/.venv/lib/python3.12/site-packages/sqlmesh/utils/concurrency.py", line 64, in run
    self._finished_future.result()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
sqlmesh.utils.concurrency.NodeExecutionFailedError: Execution failed for node SnapshotId<"sqlmesh-public-demo"."tcloud_demo"."stg_payments": 633964984>
 (context.py:1296)
Error: Failed processing SnapshotId<"sqlmesh-public-demo"."tcloud_demo"."stg_payments": 633964984>. 400 Unrecognized name: fake_column at [1:260]; reason: invalidQuery, location: query, message: Unrecognized name: fake_column at [1:260]

Location: US
Job ID: 425f534f-62ed-4617-8a66-f792dc44eade
MODEL (
  name tcloud_demo.stg_payments,
  cron '@daily',
  grain payment_id,
  audits (UNIQUE_VALUES(columns = (
    payment_id
  )), NOT_NULL(columns = (
    payment_id
  )))
);

SELECT
  id AS payment_id,
  order_id,
  payment_method,
  amount / 100 AS amount, /* `amount` is currently stored in cents, so we convert it to dollars */
  'example_column' AS example_column, /* advanced change categorization example */
  -- 'new_column' AS new_column_from_sung, /* non-breaking change example  */
  fake_column

FROM tcloud_demo.seed_raw_payments
MODEL (
  name tcloud_demo.seed_raw_payments,
  kind SEED (
    path '../seeds/raw_payments.csv'
  ),
  columns (
    id INT64,
    order_id INT64,
    payment_method STRING(50),
    amount INT64
  ),
  grain (id, user_id)
)
@sungchun12 sungchun12 added the Feature Adds new functionality label Oct 31, 2024
@sungchun12 sungchun12 changed the title Better compile time checks if the column exists or not for both Better compile time checks if the column exists or not Oct 31, 2024
@sungchun12 sungchun12 self-assigned this Nov 4, 2024
@tobymao
Copy link
Contributor

tobymao commented Dec 19, 2024

this already exists as a warning, all that is needed here is to raise an error if the optimizer breaks, cc @VaggelisD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Adds new functionality
Projects
None yet
Development

No branches or pull requests

2 participants