[Bug] : Intermittent Inconsistency in get_relation Function #366

Open · 2 tasks done
prashant462 opened this issue Dec 2, 2024 · 0 comments
Labels
bug (Something isn't working), triage

Comments


prashant462 commented Dec 2, 2024

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I am using dbt-spark in my projects.
The get_relation function in the dbt adapter intermittently fails to find relations that exist in the database. The issue surfaces indirectly through functionality that relies on get_relation to check whether a relation exists, such as the incremental materialization, the get_columns_in_relation macro, and others.

As per my understanding, the get_relation function in the dbt adapter works by calling list_relations to fetch all relations in a given database and schema. If the schema is already cached, list_relations returns the list of relations from the cache instead of querying the database. However, if a new relation (e.g., a table or view) is created after the schema cache is populated, the cache is not automatically refreshed. The relevant adapter code is shown below:

    @available.parse_none
    def get_relation(self, database: str, schema: str, identifier: str) -> Optional[BaseRelation]:
        relations_list = self.list_relations(database, schema)

        matches = self._make_match(relations_list, database, schema, identifier)

        if len(matches) > 1:
            kwargs = {
                "identifier": identifier,
                "schema": schema,
                "database": database,
            }
            raise RelationReturnedMultipleResultsError(kwargs, matches)

        elif matches:
            return matches[0]

        return None

    def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelation]:
        if self._schema_is_cached(database, schema):
            return self.cache.get_relations(database, schema)

        schema_relation = self.Relation.create(
            database=database,
            schema=schema,
            identifier="",
            quote_policy=self.config.quoting,
        ).without_identifier()

        # we can't build the relations cache because we don't have a
        # manifest so we can't run any operations.
        relations = self.list_relations_without_caching(schema_relation)

        # if the cache is already populated, add this schema in
        # otherwise, skip updating the cache and just ignore
        if self.cache:
            for relation in relations:
                self.cache.add(relation)
            if not relations:
                # it's possible that there were no relations in some schemas. We want
                # to insert the schemas we query into the cache's `.schemas` attribute
                # so we can check it later
                self.cache.update_schemas([(database, schema)])

        fire_event(
            ListRelations(
                database=cast_to_str(database),
                schema=schema,
                relations=[_make_ref_key_dict(x) for x in relations],
            )
        )

        return relations

Consequently, get_relation may return None for an existing relation that is not present in the outdated cache. This behavior indirectly affects dbt macros such as incremental and get_columns_in_relation, which rely on get_relation to check for the existence of relations. As a result, these macros may intermittently fail or behave unexpectedly, depending on whether the cache is outdated at the time the macro executes.
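
To make the failure mode concrete, below is a minimal, self-contained sketch of the behavior I am describing. The classes and names (FakeWarehouse, FakeAdapter, "db", "analytics", "new_table") are illustrative stand-ins, not dbt-spark's actual implementation; they only mimic the per-schema caching that list_relations performs:

    # Illustrative stand-ins for the warehouse catalog and the adapter cache;
    # these are not dbt classes, they only mimic the caching shown above.
    from typing import Dict, List, Optional, Set, Tuple

    class FakeWarehouse:
        """Stands in for the Spark catalog."""
        def __init__(self) -> None:
            self.tables: Dict[Tuple[str, str], Set[str]] = {}

        def create_table(self, database: str, schema: str, name: str) -> None:
            self.tables.setdefault((database, schema), set()).add(name)

        def list_tables(self, database: str, schema: str) -> List[str]:
            return sorted(self.tables.get((database, schema), set()))

    class FakeAdapter:
        """Caches each schema's relation list the first time it is requested."""
        def __init__(self, warehouse: FakeWarehouse) -> None:
            self.warehouse = warehouse
            self.cache: Dict[Tuple[str, str], List[str]] = {}

        def list_relations(self, database: str, schema: str) -> List[str]:
            key = (database, schema)
            if key in self.cache:                    # cache hit: no fresh query
                return self.cache[key]
            relations = self.warehouse.list_tables(database, schema)
            self.cache[key] = relations              # cache is populated once
            return relations

        def get_relation(self, database: str, schema: str, identifier: str) -> Optional[str]:
            matches = [r for r in self.list_relations(database, schema) if r == identifier]
            return matches[0] if matches else None

    wh = FakeWarehouse()
    adapter = FakeAdapter(wh)

    adapter.list_relations("db", "analytics")        # cache populated while schema is empty
    wh.create_table("db", "analytics", "new_table")  # relation created *after* caching

    # The relation exists in the warehouse, but the stale cache hides it:
    print(adapter.get_relation("db", "analytics", "new_table"))  # -> None

On a retry with a fresh process (and therefore an empty cache), the same lookup succeeds, which matches the intermittent behavior I am seeing.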

If my understanding is incorrect, please clarify how caching and get_relation are supposed to work.

Expected Behavior

To reliably validate the existence of a relation, the lookup could include the following stages (a rough sketch follows this list):

  • Check the Cache: First, check if the relation exists in the list of relations for the cached schema (if the schema is already cached).

  • Fallback to Fresh Query: If the relation is not found in the cache, perform a fresh query to fetch the list of relations in the schema without relying on the cache. This step accounts for cases where the relation might have been created after the schema cache was populated.

  • Conclude Non-Existence: If the relation is not found in both the cached and freshly queried lists, conclude that the relation does not exist.
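
A rough sketch of that lookup order, reusing the illustrative FakeAdapter from the example above (FallbackAdapter and refresh_schema are hypothetical names for this sketch, not an existing dbt API):

    from typing import Optional

    class FallbackAdapter(FakeAdapter):
        def refresh_schema(self, database: str, schema: str) -> None:
            # Hypothetical helper: drop the cached entry so the next
            # list_relations call queries the warehouse again.
            self.cache.pop((database, schema), None)

        def get_relation(self, database: str, schema: str, identifier: str) -> Optional[str]:
            # 1. Check the cache (list_relations returns cached data if present).
            if identifier in self.list_relations(database, schema):
                return identifier

            # 2. Fallback to a fresh query: the relation may have been created
            #    after the schema cache was populated.
            self.refresh_schema(database, schema)
            if identifier in self.list_relations(database, schema):
                return identifier

            # 3. Not in the cache and not in a fresh listing: it does not exist.
            return None

With this order, the reproduction above would find new_table on the fallback query instead of returning None, while schemas whose cache is already accurate never trigger the extra query.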

Steps To Reproduce

While the issue is intermittent, it can occur under the following conditions:

Run a dbt operation that internally calls get_relation for a schema (e.g., an incremental model, or a macro like get_columns_in_relation).

A screenshot of the error is attached below; the same run succeeded on the next retry, even though the relation already existed.

[Screenshot: get_relation error output, 2024-12-02 10:27 AM]

Relevant log output

No response

Environment

- Python: 3.9.6
- dbt-core: 1.7.4
- dbt-spark: 1.7.1

Additional Context

No response
