[Bug] : Intermittent Inconsistency in get_relation Function #366

prashant462 · 2024-12-02T05:01:44Z

Is this a new bug?

I believe this is a new bug
I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I am using dbt-spark in my projects.
The get_relation function in the dbt adapter intermittently fails to find relations that exist in the database. This issue manifests indirectly when using dbt macros such as incremental, get_columns_in_relation, or others that rely on get_relation to check for the existence of a relation.

As per my understanding, the get_relation function in the dbt adapter works by calling list_relations to fetch all relations in a schema or database. If the schema is already cached, list_relations retrieves the list of relations from the cache instead of querying the database. However, if a new relation (e.g., a table or view) is created after the schema cache is populated, the cache is not automatically refreshed.

    @available.parse_none
    def get_relation(self, database: str, schema: str, identifier: str) -> Optional[BaseRelation]:
        relations_list = self.list_relations(database, schema)

        matches = self._make_match(relations_list, database, schema, identifier)

        if len(matches) > 1:
            kwargs = {
                "identifier": identifier,
                "schema": schema,
                "database": database,
            }
            raise RelationReturnedMultipleResultsError(kwargs, matches)

        elif matches:
            return matches[0]

        return None

    def list_relations(self, database: Optional[str], schema: str) -> List[BaseRelation]:
        if self._schema_is_cached(database, schema):
            return self.cache.get_relations(database, schema)

        schema_relation = self.Relation.create(
            database=database,
            schema=schema,
            identifier="",
            quote_policy=self.config.quoting,
        ).without_identifier()

        # we can't build the relations cache because we don't have a
        # manifest so we can't run any operations.
        relations = self.list_relations_without_caching(schema_relation)

        # if the cache is already populated, add this schema in
        # otherwise, skip updating the cache and just ignore
        if self.cache:
            for relation in relations:
                self.cache.add(relation)
            if not relations:
                # it's possible that there were no relations in some schemas. We want
                # to insert the schemas we query into the cache's `.schemas` attribute
                # so we can check it later
                self.cache.update_schemas([(database, schema)])

        fire_event(
            ListRelations(
                database=cast_to_str(database),
                schema=schema,
                relations=[_make_ref_key_dict(x) for x in relations],
            )
        )

        return relations

Consequently, get_relation may return None for an existing relation that is not present in the outdated cache. This behavior indirectly affects dbt macros such as incremental and get_columns_in_relation, which rely on get_relation to check for the existence of relations. As a result, these macros may intermittently fail or behave unexpectedly, depending on whether the cache is outdated at the time the macro executes.
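To make the suspected failure mode concrete, here is a minimal toy model of it. This is not dbt's code: `ToyAdapter`, its `warehouse` dictionary, and the schema and relation names are all hypothetical stand-ins, and the sketch assumes the new relation is created by something that never updates the cache.

```python
from typing import Dict, Optional, Set


class ToyAdapter:
    """Hypothetical stand-in for an adapter with a per-schema relation cache."""

    def __init__(self, warehouse: Dict[str, Set[str]]) -> None:
        # schema name -> relation names that really exist right now
        self.warehouse = warehouse
        # schema name -> snapshot taken the first time the schema was listed
        self.cache: Dict[str, Set[str]] = {}

    def list_relations(self, schema: str) -> Set[str]:
        # Once a schema is cached, the warehouse is never asked again.
        if schema in self.cache:
            return self.cache[schema]
        relations = set(self.warehouse.get(schema, set()))  # copy = snapshot
        self.cache[schema] = relations
        return relations

    def get_relation(self, schema: str, identifier: str) -> Optional[str]:
        return identifier if identifier in self.list_relations(schema) else None


warehouse = {"analytics": {"orders"}}
adapter = ToyAdapter(warehouse)

# The first lookup populates the cache for the schema.
first = adapter.get_relation("analytics", "orders")

# Assumption: the relation is created without the cache being told about it.
warehouse["analytics"].add("customers")

# The relation exists in the warehouse, but the cached snapshot is stale.
stale = adapter.get_relation("analytics", "customers")
print(first, stale)
```

In this toy model the second lookup returns `None` although `customers` exists, which matches the symptom described above. Whether the real adapter reaches this state depends on how the relation was created and whether that path updates the cache.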

If my understanding is incorrect, please clarify how caching and get_relation are supposed to work.

Expected Behavior

To reliably validate the existence of a relation, the process could include the following stages:

1. Check the Cache: First, check if the relation exists in the list of relations for the cached schema (if the schema is already cached).
2. Fallback to Fresh Query: If the relation is not found in the cache, perform a fresh query to fetch the list of relations in the schema without relying on the cache. This step accounts for cases where the relation might have been created after the schema cache was populated.
3. Conclude Non-Existence: If the relation is found in neither the cached list nor the freshly queried list, conclude that the relation does not exist.
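The three stages above can be sketched roughly as follows. This is only an illustration of the proposal, not a patch against dbt: `find_relation`, the plain-dict `cache`, and the `list_without_cache` callable are hypothetical names standing in for the adapter's cache and its uncached listing query.

```python
from typing import Callable, Dict, List, Optional, Set


def find_relation(
    schema: str,
    identifier: str,
    cache: Dict[str, Set[str]],
    list_without_cache: Callable[[str], Set[str]],
) -> Optional[str]:
    """Cache-first lookup that re-queries the schema on a miss (hypothetical helper)."""
    # Stage 1: trust the cache only for a positive answer.
    cached = cache.get(schema)
    if cached is not None and identifier in cached:
        return identifier

    # Stage 2: on a miss, list the schema fresh and refresh the cached snapshot.
    fresh = set(list_without_cache(schema))
    cache[schema] = fresh

    # Stage 3: absent from both lists means the relation does not exist.
    return identifier if identifier in fresh else None


# Hypothetical data: the cache was filled before "customers" was created.
warehouse = {"analytics": {"orders", "customers"}}
cache = {"analytics": {"orders"}}
calls: List[str] = []


def list_without_cache(schema: str) -> Set[str]:
    calls.append(schema)  # record each fresh query for inspection
    return set(warehouse.get(schema, set()))


hit = find_relation("analytics", "orders", cache, list_without_cache)
refreshed = find_relation("analytics", "customers", cache, list_without_cache)
missing = find_relation("analytics", "refunds", cache, list_without_cache)
print(hit, refreshed, missing, len(calls))
```

The trade-off in this sketch is that every cache miss costs one extra listing query, including lookups for relations that genuinely do not exist yet (for example, the first run of an incremental model).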

Steps To Reproduce

The issue is intermittent, but it can occur under the following condition:

1. Run a dbt operation that internally calls get_relation (e.g., an incremental model, or a macro such as get_columns_in_relation) against a schema that is already cached.

The attached screenshot shows the error. The same run succeeded on the next retry, even though the relation already existed.

Relevant log output

No response

Environment

- Python: 3.9.6
- dbt-core: 1.7.4
- dbt-spark: 1.7.1

Additional Context

No response

prashant462 added the bug and triage labels on Dec 2, 2024
