
[ADAP-873] [Regression] 1.6 does not work with method: thrift due to pyhive's lack of Cursor.fetchmany() method #885

Closed
dataders opened this issue Sep 6, 2023 · 4 comments
Labels
bug, regression, Stale

Comments

dataders commented Sep 6, 2023

Is this a regression in a recent version of dbt-spark?

  • I believe this is a regression in dbt-spark functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

Reports & discussion

@sid-deshmukh originally opened dbt-labs/dbt-external-tables#234, but I believe this issue to be with dbt-spark, not dbt-external-tables.

@timvw and @jelstongreen also reported in a #db-databricks-and-spark thread in Community Slack that they were experiencing similar issues.

For reference, here's our internal dbt Labs Slack thread.

Stacktrace

Compiling fails with the following stacktrace. dbt calls .get_result_from_cursor(), which calls cursor.fetchall(), which in PyHive delegates to its Cursor._fetch_more() (pyhive/hive.py#L507), where it fails.

columns = [_unwrap_column(col, col_schema[1]) for col, col_schema in
           zip(response.results.columns, schema)]
full stacktrace
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 302, in exception_handler
    yield
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/clients/jinja.py", line 257, in call_macro
    return macro(*args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 52, in macro
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/base/impl.py", line 290, in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch, limit=limit)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/sql/connections.py", line 149, in execute
    table = self.get_result_from_cursor(cursor, limit)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/sql/connections.py", line 129, in get_result_from_cursor
    rows = cursor.fetchall()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/dbt/adapters/spark/connections.py", line 197, in fetchall
    return self._cursor.fetchall()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 137, in fetchall
    return list(iter(self.fetchone, None))
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 106, in fetchone
    self._fetch_while(lambda: not self._data and self._state !=
                      self._STATE_FINISHED)
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/common.py", line 46, in _fetch_while
    self._fetch_more()
  File "/Users/user/PycharmProjects/dbt-data-pipeline/venv/lib/python3.8/site-packages/pyhive/hive.py", line 481, in _fetch_more
    zip(response.results.columns, schema)]
TypeError: 'NoneType' object is not iterable
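The failing PyHive line zips `response.results.columns` against the cursor's schema; when the server returns no result set, one of those is `None` and `zip()` raises the `TypeError` above. A minimal sketch of the failure mode and a defensive guard (the function names and `SimpleNamespace` stand-ins are illustrative, modeled on pyhive's `_fetch_more`, not the actual code):

```python
from types import SimpleNamespace


def unwrap_columns(response, schema):
    # Mirrors pyhive's failing line:
    #   columns = [_unwrap_column(col, col_schema[1]) for col, col_schema in
    #              zip(response.results.columns, schema)]
    # If either zip() argument is None, this raises
    # TypeError: 'NoneType' object is not iterable.
    return [(col, col_schema)
            for col, col_schema in zip(response.results.columns, schema)]


def unwrap_columns_guarded(response, schema):
    # Defensive variant: treat a missing column set or a missing
    # schema as an empty result set instead of raising.
    results = getattr(response, "results", None)
    if results is None or results.columns is None or schema is None:
        return []
    return [(col, col_schema)
            for col, col_schema in zip(results.columns, schema)]


# An empty result set (e.g. DDL like "refresh table"): the Thrift
# response carries no columns and the cursor has no description.
empty = SimpleNamespace(results=SimpleNamespace(columns=None))
```

The guarded variant returns an empty list for `empty` where the original raises, which is the shape of the fix discussed further down the thread.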

Expected/Previous Behavior

Things work (ostensibly because pyhive's cursor.fetch() does not invoke ._fetch_more() the way .fetchmany() does).
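Since the title attributes the regression to a missing DB-API `fetchmany()` on the cursor wrapper, here is a hedged sketch of how such a method can be shimmed on top of `fetchone()` (the `CursorShim` class is illustrative, not dbt-spark's actual wrapper):

```python
class CursorShim:
    """Illustrative cursor wrapper providing a DB-API 2.0 fetchmany()
    built on fetchone(), for drivers/wrappers that lack one."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.arraysize = 1  # DB-API default batch size

    def fetchone(self):
        # Return the next row, or None when exhausted (per DB-API).
        return next(self._rows, None)

    def fetchmany(self, size=None):
        # Collect up to `size` rows, stopping early at exhaustion.
        size = size or self.arraysize
        out = []
        for _ in range(size):
            row = self.fetchone()
            if row is None:
                break
            out.append(row)
        return out

    def fetchall(self):
        # Drain the remaining rows.
        return list(iter(self.fetchone, None))
```

Note that a shim like this only papers over the missing method; it does not fix the `NoneType` failure inside PyHive's `_fetch_more()` for empty result sets.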

Steps To Reproduce

  1. dbt-spark 1.6.0
  2. using method: thrift
  3. doing any sort of jinja compilation (which is almost anything)
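For reference, a minimal Thrift profile of the kind that would exercise this path (profile name, host, and port are placeholder values; field names follow the dbt-spark profile documentation):

```yaml
# profiles.yml (illustrative values)
spark_thrift:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost   # placeholder
      port: 10000       # common Thrift server default
      schema: default
```

Any command that compiles Jinja against such a profile (e.g. `dbt compile`) would then hit the failing code path.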

Relevant log output

No response

Environment

- OS:
- Python:
- dbt-core (working version):
- dbt-spark (working version):
- dbt-core (regression version):
- dbt-spark (regression version):

Additional Context

Preventing this problem from ever happening again could be addressed by dbt-labs/dbt-core#8471.

@dataders dataders added bug Something isn't working triage regression labels Sep 6, 2023
@github-actions github-actions bot changed the title [Regression] 1.6 does not work with method: thrift due to pyhive's lack of Cursor.fetchmany() method [ADAP-873] [Regression] 1.6 does not work with method: thrift due to pyhive's lack of Cursor.fetchmany() method Sep 6, 2023
@dataders dataders removed the triage label Sep 7, 2023
@timvw

timvw commented Sep 19, 2023

I have seen this happen with sparksession as well when using the "show" command...

@lmarcondes

Not sure if there's still interest in this, but looking into the PyHive code, it doesn't seem to handle queries with empty result sets correctly. I've forked and issued a PR here, but it seems the library has been largely unmaintained for a few years now.

With the changes, Jinja is able to compile and results are correctly received:

❯ dbt run-operation stage_external_sources --log-level debug --print
01:23:03  Running with dbt=1.6.7
01:23:03  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'profiles_dir': '/home/lmarcondes/.dbt', 'fail_fast': 'True', 'warn_error': 'True', 'log_path': '/home/lmarcondes/Documents/projects/votacao-2022/src/capivara-etl-models/capivara/logs', 'debug': 'False', 'version_check': 'True', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'static_parser': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'introspect': 'True', 'target_path': 'None', 'invocation_command': 'dbt run-operation stage_external_sources --log-level debug --print', 'send_anonymous_usage_stats': 'False'}
01:23:03  Registered adapter: spark=1.6.0
01:23:03  checksum: a051d2bc88277f3be74306f0393e0e8e6f29724fe11a36c13ebfccd4b87560d8, vars: {}, profile: , target: , version: 1.6.7
01:23:03  Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
01:23:03  Partial parsing enabled, no changes found, skipping parsing
01:23:03  Found 1 model, 5 sources, 0 exposures, 0 metrics, 557 macros, 0 groups, 0 semantic models
01:23:03  Acquiring new spark connection 'macro_stage_external_sources'
01:23:03  Spark adapter: NotImplemented: add_begin_query
01:23:03  Spark adapter: NotImplemented: commit
01:23:03  1 of 5 START external source default.caged_for
01:23:03  On "macro_stage_external_sources": cache miss for schema ".default", this is inefficient
01:23:03  Using spark connection "macro_stage_external_sources"
01:23:03  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */
show table extended in default like '*'
  
01:23:03  Opening a new connection, currently in state init
01:23:03  Spark adapter: Poll status: 2, query complete
01:23:03  SQL status: OK in 0.0 seconds
01:23:03  While listing relations in database=, schema=default, found: caged_exc, caged_for, caged_mov, links_2o_turno
01:23:03  1 of 5 (1) refresh table default.caged_for
01:23:03  Using spark connection "macro_stage_external_sources"
01:23:03  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */

                 
        refresh table default.caged_for
    
            
01:23:08  Spark adapter: Poll status: 1, sleeping
01:23:13  Spark adapter: Poll status: 1, sleeping
01:23:18  Spark adapter: Poll status: 1, sleeping
01:23:23  Spark adapter: Poll status: 1, sleeping
01:23:28  Spark adapter: Poll status: 1, sleeping
01:23:33  Spark adapter: Poll status: 1, sleeping
01:23:38  Spark adapter: Poll status: 1, sleeping
01:23:43  Spark adapter: Poll status: 1, sleeping
01:23:48  Spark adapter: Poll status: 1, sleeping
01:23:53  Spark adapter: Poll status: 1, sleeping
01:23:58  Spark adapter: Poll status: 1, sleeping
01:24:03  Spark adapter: Poll status: 1, sleeping
01:24:08  Spark adapter: Poll status: 1, sleeping
01:24:12  Spark adapter: Poll status: 2, query complete
01:24:12  SQL status: OK in 69.0 seconds
01:24:12  1 of 5 (1) OK
01:24:12  2 of 5 START external source default.caged_mov
01:24:12  2 of 5 (1) refresh table default.caged_mov
01:24:12  Using spark connection "macro_stage_external_sources"
01:24:12  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.6.7", "profile_name": "capivara", "target_name": "local", "connection_name": "macro_stage_external_sources"} */


This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 30, 2024

github-actions bot commented May 7, 2024

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
