
[SPARK-50310][PYTHON] Add a flag to disable DataFrameQueryContext for PySpark #48964

Open

wants to merge 2 commits into master
Conversation

@itholic itholic commented Nov 26, 2024

What changes were proposed in this pull request?

#48827 added an option to disable DataFrameQueryContext on the JVM side, and we also need to add a flag for PySpark for the same performance reason.

Why are the changes needed?

To avoid the performance slowdown that occurs when DataFrameQueryContext collects too many stack traces.

Does this PR introduce any user-facing change?

No API changes, but the DataFrameQueryContext is no longer displayed when the flag is disabled.

How was this patch tested?

Manually tested:

  1. FLAG ON (almost 25 sec)
>>> spark.conf.get("spark.sql.dataFrameDebugging.enabled")
'true'
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...   _ = c.alias("a")
...
>>> print(time.time() - start)
24.78217577934265
  2. FLAG OFF (only 1 sec)
>>> spark.conf.get("spark.sql.dataFrameDebugging.enabled")
'false'
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...   _ = c.alias("a")
...
>>> print(time.time() - start)
1.0222370624542236
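
For illustration, the gap between the two runs above is consistent with the cost of capturing a Python stack trace on every Column operation. A minimal standalone sketch (plain Python, no Spark required; the loop and `N` are hypothetical stand-ins for the benchmark above, not PySpark's actual code):

```python
import time
import traceback

N = 10_000

# "Flag off": a no-op loop, no stack capture.
start = time.time()
for _ in range(N):
    pass
no_capture = time.time() - start

# "Flag on": capture the current call stack on every iteration,
# which is roughly what DataFrameQueryContext does per operation.
start = time.time()
for _ in range(N):
    traceback.extract_stack()
with_capture = time.time() - start

print(f"without capture: {no_capture:.4f}s, with capture: {with_capture:.4f}s")
```

The stack capture dominates the runtime, which is why skipping it entirely behind a flag recovers most of the time.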

Was this patch authored or co-authored using generative AI tooling?

@itholic itholic changed the title [SPARK-50310][PYTHON] Add a flag to disable DataFrameQueryContext for PySpark [SPARK-50310][PYTHON] Apply a flag to disable DataFrameQueryContext for PySpark Nov 26, 2024
if spark is not None:
    _enable_debugging_cache = (
        spark.conf.get(
            "spark.sql.dataFrameQueryContext.enabled", "true"  # type: ignore[union-attr]
Member

spark.sql.dataFrameQueryContext.enabled has to be StaticSQLConf

Contributor Author

I see. Should we update the existing conf to a static conf, or should we add a new PySpark-specific one?

The existing conf is currently defined as a (runtime) SQLConf below:

val DATA_FRAME_QUERY_CONTEXT_ENABLED = buildConf("spark.sql.dataFrameQueryContext.enabled")
  .internal()
  .doc(
    "Enable the DataFrame query context. This feature is enabled by default, but has a " +
    "non-trivial performance overhead because of the stack trace collection.")
  .version("4.0.0")
  .booleanConf
  .createWithDefault(true)

Member

I think so

@xinrong-meng
Member

A comment on the PR description: #48827 doesn't seem to disable DataFrameQueryContext, if I understand correctly :)

@itholic
Contributor Author

itholic commented Nov 27, 2024

A comment on the PR description: #48827 doesn't seem to disable DataFrameQueryContext, if I understand correctly :)

Right, it doesn't disable DataFrameQueryContext by itself; it just provides PySpark users with an option to disable it. The default behavior still collects stack traces. Please let me know if I'm not getting the context clearly.

@itholic itholic changed the title [SPARK-50310][PYTHON] Apply a flag to disable DataFrameQueryContext for PySpark [SPARK-50310][PYTHON] Add a flag to disable DataFrameQueryContext for PySpark Nov 28, 2024
@github-actions github-actions bot added the SQL label Nov 28, 2024
Comment on lines +299 to +300
val DATA_FRAME_DEBUGGING_ENABLED =
  buildStaticConf("spark.sql.dataFrameDebugging.enabled")
Contributor Author

We added a separate static conf for PySpark, since the behavior cannot be changed from the Python side once the config has been read. Also cc @HyukjinKwon @vladimirg-db

3 participants