OutOfMemorySparkException only when including a validationIndicatorCol - LightGBMClassifier
System information
Language version: Python 3.10.12, Scala 2.12
Spark version: 3.4.1
Spark platform: Databricks Runtime 13.3 LTS
Describe the problem
I have a dataset written to S3 that was created using pyspark.ml's OneHotEncoder and VectorAssembler, so it has 3 columns (a rough schema sketch follows this list):
Features: a sparse vector representation of my original features. The features were originally a mix of numerical and categorical; the categorical ones were one-hot encoded.
Target column (binary)
EvalFlag (boolean)
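For reference, the schema of the materialized dataset looks roughly like this when read back (the path matches the optional write in the repro code below; the exact type of the target column is illustrative and may be double rather than integer):
df_check = spark.read.parquet("s3pathdftransformed.pq")
df_check.printSchema()
# root
#  |-- features: vector (nullable = true)
#  |-- target: integer (nullable = true)    # binary label; could also be double
#  |-- EvalFlag: boolean (nullable = true)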
When I don't set a validationIndicatorCol in the classifier's constructor, training succeeds. However, as soon as I set validationIndicatorCol='EvalFlag' in the constructor, I get the error: org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (60.0 GiB). The average row size was 626.0 B, with 29.0 GiB used for temporary buffers. [shuffleId: None]
Note that I'd already increased spark.driver.maxResultSize from its default (which I gather is much smaller) to 60 GB. Even at the default value, training without the evaluation flag worked just fine.
So something about including an evaluation set massively increases the requirements on spark.driver.maxResultSize.
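For reference, this is roughly how the limit was raised (a sketch; on Databricks the setting normally goes into the cluster's Spark config, since spark.driver.maxResultSize must be set before the driver starts):
from pyspark.sql import SparkSession

# Raise the driver result-size limit; "60g" is simply the value used in this report.
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "60g")
    .getOrCreate()
)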
Code to reproduce issue
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, VectorAssembler
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMClassifier

df_raw = spark.read.parquet("s3path.pq")

categorical_columns = ["x", "y", "z", ...]
dense_columns = ['a', 'b', 'c', ...]

# df_raw on disk has `categorical_columns`, `dense_columns` and `'target'` as columns
ohe_stage = [OneHotEncoder(inputCol=col, outputCol=f"{col}_ohe") for col in categorical_columns]
ohe_output_columns = [f"{col}_ohe" for col in categorical_columns]

assembler = VectorAssembler(
    inputCols=dense_columns + ohe_output_columns,
    outputCol="features",
    handleInvalid='keep'
)

pipeline = Pipeline(stages=ohe_stage + [assembler])
pipeline_model = pipeline.fit(df_raw)

df_transformed = (pipeline_model.transform(df_raw)
    .withColumn("EvalFlag", F.rand())
    .withColumn("EvalFlag", F.col("EvalFlag") < 0.1)
)

# Optionally either write df_transformed to S3 to materialize the transformation before training
# and then read it back in, or just pass df_transformed to training; it doesn't change the outcome.
# df_transformed.write.parquet('s3pathdftransformed.pq')
# df_transformed = spark.read.parquet('s3pathdftransformed.pq')

lightgbm = LightGBMClassifier(
    featuresCol="features",
    labelCol="target",
    categoricalSlotNames=categorical_columns,
    maxDepth=10,
    numLeaves=200,
    learningRate=0.1,
    numIterations=500,
    earlyStoppingRound=10,
    validationIndicatorCol='EvalFlag',  # toggle this line on and off
    verbosity=1
)

lightgbm_model = lightgbm.fit(df_transformed)
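As a quick sanity check on the split (a hypothetical diagnostic, not output from the failing run), the EvalFlag column should flag roughly 10% of the rows given the F.rand() < 0.1 threshold above:
# Count training vs. validation rows produced by the random split.
df_transformed.groupBy("EvalFlag").count().show()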
Other info / logs
No response
What component(s) does this bug affect?
area/cognitive: Cognitive project
area/core: Core project
area/deep-learning: DeepLearning project
area/lightgbm: Lightgbm project
area/opencv: Opencv project
area/vw: VW project
area/website: Website
area/build: Project build system
area/notebooks: Samples under notebooks folder
area/docker: Docker usage
area/models: models related issue
What language(s) does this bug affect?
language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations