
[GLUTEN-6995][Core] Limit soft affinity duplicate reading detection max cache items #7003

Merged — 2 commits merged into apache:main on Aug 28, 2024

Conversation

@zhli1142015 (Contributor) commented on Aug 26, 2024

What changes were proposed in this pull request?

Limit the maximum number of cache items kept by the soft affinity duplicate reading detection, and disable duplicate reading detection by default.

Fixes: #6995

How was this patch tested?

UT.
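
For illustration, here is a minimal sketch of the idea behind the change — bounding the detection state with a size-limited cache so the driver cannot accumulate entries without limit. This is not the actual Gluten code; all names and the limit value are illustrative assumptions.

```scala
// Sketch only (not the actual Gluten implementation): bound the duplicate-reading
// detection state with a size-limited cache so the driver memory stays bounded.
import com.google.common.cache.{Cache, CacheBuilder}

object DuplicateReadingCacheSketch {
  // Hypothetical limit; in Gluten this would come from a configuration entry.
  val maxCacheItems: Long = 100000L

  // Once maxCacheItems is reached, older entries are evicted instead of growing
  // the driver's memory footprint indefinitely.
  val detectedReads: Cache[String, java.lang.Long] =
    CacheBuilder.newBuilder()
      .maximumSize(maxCacheItems)
      .build[String, java.lang.Long]()

  def record(filePartitionKey: String, stageId: Long): Unit =
    detectedReads.put(filePartitionKey, stageId)
}
```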

@github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels on Aug 26, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


Run Gluten Clickhouse CI

@zhouyuan changed the title from "[Core] Limit soft affinity duplicate reading detection max cache items" to "[GLUTEN-6995][Core] Limit soft affinity duplicate reading detection max cache items" on Aug 26, 2024

#6995

@jackylee-ch (Contributor)

This patch doesn't fix the performance problem for SoftAffinityListener: Spark events can still be discarded, which may cause problems for users who also rely on listeners. We need a comment to explain this, so that users know the possible problems with this configuration when using it. BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

@zhli1142015 (Contributor, Author)

> This patch doesn't fix the performance problem for SoftAffinityListener: Spark events can still be discarded, which may cause problems for users who also rely on listeners. We need a comment to explain this, so that users know the possible problems with this configuration when using it. BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?

@jackylee-ch (Contributor)

> Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?

This is how I reproduced the problem locally.

  1. Start spark-sql with --conf spark.hadoop.parquet.page.size=1024 --conf spark.hadoop.parquet.block.size=2048 --conf spark.sql.files.maxPartitionBytes=2048
  2. Then run the SQLs below.
     create table test(a string) using parquet;
     create table test1(a string) using parquet;
     insert into test values(0); -- make sure there are 10000 values in it
     insert into test1 select /*+ REPARTITION(10000) */ * from test;
  3. Finally, restart spark-sql with SoftAffinity enabled and run the SQL below.
     select count(*) from test1;

@zhli1142015 (Contributor, Author)

> Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?
>
> This is how I reproduced the problem locally.
>
> 1. Start spark-sql with --conf spark.hadoop.parquet.page.size=1024 --conf spark.hadoop.parquet.block.size=2048 --conf spark.sql.files.maxPartitionBytes=2048
> 2. Then run the SQLs below.
>    create table test(a string) using parquet;
>    create table test1(a string) using parquet;
>    insert into test values(0); -- make sure there are 10000 values in it
>    insert into test1 select /*+ REPARTITION(10000) */ * from test;
> 3. Finally, restart spark-sql with SoftAffinity enabled and run the SQL below.
>    select count(*) from test1;

With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing. From the code, there are only cache get/put operations in the event handling logic; I think the memory pressure (GC) is more likely the cause.

(screenshots: timing results for 10K values and for 120K values)
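
For context, a rough illustration (not the actual SoftAffinityListener source) of what "only cache get/put operations in the event handling logic" means: each finished task triggers a small cache update on the driver's listener bus, so the per-event work is cheap, but the cached state itself keeps growing. All names below are assumptions.

```scala
// Illustrative sketch: a listener that does one cheap put per finished task.
// The individual puts are fast; the memory held by the map is where driver
// memory / GC pressure can come from when there are tens of thousands of tasks.
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class DuplicateReadingListenerSketch extends SparkListener {
  // Hypothetical cache; a real implementation would key this by file partition.
  private val cache = new ConcurrentHashMap[String, Long]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val key = s"${taskEnd.stageId}-${taskEnd.taskInfo.index}"
    cache.put(key, taskEnd.taskInfo.duration)
  }
}
```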

@jackylee-ch (Contributor) commented on Aug 26, 2024

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.

Did you finally get 10000+ FilePartitions for the SQL, or only one partition? Oh, there is one more piece of information, sorry I missed it: you need to run select count(*) from test 30 times in one application.
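
For reference, one way (an assumption, not taken from the PR discussion) to drive the query repeatedly within a single application, e.g. from spark-shell, so the listener-side state keeps accumulating across queries:

```scala
// Run the repro query 30 times in one application; `spark` is the session
// provided by spark-shell. Table name follows the repro steps above.
(1 to 30).foreach { i =>
  spark.sql("select count(*) from test1").collect()
  println(s"run $i finished")
}
```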


Run Gluten Clickhouse CI

@zhli1142015 (Contributor, Author)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?

6k+ partitions for 10K values, and 60k+ partitions for 120k values.
(screenshots: Spark UI showing the partition counts)
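
An illustrative way (not from the PR) to confirm how many scan partitions the table produces, runnable in spark-shell:

```scala
// Number of partitions the scan of test1 produces; `spark` comes from spark-shell.
val scanPartitions = spark.table("test1").rdd.getNumPartitions
println(s"test1 scan partitions: $scanPartitions")
```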

@jackylee-ch (Contributor)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?
>
> 6k+ partitions for 10K values, and 60k+ partitions for 120k values.

Great, could you try 30 SQLs in one application to check if you can reproduce?

@zhli1142015 (Contributor, Author)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?
>
> 6k+ partitions for 10K values, and 60k+ partitions for 120k values.
>
> Great, could you try 30 SQLs in one application to check if you can reproduce?

OOM happens with 120K values, but there are no event-dropped logs.

(screenshot)

@zhli1142015 (Contributor, Author)

The OOM is not related; it still happens even when SA is disabled.

@jackylee-ch (Contributor)

> OOM happens with 120K values, but there are no event-dropped logs.

Hm, perhaps it is related to hardware performance. I tried ./bin/spark-sql --master local[192] with the default Gluten configs and the above SQLs; the problem can be stably reproduced in our environment.

(screenshot)

> BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

Since this PR already disables duplicate reading detection by default, it's okay with me now.
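
For users who still want the feature after this change, opting back in would look roughly like the sketch below. The configuration keys are assumptions for illustration only; check the Gluten configuration docs for the exact names.

```scala
// Hypothetical opt-in; the keys below are NOT confirmed by this PR.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.gluten.soft-affinity.enabled", "true")                         // assumed key
  .set("spark.gluten.soft-affinity.duplicateReadingDetect.enabled", "true")  // assumed key
```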

@jackylee-ch (Contributor) left a review comment:

Thanks for your fix.

@zhli1142015 merged commit edaf88a into apache:main on Aug 28, 2024
43 checks passed
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
…ax cache items (apache#7003)

* [Core] Limit soft affinity duplicate reading detection max cache items

* disable duplicate_reading by default
Successfully merging this pull request may close these issues.

[VL] Driver memory leak with soft affinity duplicate reading