
[GLUTEN-6995][Core] Limit soft affinity duplicate reading detection max cache items #7003

Merged — 2 commits merged into apache:main on Aug 28, 2024

Conversation

@zhli1142015 (Contributor) commented on Aug 26, 2024

What changes were proposed in this pull request?

Limit the maximum number of cache items kept by the soft affinity duplicate reading detection, and disable duplicate reading detection by default.

Fixes: #6995

How was this patch tested?

UT.
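
For illustration, here is a minimal sketch of the idea behind the change — bounding the detection state with a size-limited cache so the driver cannot accumulate entries without limit. This is not the actual Gluten code; all names and the limit value are illustrative assumptions.

```scala
// Sketch only (not the actual Gluten implementation): bound the duplicate-reading
// detection state with a size-limited cache so the driver memory stays bounded.
import com.google.common.cache.{Cache, CacheBuilder}

object DuplicateReadingCacheSketch {
  // Hypothetical limit; in Gluten this would come from a configuration entry.
  val maxCacheItems: Long = 100000L

  // Once maxCacheItems is reached, older entries are evicted instead of growing
  // the driver's memory footprint indefinitely.
  val detectedReads: Cache[String, java.lang.Long] =
    CacheBuilder.newBuilder()
      .maximumSize(maxCacheItems)
      .build[String, java.lang.Long]()

  def record(filePartitionKey: String, stageId: Long): Unit =
    detectedReads.put(filePartitionKey, stageId)
}
```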

@github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels on Aug 26, 2024

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


Run Gluten Clickhouse CI

@zhouyuan changed the title from "[Core] Limit soft affinity duplicate reading detection max cache items" to "[GLUTEN-6995][Core] Limit soft affinity duplicate reading detection max cache items" on Aug 26, 2024

#6995

@jackylee-ch (Contributor)

This patch doesn't fix the performance problem for SoftAffinityListener: Spark events can still be discarded, which may cause problems for users who also rely on listeners. We need a comment to explain this, so that users know the possible problems with this configuration when using it. BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

@zhli1142015 (Contributor, Author)

> This patch doesn't fix the performance problem for SoftAffinityListener: Spark events can still be discarded, which may cause problems for users who also rely on listeners. We need a comment to explain this, so that users know the possible problems with this configuration when using it. BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?

@jackylee-ch (Contributor)

> Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?

This is how I reproduced the problem locally.

  1. Start spark-sql with --conf spark.hadoop.parquet.page.size=1024 --conf spark.hadoop.parquet.block.size=2048 --conf spark.sql.files.maxPartitionBytes=2048
  2. Then run the SQLs below.
     create table test(a string) using parquet;
     create table test1(a string) using parquet;
     insert into test values(0); -- make sure there are 10000 values in it
     insert into test1 select /*+ REPARTITION(10000) */ * from test;
  3. Finally, restart spark-sql with SoftAffinity enabled and run the SQL below.
     select count(*) from test1;

@zhli1142015 (Contributor, Author)

> Do you mean the SoftAffinityListener is the cause of the slowness? I don't observe this; could you help share how to reproduce it in your env?
>
> This is how I reproduced the problem locally.
>
> 1. Start spark-sql with --conf spark.hadoop.parquet.page.size=1024 --conf spark.hadoop.parquet.block.size=2048 --conf spark.sql.files.maxPartitionBytes=2048
> 2. Then run the SQLs below.
>    create table test(a string) using parquet;
>    create table test1(a string) using parquet;
>    insert into test values(0); -- make sure there are 10000 values in it
>    insert into test1 select /*+ REPARTITION(10000) */ * from test;
> 3. Finally, restart spark-sql with SoftAffinity enabled and run the SQL below.
>    select count(*) from test1;

With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing. From the code, there are only cache get/put operations in the event handling logic; I think the memory pressure (GC) is more likely the cause.

(screenshots: timing results for 10K values and for 120K values)
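
For context, a rough illustration (not the actual SoftAffinityListener source) of what "only cache get/put operations in the event handling logic" means: each finished task triggers a small cache update on the driver's listener bus, so the per-event work is cheap, but the cached state itself keeps growing. All names below are assumptions.

```scala
// Illustrative sketch: a listener that does one cheap put per finished task.
// The individual puts are fast; the memory held by the map is where driver
// memory / GC pressure can come from when there are tens of thousands of tasks.
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class DuplicateReadingListenerSketch extends SparkListener {
  // Hypothetical cache; a real implementation would key this by file partition.
  private val cache = new ConcurrentHashMap[String, Long]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val key = s"${taskEnd.stageId}-${taskEnd.taskInfo.index}"
    cache.put(key, taskEnd.taskInfo.duration)
  }
}
```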

@jackylee-ch (Contributor) commented on Aug 26, 2024

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.

Did you finally get 10000+ FilePartitions for the SQL, or only one partition? Oh, there is one more piece of information, sorry I missed it: you need to run select count(*) from test 30 times in one application.
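
For reference, one way (an assumption, not taken from the PR discussion) to drive the query repeatedly within a single application, e.g. from spark-shell, so the listener-side state keeps accumulating across queries:

```scala
// Run the repro query 30 times in one application; `spark` is the session
// provided by spark-shell. Table name follows the repro steps above.
(1 to 30).foreach { i =>
  spark.sql("select count(*) from test1").collect()
  println(s"run $i finished")
}
```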


Run Gluten Clickhouse CI

@zhli1142015 (Contributor, Author)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?

6k+ partitions for 10K values, and 60k+ partitions for 120k values.
(screenshots: Spark UI showing the partition counts)
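
An illustrative way (not from the PR) to confirm how many scan partitions the table produces, runnable in spark-shell:

```scala
// Number of partitions the scan of test1 produces; `spark` comes from spark-shell.
val scanPartitions = spark.table("test1").rdd.getNumPartitions
println(s"test1 scan partitions: $scanPartitions")
```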

@jackylee-ch (Contributor)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?
>
> 6k+ partitions for 10K values, and 60k+ partitions for 120k values.

Great, could you try 30 SQLs in one application to check if you can reproduce?

@zhli1142015 (Contributor, Author)

> With the above I actually can't repro the issue. I think maybe this is because of hardware differences, but with more partitions we do observe the latency increasing.
>
> Did you finally get 10000+ FilePartitions for the SQL, or only one partition?
>
> 6k+ partitions for 10K values, and 60k+ partitions for 120k values.
>
> Great, could you try 30 SQLs in one application to check if you can reproduce?

OOM happens with 120K values, but there are no event-dropped logs.

(screenshot)

@zhli1142015 (Contributor, Author)

The OOM is not related; it still happens even when SA is disabled.

@jackylee-ch (Contributor)

> OOM happens with 120K values, but there are no event-dropped logs.

Hm, perhaps it is related to hardware performance. I tried ./bin/spark-sql --master local[192] with the default Gluten configs and the above SQLs; the problem can be stably reproduced in our environment.

(screenshot)

> BTW, maybe it's a better choice to change it to false by default and let users turn it on as needed.

Since this PR already disables duplicate reading detection by default, it's okay with me now.
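
For users who still want the feature after this change, opting back in would look roughly like the sketch below. The configuration keys are assumptions for illustration only; check the Gluten configuration docs for the exact names.

```scala
// Hypothetical opt-in; the keys below are NOT confirmed by this PR.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.gluten.soft-affinity.enabled", "true")                         // assumed key
  .set("spark.gluten.soft-affinity.duplicateReadingDetect.enabled", "true")  // assumed key
```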

@jackylee-ch (Contributor) left a review comment:

Thanks for your fix.

@zhli1142015 merged commit edaf88a into apache:main on Aug 28, 2024
43 checks passed
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
…ax cache items (apache#7003)

* [Core] Limit soft affinity duplicate reading detection max cache items

* disable duplicate_reading by default
Successfully merging this pull request may close these issues.

[VL] Driver memory leak with soft affinity duplicate reading