Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flink: Disable classloader check in TestIcebergSourceWithWatermarkExtractor to fix flakiness #9408

Merged
merged 1 commit into from
Jan 4, 2024

Conversation

manuzhang
Copy link
Contributor

@manuzhang manuzhang commented Jan 4, 2024

I've seen the following flakiness several times. This PR fixes it by disabling Flink classloader check.

TestIcebergSourceWithWatermarkExtractor > testThrottling FAILED
    java.lang.AssertionError: 
    Expecting
      <CompletableFuture[Failed with the following stack trace:
    java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'.
    	at org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:184)
    	at org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.loadClass(FlinkUserCodeClassLoaders.java:192)
    	at java.lang.Class.forName0(Native Method)
    	at java.lang.Class.forName(Class.java:348)
    	at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
    	at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
    	at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
    	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
    	at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:400)
    	at org.apache.flink.streaming.api.operators.collect.CollectCoordinationResponse.getResults(CollectCoordinationResponse.java:84)
    	at org.apache.flink.streaming.api.operators.collect.AbstractCollectResultBuffer.addResults(AbstractCollectResultBuffer.java:127)
    	at org.apache.flink.streaming.api.operators.collect.AbstractCollectResultBuffer.dealWithResponse(AbstractCollectResultBuffer.java:91)
    	at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:147)
    	at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
    	at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
    	at org.apache.iceberg.flink.source.TestIcebergSourceWithWatermarkExtractor.lambda$waitForRecords$6(TestIcebergSourceWithWatermarkExtractor.java:415)
    	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
    	at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
    	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
    ]>
    to be completed within 2M.

@manuzhang manuzhang force-pushed the fix-flink-flaky-test branch from 9b48382 to 2489a33 Compare January 4, 2024 12:57
@pvary pvary merged commit 6d75e7a into apache:main Jan 4, 2024
13 checks passed
@pvary
Copy link
Contributor

pvary commented Jan 4, 2024

Thanks @manuzhang for the fix.
CC: @stevenzwu

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
@ajantha-bhat
Copy link
Member

I am using the latest code (double checked) and the test case is still flaky.
https://github.com/apache/iceberg/actions/runs/7438225898/job/20236774791?pr=9298

TestIcebergSourceWithWatermarkExtractor > testThrottling FAILED
    java.lang.AssertionError: 
    Expecting
      <CompletableFuture[Failed with the following stack trace:
    java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'.
    	at org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:184)
    	at org.apache.flink.util.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.loadClass(FlinkUserCodeClassLoaders.java:192)
    	at java.lang.Class.forName0(Native Method)
    	at java.lang.Class.forName(Class.java:348)
    	at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:136)
    	at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
    	at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:641)
    	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:752)
    	at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:402)

I propose to disable this testcase (and run local fork to figure out the problem) to avoid causing problems for PR builders.
cc: @pvary, @manuzhang, @stevenzwu, @nastra, @Fokko

@manuzhang
Copy link
Contributor Author

@ajantha-bhat The error looks weird. There's no way SafetyNetWrapperClassLoader is still used when classloader check has been disabled

@manuzhang manuzhang deleted the fix-flink-flaky-test branch January 8, 2024 01:52
@pvary
Copy link
Contributor

pvary commented Jan 8, 2024

Anybody can reproduce this issue locally?
I was able to run 1000 runs using IntelliJ without an issue, but I might miss something

Screenshot 2024-01-08 at 11 44 35

@pvary
Copy link
Contributor

pvary commented Jan 9, 2024

Could not repro the case, but found the issue again in some tests, so decide to remove the checks for the record from the test (relying only on the metrics to check if the WA is working). This is the same approach as it is done in the Flink code, so it should be ok too. #9451

geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants