Lots of rejected futures within IPC on driver node #930

dbeavon · 2021-04-29T02:22:55Z

When my .NET application is running in an "all-purpose" Databricks cluster under load, I get lots of rejected futures:


21/04/29 02:05:22 ERROR DotnetBackendHandler: Exception caught: 
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@42b847c1 rejected from java.util.concurrent.ThreadPoolExecutor@675f3e07[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
	at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:678)
	at org.apache.spark.api.dotnet.ThreadPool$.run(ThreadPool.scala:33)
	at org.apache.spark.api.dotnet.DotnetBackendHandler.handleBackendRequest(DotnetBackendHandler.scala:105)
	at org.apache.spark.api.dotnet.DotnetBackendHandler.channelRead0(DotnetBackendHandler.scala:28)
	at org.apache.spark.api.dotnet.DotnetBackendHandler.channelRead0(DotnetBackendHandler.scala:21)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

There isn't any more detail in the log4j output on the Scala side.

Does anyone know what this part of the message means?
Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0
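
For reference on the bracketed text: it is the state that `java.util.concurrent.ThreadPoolExecutor.toString()` reports. `Terminated` means the pool had already been shut down before the task was submitted, and the zero counts mean it held no threads and had never completed a task. The sketch below reproduces the same message with plain JDK classes. `TerminatedPoolDemo` and `submitAfterShutdown` are hypothetical names, and this only illustrates the JDK behavior; it is not a claim about how the Microsoft.Spark `ThreadPool` manages its executors.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class: shows the JDK behavior behind the log message only.
public class TerminatedPoolDemo {

    // Submits a task to an executor that was shut down before it ever ran
    // anything, and returns the resulting rejection message.
    static String submitAfterShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // A pool with no workers and an empty queue terminates as soon as
        // shutdown() is called.
        executor.shutdown();
        try {
            executor.submit(() -> { });
            return "accepted";
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy builds this message from the task and
            // the pool's toString(), which includes the pool state.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown());
    }
}
```

Running it prints a message ending in the same `[Terminated, pool size = 0, ...]` text, which suggests the thing to chase is what shut the executor down while requests were still arriving.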

On the .NET side it seems that the program is trying to do something simple (Microsoft.Spark.Sql.DataFrame.ParseConnectionInfo):


[At 02:05:21.739 on 0429-015206-shuns344-10-129-253-14 for file datarail/temp/utc_2021_04_29/f157511d-6522-42a0-a4d6-39a0c3058709.xml] Error is : System.Exception: JVM method execution failed: Nonstatic method 'toString' failed for class '95' when called with no arguments
 ---> Microsoft.Spark.JvmException: Exception of type 'Microsoft.Spark.JvmException' was thrown.
   --- End of inner exception stack trace ---
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] args)
   at Microsoft.Spark.Interop.Ipc.JvmBridge.CallNonStaticJavaMethod(JvmObjectReference objectId, String methodName, Object[] args)
   at Microsoft.Spark.Interop.Ipc.JvmObjectReference.Invoke(String methodName, Object[] args)
   at Microsoft.Spark.Sql.DataFrame.ParseConnectionInfo(Object info, Boolean parseServer)
   at Microsoft.Spark.Sql.DataFrame.GetConnectionInfo(String funcName, Object[] args)
   at Microsoft.Spark.Sql.DataFrame.GetRows(String funcName, Object[] args)+MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at UFP.DataRail.Spark.Driver.Program.Main(String[] args) in D:\a\1\s\DotNetWorkspaces\DataRailWorkspace\DataRail\Spark\Driver\Program.cs:line 370.

I don't get these issues when I'm running ten concurrent jobs in my cluster, but as I slowly increase the number of concurrent jobs (to 30, 40, 50), I start seeing lots of these errors. There is plenty of RAM and plenty of CPU, and the driver process isn't spending too much time on GC.

Any help would be appreciated. I'm several days behind schedule now. I thought that my spark cluster size was sufficient to avoid problems as I scaled up, but these issues don't seem to be resource related. They seem to be subtle timing bugs.

I may be able to create a repro. Is there a common place to upload that sort of thing?

I also have a support case open with Databricks, but they have very little familiarity with spark.net and will probably not want to support this as soon as they see "Microsoft.Spark" referenced in the stack frames.

imback82 · 2021-04-29T03:23:35Z

If you have a repro outside databricks, I am happy to check this out. cc @suhsteve

9 replies

dbeavon May 5, 2021
Author

@imback82 I've created a fairly minimal repro and sent it over to azure-databricks (case 2104270040005818). They are working on involving Databricks support too (and eventually talking to a Databricks engineer).

Do they have a way of getting in touch with you? I'm assuming that goes thru other channels, and they probably don't post messages into this project, right?

I'm guessing there is some kind of a timing issue in the IPC mechanism. It is probably unique to their driver daemon. Hopefully someone can come up with an innovative idea for getting things to work better. .NET for Spark generally works fine until a higher number of jobs are being processed in a single ("all-purpose") Databricks cluster.

The symptoms that are most obvious are the unusual performance issues as well as unusual errors (rejected futures).

If I can't get any traction from the databricks support team, then I may try to synchronize all my (50) driver processes with the help of "global named mutexes" https://stackoverflow.com/a/46329498/4455524

This is very ugly, since the whole point of the cluster is to do the job processing in a concurrent way. However it is more important that I get some level of consistency, and avoid the unpredictable errors... This could buy me some more time to find another Spark platform that better interoperates with .net for spark.

imback82 May 5, 2021

> Do they have a way of getting in touch with you? I'm assuming that goes thru other channels, and they probably don't post messages into this project, right?

They can reach out to me via the email in https://github.com/imback82. I am happy to get on the call.

> I'm guessing there is some kind of a timing issue in the IPC mechanism. It is probably unique to their driver daemon.

Did you try to repro in either Python or Scala, to see whether the bottleneck is in their driver daemon rather than in the .NET IPC mechanism?

dbeavon May 6, 2021
Author

@imback82 No, I didn't dig that deep in Python or Scala yet.

As the performance bottleneck gets worse we start to see the error (rejected futures), and the exception stack frames always appear to have spark.net in them. So I'm assuming the performance bottleneck and the errors are both symptoms of the same underlying problem. They are probably both a result of the same compatibility issue between .NET and Scala (probably in the IPC, but I'm not certain).

I really wish I could recreate the problem outside of the databricks workspace. Everything is harder to troubleshoot when processes aren't running locally.

Locally with OSS spark everything is extremely fast.... As of now I've only tested my driver with the "client" option for "deploy-mode". I still plan on testing "cluster" but I doubt it would make a difference (I'm sure you would have already seen that long ago if there were any differences between the two in OSS spark).

I'm interested to see how much support we will get from databricks. The azure-databricks team seems to claim this is supported now, but doesn't have the means to support it (no actual spark engineers, nor source code). Ultimately they can try and say something is supported, but it is up to the databricks team to actually offer the support or not.

dbeavon May 14, 2021
Author

@imback82 Hi, I'm curious how hard it would be to simulate the way the Databricks "driver daemon" behaves. It acts as a host for multiple .NET Core driver processes at the same time (via DotnetRunner). It would be really nice to get that simulation running locally on my desktop. Then it might be possible to recreate the same issues that I'm seeing within azure-databricks.

I was considering whether the "debug" mode for DotnetRunner might help me. For example:

spark-submit --master spark://172.30.11.206:7077 --class org.apache.spark.deploy.dotnet.DotnetRunner ... microsoft-spark-3-0_2.12-1.1.1.jar debug

I have always been able to launch a single dotnet driver program that runs thru that "debug" variation of the DotnetRunner (of course).

And I was surprised that you can also launch multiple driver programs thru it at the same time. It works to a degree. But the "catalyst" package fails to create a logical plan. See below.

[2021-05-14T21:48:17.2153641Z] [24651-DESKTOP] [Error] [JvmBridge] JVM method execution failed: Nonstatic method 'col' failed for class '980' when called with 1 arguments ([Index=1, Type=String, Value=DIM_BalanceAccount], )
[2021-05-14T21:48:17.2155893Z] [24651-DESKTOP] [Error] [JvmBridge] org.apache.spark.sql.AnalysisException: Reference 'DIM_BalanceAccount' is ambiguous, could be: temp_weeksummary.DIM_BalanceAccount, temp_weeksummary.DIM_BalanceAccount.;
        at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
        at org.apache.spark.sql.Dataset.resolve(Dataset.scala:262)
        at org.apache.spark.sql.Dataset.col(Dataset.scala:1353)
        at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.handleMethodCall(DotnetBackendHandler.scala:165)
        at org.apache.spark.api.dotnet.DotnetBackendHandler.$anonfun$handleBackendRequest$2(DotnetBackendHandler.scala:105)
        at org.apache.spark.api.dotnet.ThreadPool$$anon$1.run(ThreadPool.scala:34)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Is there some way to get past this exception and move on to the next? It looks like it might be a Hive issue.
If I can configure an independent context for resolving references, then maybe this could work. The sessions need to be independent of each other, despite the fact that all these .NET drivers are passing thru the same Spark/Scala driver. Does that seem possible?

Assuming I can get past some of these other errors, would this be at all similar to what Databricks is doing in their "driver daemon"?

I don't suppose that databricks would supply us with their own testing harness that would simulate their driver daemon?

imback82 May 15, 2021

For the AnalysisException above, I think you need to rewrite your query with a slight change. If it's a df query, you can do something like var newDf = df.As("df_new") and then access the column as newDf["DIM_BalanceAccount"].

> Assuming I can get past some of these other errors, would this be at all similar to what Databricks is doing in their "driver daemon"?

Sorry again, but I do not know the internals of their "driver daemon". :(
