
Performance in a multi-GPU environment #47

Closed
francois-wellenreiter opened this issue Apr 26, 2017 · 7 comments
Comments

@francois-wellenreiter

Hi there,

This is not really an issue, just a general question about how to preserve performance on multi-GPU machines when using the GPUEnabler plug-in.
I would like to run a Spark-based application on a cluster of Intel Xeon dual-socket machines.
Each machine is also equipped with 2 NVIDIA GPUs (one GPU directly attached to each socket). There is no NVLink connection between the GPUs.

Given the NUMA penalty that a process pays when it accesses the farther GPU, I am wondering whether there is any parameter in Spark, in CUDA, or in another tool that would let all Spark tasks use the closest GPU for a given computation, or at least a way to prevent a process running on socket 1 from using GPU 0 (assuming GPU 0 is attached to socket 0 and GPU 1 to socket 1).
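For reference, the kind of OS/CUDA-level pinning I have in mind would look roughly like the sketch below (a hypothetical wrapper around the executor launch; the exact launch command is a placeholder):

    # Hypothetical wrapper for an executor that should run on socket 1.
    # CUDA_VISIBLE_DEVICES=1 makes GPU 1 the only device CUDA enumerates
    # (it is then seen as device 0 inside the process), while numactl pins
    # CPU and memory allocations to NUMA node 1.
    CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 <executor launch command>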

Thank you for any help, François

@josiahsams
Member

Sorry @francois-wellenreiter, we still rely on the Spark scheduler for CPU placement, and as of now we choose GPUs based on the executor ID, not through any other heuristic such as proximity.
In short, all partitions handled by a single executor are processed by one GPU. It is a good idea, though, and we will consider making it part of our future enhancements.
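Roughly, the current selection rule amounts to the following (a simplified sketch; the actual snippet from the plug-in is quoted further down in this thread):

    import jcuda.runtime.JCuda

    // Simplified sketch of the current rule: every task run by a given
    // executor ends up on the GPU derived from that executor's ID.
    def selectGpu(executorId: Int, gpuCount: Int): Unit = {
      val deviceId = executorId % gpuCount
      JCuda.cudaSetDevice(deviceId) // all CUDA work in this process now targets deviceId
    }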

@a-agrz

a-agrz commented Jul 21, 2017

Hi,

I was able to run the SparkGPULR example on a machine with 3 GPUs attached to it (2 Tesla M60 and 1 Tesla K20).
I configured Spark in cluster mode with one master and two workers. Each worker has 2 executors (4 in total) with 8 cores each, so I use 32 cores in total on the machine running Spark. I also give 8 GB of memory to each executor.

When launching the application with spark-submit, I noticed the following behavior regarding GPU usage by running the nvidia-smi command:

[screenshot: nvidia-smi output]

As you can see in the screenshot above:

  • GPU 0 is used by four executors; two of them use less memory (72 MiB) than the other two (>300 MiB)
  • GPU 1 is used by 1 executor (385 MiB of memory)
  • GPU 2 is used by 1 executor (376 MiB of memory)

From the source code I was able to understand that GPUEnabler chooses the GPU based on the executor ID, as shown here:
def get = {
  this.synchronized {
    if (SparkEnv.get != oldSparkEnv) {
      oldSparkEnv = SparkEnv.get
      initalize()
    }
    if (env.isGPUEnabled) {
      val executorId = env.executorId match {
        case "driver" => 0
        case _ => SparkEnv.get.executorId.toInt
      }
      JCuda.cudaSetDevice(executorId % env.gpuCount)
    }
    env
  }
}

So every executor should be attached to only one GPU for the whole execution time, right? But that is not the case here!
Could you please help me understand why we see this behavior?

Thanks in advance

Abdallah

@josiahsams
Member

Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is about 70 MB in size, and it gets destroyed once the process completes execution.
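As an illustration, here is a minimal standalone sketch using the JCuda runtime API (not GPUEnabler code) of when that context comes into existence:

    import jcuda.Pointer
    import jcuda.runtime.JCuda

    object ContextDemo {
      def main(args: Array[String]): Unit = {
        // Select the GPU this process should use; no context exists yet.
        JCuda.cudaSetDevice(1)
        // The first runtime call that touches the device creates the context
        // (roughly 70 MB of device memory); a no-op free is enough to trigger it.
        JCuda.cudaFree(new Pointer())
        // The context now shows up under this PID in nvidia-smi and stays alive
        // until the process exits (or cudaDeviceReset() is called).
      }
    }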

A couple of other pointers:

  1. Make sure the load is evenly distributed across the 3 available GPUs (number of partitions a multiple of 3).
  2. Choose the cores and memory required per executor in spark-defaults.conf such that the cluster manager starts only 3 executors (one per GPU), or a multiple of 3 executor instances.

@a-agrz

a-agrz commented Jul 31, 2017

Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is about 70 MB in size, and it gets destroyed once the process completes execution.

Are you saying that JCuda.cudaSetDevice(executorId % env.gpuCount) will always create a ~70 MB context on GPU 0, even if we are addressing GPU 1 (when executorId = 1, for instance)?

A couple of other pointers:
Make sure the load is evenly distributed across the 3 available GPUs (number of partitions a multiple of 3).
Choose the cores and memory required per executor in spark-defaults.conf such that the cluster manager starts only 3 executors (one per GPU), or a multiple of 3 executor instances.

Thank you for these clarifications.
Also, when working on a single machine, do I have to create multiple workers, given that with one worker I can already create as many executors as I want (taking into account how many cores my machine has)? In other words, is there any impact on GPU usage between using multiple workers and a single worker on one machine?

thanks in advance

Abdallah

@josiahsams
Member

@a-agrz, cudaSetDevice will create a context only on the GPU we pass as its input argument, before any other operations on that GPU. But once all the tasks on the executor are done, the GPU context is not destroyed for as long as the executor PID is alive; the executor can be in a state where it is waiting for more tasks to be assigned.

Regarding your other query: consider a cluster with a single node that has 6 cores, 12 GB of RAM, and 3 GPUs. By allocating executor.cores=2 and executor.memory=4g, the Spark Standalone cluster manager will spawn exactly 3 executors if the application requests 6 cores (--total-executor-cores 6) for its execution.
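For example, a submission along these lines (a sketch only; the master URL, class name, and jar path are placeholders) would end up with exactly 3 executors of 2 cores each:

    time $SPARK_HOME/bin/spark-submit --master spark://<master-host>:7077 \
      --conf spark.executor.cores=2 \
      --conf spark.executor.memory=4g \
      --total-executor-cores 6 \
      --class <your main class> <your application jar> <application args>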

In our experiments, we found that having up to 4 executors per GPU does not have much impact on performance.

Hope my reply helps.

@a-agrz

a-agrz commented Aug 1, 2017

@josiahsams Thank you for your time and your answers.
But I still have two questions:
1)
cudaSetDevice will create a context only on the GPU we pass as its input argument, before any other operations on that GPU. But once all the tasks on the executor are done, the GPU context is not destroyed for as long as the executor PID is alive; the executor can be in a state where it is waiting for more tasks to be assigned.

Yes, that is what is supposed to happen: each executor is attached, according to its ID, to only one GPU. But that is not what happened!
Here I ran the test example SparkGPULR with --executor-cores=10, --total-executor-cores=30, and --executor-memory=10g. The Spark Standalone cluster manager spawned exactly 3 executors.
Running nvidia-smi, I found that at the beginning of execution there are three processes (executors), each attached to one GPU, which is exactly what we expected to happen. See the screenshot below.

[screenshot: nvidia-smi output at the start of execution]

But then, after about 10 seconds of execution, nvidia-smi shows that executors 1 and 2 are not only occupying GPU 1 and GPU 2, as we said earlier, but are also occupying GPU 0 with 72 MiB of memory (see the screenshot below). According to my understanding, this is not supposed to happen, because cudaSetDevice attached them to one GPU from the beginning.
[screenshot: nvidia-smi output after ~10 seconds of execution]
So why do we see this behavior?
I hope I have explained my question well!

2) Regarding performance, I noticed that in cluster mode there is no gain in speed despite using three GPUs.
[screenshot: cluster-mode timing]

On the other hand, when we run the application in local mode, there is a speed-up of a factor of 1.2, even though only one GPU is used in this mode (GPU 0, because in local mode there is only one executor process).

[screenshot: local-mode timing]

So why is there no speed-up in cluster mode even when we use all three GPUs?

I have thought about it, and my first assumption is that the overhead of distributing data among executors (3 in my example) is the cause of this slowdown, while in local mode there is only one process, so this overhead is negligible. Is that right?

Sorry for my long posts :p

thank you in advance

Abdallah

@josiahsams
Member

@a-agrz, now I understand your problem. From the PIDs listed by the nvidia-smi command, it looks like all the executors create a context on GPU 0 and only later move on to their assigned GPUs, create a new context there, and continue execution. So the context that got created on GPU 0 is left behind for as long as that PID exists.

We have recently made a few changes in the relevant portion of the code which should address this issue.
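The general idea (a sketch only for illustration, not the actual patch) is to make sure cudaSetDevice runs before any other CUDA call in the executor, so that no implicit context is ever created on device 0:

    import jcuda.Pointer
    import jcuda.runtime.JCuda

    // Sketch of the intended initialization order in an executor process.
    // If a CUDA call that touches the device (an allocation, copy, or kernel
    // launch) ran before cudaSetDevice, it would implicitly create a context
    // on device 0, which matches the stray 72 MiB allocations you observed
    // on GPU0 in nvidia-smi.
    def initGpu(executorId: Int, gpuCount: Int): Unit = {
      JCuda.cudaSetDevice(executorId % gpuCount) // select the GPU before anything else
      val devPtr = new Pointer()
      JCuda.cudaMalloc(devPtr, 1024) // first real call: the context lands on the selected GPU
      JCuda.cudaFree(devPtr)
    }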

To verify with the latest patch, kindly apply PR #62 (which is under review) and rerun your tests. Please note that all the latest patches are going into the "GPUDataset" branch, which has support for Dataset.

With respect to performance numbers, kindly use the Dataset-based logistic regression example as follows.
For local mode:

time ~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master local[*] \
  --conf spark.driver.memory=15g \
  --class com.ibm.gpuenabler.SparkDSLR \
  --jars ~/new/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar \
  ~/new/GPUEnabler/examples/target/gpu-enabler-examples_2.11-1.0.0.jar local[16] 16 1000000 400 5

[screenshot: local-mode run output, 2017-08-02 12:20 pm]

For cluster mode:

time ~/spark/bin/spark-submit --master spark://soe15:7077 \
  --conf spark.driver.memory=15g \
  --class com.ibm.gpuenabler.SparkDSLR \
  --jars ~/new/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar \
  ~/new/GPUEnabler/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://soe15:7077 16 1000000 400 10

[screenshot: cluster-mode run output, 2017-08-02 12:25 pm]

GPU Usage:
[screenshot: nvidia-smi GPU usage, 2017-08-02 12:30 pm]

(Note: ignore PID 87819 in the output; it is the driver PID.)

Thanks,
Joe.
