
Performance in a multi-GPU environment #47

Closed
francois-wellenreiter opened this issue Apr 26, 2017 · 7 comments
Comments

@francois-wellenreiter

Hi there,

This is not really an issue, just a general question about how to preserve performance on multi-GPU machines when using the GPUEnabler plug-in.
I would like to run a Spark-based application on a cluster of Intel Xeon dual-socket machines.
Each machine is also equipped with 2 NVIDIA GPUs (one GPU directly attached to each socket). There is no NVLink connection between the GPUs.

Given the NUMA penalty that a process pays when it accesses the farther GPU, I am wondering whether there is any parameter in Spark, in CUDA, or in another tool that would let all Spark tasks use the closest GPU for a given computation, or at least a way to prevent a process running on socket 1 from using GPU 0 (assuming GPU 0 is attached to socket 0 and GPU 1 to socket 1).
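For reference, the kind of OS/CUDA-level pinning I have in mind would look roughly like the sketch below (a hypothetical wrapper around the executor launch; the exact launch command is a placeholder):

    # Hypothetical wrapper for an executor that should run on socket 1.
    # CUDA_VISIBLE_DEVICES=1 makes GPU 1 the only device CUDA enumerates
    # (it is then seen as device 0 inside the process), while numactl pins
    # CPU and memory allocations to NUMA node 1.
    CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 <executor launch command>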

Thank you for any help, François

@josiahsams
Member

Sorry @francois-wellenreiter, we still rely on the Spark scheduler for CPU placement, and as of now we choose GPUs based on the executor ID, not through any other heuristic such as proximity.
In short, all partitions handled by a single executor are processed by one GPU. It is a good idea, though, and we will consider making it part of our future enhancements.
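Roughly, the current selection rule amounts to the following (a simplified sketch; the actual snippet from the plug-in is quoted further down in this thread):

    import jcuda.runtime.JCuda

    // Simplified sketch of the current rule: every task run by a given
    // executor ends up on the GPU derived from that executor's ID.
    def selectGpu(executorId: Int, gpuCount: Int): Unit = {
      val deviceId = executorId % gpuCount
      JCuda.cudaSetDevice(deviceId) // all CUDA work in this process now targets deviceId
    }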

@a-agrz

a-agrz commented Jul 21, 2017

Hi,

I was able to run the SparkGPULR example on a machine with 3 GPUs attached to it (2 Tesla M60 and 1 Tesla K20).
I configured Spark in cluster mode with one master and two workers. Each worker has 2 executors (4 in total) with 8 cores each, so I use 32 cores in total on the machine running Spark. I also give 8 GB of memory to each executor.

When launching the application with spark-submit, I noticed the following behavior regarding GPU usage by running the nvidia-smi command:

[screenshot: nvidia-smi output]

As you can see in the screenshot above:

  • GPU 0 is used by four executors; two of them use less memory (72 MiB) than the other two (>300 MiB)
  • GPU 1 is used by 1 executor (385 MiB of memory)
  • GPU 2 is used by 1 executor (376 MiB of memory)

From the source code I was able to understand that GPUEnabler chooses the GPU based on the executor ID, as shown here:
def get = {
  this.synchronized {
    if (SparkEnv.get != oldSparkEnv) {
      oldSparkEnv = SparkEnv.get
      initalize()
    }
    if (env.isGPUEnabled) {
      val executorId = env.executorId match {
        case "driver" => 0
        case _ => SparkEnv.get.executorId.toInt
      }
      JCuda.cudaSetDevice(executorId % env.gpuCount)
    }
    env
  }
}

So every executor should be attached to only one GPU for the whole execution time, right? But that is not the case here!
Could you please help me understand why we see this behavior?

Thanks in advance

Abdallah

@josiahsams
Member

Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is about 70 MB in size, and it gets destroyed once the process completes execution.
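As an illustration, here is a minimal standalone sketch using the JCuda runtime API (not GPUEnabler code) of when that context comes into existence:

    import jcuda.Pointer
    import jcuda.runtime.JCuda

    object ContextDemo {
      def main(args: Array[String]): Unit = {
        // Select the GPU this process should use; no context exists yet.
        JCuda.cudaSetDevice(1)
        // The first runtime call that touches the device creates the context
        // (roughly 70 MB of device memory); a no-op free is enough to trigger it.
        JCuda.cudaFree(new Pointer())
        // The context now shows up under this PID in nvidia-smi and stays alive
        // until the process exits (or cudaDeviceReset() is called).
      }
    }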

A couple of other pointers:

  1. Make sure the load is evenly distributed across the 3 available GPUs (number of partitions a multiple of 3).
  2. Choose the cores and memory required per executor in spark-defaults.conf such that the cluster manager starts only 3 executors (one per GPU), or a multiple of 3 executor instances.

@a-agrz

a-agrz commented Jul 31, 2017

Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is about 70 MB in size, and it gets destroyed once the process completes execution.

Are you saying that JCuda.cudaSetDevice(executorId % env.gpuCount) will always create a ~70 MB context on GPU 0, even if we are addressing GPU 1 (when executorId = 1, for instance)?

A couple of other pointers:
Make sure the load is evenly distributed across the 3 available GPUs (number of partitions a multiple of 3).
Choose the cores and memory required per executor in spark-defaults.conf such that the cluster manager starts only 3 executors (one per GPU), or a multiple of 3 executor instances.

Thank you for these clarifications.
Also, when working on a single machine, do I have to create multiple workers, given that with one worker I can already create as many executors as I want (taking into account how many cores my machine has)? In other words, is there any impact on GPU usage between using multiple workers and a single worker on one machine?

thanks in advance

Abdallah

@josiahsams
Member

@a-agrz, cudaSetDevice will create a context only on the GPU we pass as its input argument, before any other operations on that GPU. But once all the tasks on the executor are done, the GPU context is not destroyed for as long as the executor PID is alive; the executor can be in a state where it is waiting for more tasks to be assigned.

Regarding your other query: consider a cluster with a single node that has 6 cores, 12 GB of RAM, and 3 GPUs. By allocating executor.cores=2 and executor.memory=4g, the Spark Standalone cluster manager will spawn exactly 3 executors if the application requests 6 cores (--total-executor-cores 6) for its execution.
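For example, a submission along these lines (a sketch only; the master URL, class name, and jar path are placeholders) would end up with exactly 3 executors of 2 cores each:

    time $SPARK_HOME/bin/spark-submit --master spark://<master-host>:7077 \
      --conf spark.executor.cores=2 \
      --conf spark.executor.memory=4g \
      --total-executor-cores 6 \
      --class <your main class> <your application jar> <application args>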

In our experiments, we found that having up to 4 executors per GPU does not have much impact on performance.

Hope my reply helps.

@a-agrz

a-agrz commented Aug 1, 2017

@josiahsams Thank you for your time and your answers.
But I still have two questions:
1)
cudaSetDevice will create a context only on the GPU we pass as its input argument, before any other operations on that GPU. But once all the tasks on the executor are done, the GPU context is not destroyed for as long as the executor PID is alive; the executor can be in a state where it is waiting for more tasks to be assigned.

Yes, that is what is supposed to happen: each executor is attached, according to its ID, to only one GPU. But that is not what happened!
Here I ran the test example SparkGPULR with --executor-cores=10, --total-executor-cores=30, and --executor-memory=10g. The Spark Standalone cluster manager spawned exactly 3 executors.
Running nvidia-smi, I found that at the beginning of execution there are three processes (executors), each attached to one GPU, which is exactly what we expected to happen. See the screenshot below.

[screenshot: nvidia-smi output at the start of execution]

But then, after about 10 seconds of execution, nvidia-smi shows that executors 1 and 2 are not only occupying GPU 1 and GPU 2, as we said earlier, but are also occupying GPU 0 with 72 MiB of memory (see the screenshot below). According to my understanding, this is not supposed to happen, because cudaSetDevice attached them to one GPU from the beginning.
[screenshot: nvidia-smi output after ~10 seconds of execution]
So why do we see this behavior?
I hope I have explained my question well!

2) Regarding performance, I noticed that in cluster mode there is no gain in speed despite using three GPUs.
[screenshot: cluster-mode timing]

On the other hand, when we run the application in local mode, there is a speed-up of a factor of 1.2, even though only one GPU is used in this mode (GPU 0, because in local mode there is only one executor process).

[screenshot: local-mode timing]

So why is there no speed-up in cluster mode even when we use all three GPUs?

I have thought about it, and my first assumption is that the overhead of distributing data among executors (3 in my example) is the cause of this slowdown, while in local mode there is only one process, so this overhead is negligible. Is that right?

Sorry for my long posts :p

thank you in advance

Abdallah

@josiahsams
Member

@a-agrz, now I understand your problem. From the PIDs listed by the nvidia-smi command, it looks like all the executors create a context on GPU 0 and only later move on to their assigned GPUs, create a new context there, and continue execution. So the context that got created on GPU 0 is left behind for as long as that PID exists.

We have recently made a few changes in the relevant portion of the code which should address this issue.
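The general idea (a sketch only for illustration, not the actual patch) is to make sure cudaSetDevice runs before any other CUDA call in the executor, so that no implicit context is ever created on device 0:

    import jcuda.Pointer
    import jcuda.runtime.JCuda

    // Sketch of the intended initialization order in an executor process.
    // If a CUDA call that touches the device (an allocation, copy, or kernel
    // launch) ran before cudaSetDevice, it would implicitly create a context
    // on device 0, which matches the stray 72 MiB allocations you observed
    // on GPU0 in nvidia-smi.
    def initGpu(executorId: Int, gpuCount: Int): Unit = {
      JCuda.cudaSetDevice(executorId % gpuCount) // select the GPU before anything else
      val devPtr = new Pointer()
      JCuda.cudaMalloc(devPtr, 1024) // first real call: the context lands on the selected GPU
      JCuda.cudaFree(devPtr)
    }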

To verify with the latest patch, kindly apply PR #62 (which is under review) and rerun your tests. Please note that all the latest patches are going into the "GPUDataset" branch, which has support for Dataset.

With respect to performance numbers, kindly use the Dataset-based logistic regression example as follows.
For local mode:

time ~/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --master local[*] \
  --conf spark.driver.memory=15g \
  --class com.ibm.gpuenabler.SparkDSLR \
  --jars ~/new/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar \
  ~/new/GPUEnabler/examples/target/gpu-enabler-examples_2.11-1.0.0.jar local[16] 16 1000000 400 5

[screenshot: local-mode run output, 2017-08-02 12:20 pm]

For cluster mode:

time ~/spark/bin/spark-submit --master spark://soe15:7077 \
  --conf spark.driver.memory=15g \
  --class com.ibm.gpuenabler.SparkDSLR \
  --jars ~/new/GPUEnabler/gpu-enabler/target/gpu-enabler_2.11-1.0.0.jar \
  ~/new/GPUEnabler/examples/target/gpu-enabler-examples_2.11-1.0.0.jar spark://soe15:7077 16 1000000 400 10

[screenshot: cluster-mode run output, 2017-08-02 12:25 pm]

GPU Usage:
[screenshot: nvidia-smi GPU usage, 2017-08-02 12:30 pm]

(Note: ignore PID 87819 in the output; it is the driver PID.)

Thanks,
Joe.
