Performance in a multi-GPU environment #47
Sorry @francois-wellenreiter, we still rely on the Spark scheduler for the CPU side and, as of now, we choose GPUs based on the executor ID and not through any other heuristics such as proximity.
Hi, I was able to run the SparkGPULR example on a machine that has 3 GPUs attached to it (2 Tesla M60 and 1 Tesla K20). When launching the application with spark-submit, I noticed the following behavior regarding GPU usage by running the nvidia-smi command; the attached screenshot shows the resulting GPU usage.
From the source code I understood that GPUEnabler chooses the GPU based on the executor ID, as shown here. So every executor should be attached to only one GPU for the whole execution time, but that's not the case here. Thanks in advance, Abdallah
Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is ~70 MB in size and is destroyed once the process completes execution. A few other pointers are:
> Each process (Spark executor), during its initialization with the GPU, creates a context on the GPU (refer to JCuda.cudaSetDevice). The context is ~70 MB in size and is destroyed once the process completes execution.

Are you saying that JCuda.cudaSetDevice(executorId % env.gpuCount) will always create a context of 70 MB on GPU0, even if we are addressing GPU1 when executorId=1, for instance? Thank you for these clarifications. Thanks in advance, Abdallah
@a-agrz, cudaSetDevice will create a context only on the GPU we provide as the input argument, before any other operations on that GPU. But once all the tasks on the executor are done, the GPU context is not destroyed as long as the executor PID is active; the executor can be in a state where it is waiting for more tasks to be assigned. Regarding your other query: consider a cluster with a single node with 6 cores, 12 GB RAM and 3 GPUs. By allocating executor.cores=2 and executor.memory=4GB, the Spark Standalone cluster manager will spawn exactly 3 executors if the application requested 6 cores (--total-executor-cores 6) for its execution. In our experiments, we found that having up to 4 executors per GPU doesn't have much impact on performance. Hope my reply helps.
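For illustration, a submission matching the allocation described above might look like the sketch below; the master URL, example class and jar path are placeholders and are not taken from this thread:

```bash
# 6 requested cores / 2 cores per executor => the standalone master spawns 3 executors,
# so on a node with 3 GPUs each executor ends up on its own GPU (executorId % gpuCount).
spark-submit \
  --master spark://<master-host>:7077 \
  --total-executor-cores 6 \
  --executor-cores 2 \
  --executor-memory 4G \
  --class <example-class> \
  <application-jar>
```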
@josiahsams Thank you for your time and your answers. Yes, that's what is supposed to happen: each executor should be attached, according to its ID, to only one GPU. But that's not what happened. After about 10 seconds of execution, nvidia-smi shows that executors 1 and 2 are not only occupying GPU1 and GPU2, as we said earlier, but also occupying GPU0 with 72 MiB of memory each (see the screenshot below), which to my understanding is not supposed to happen, since cudaSetDevice attached each of them to a single GPU from the beginning.

Regarding performance, I noticed that in cluster mode there is no gain in speed despite the use of three GPUs. On the other hand, when we run the application in local mode, there is a speed-up of a factor of 1.2 while only one GPU is used (GPU0, because in local mode there is only one executor process). So why is there no speed-up in cluster mode even when we use all three GPUs? My first assumption is that the overhead of distributing the data among the executors (3 in my example) causes the slowdown, whereas in local mode there is only one process, so this overhead is not noticeable. Is that right?

Sorry for my long posts :p Thank you in advance, Abdallah
@a-agrz, now I understand your problem. From the PIDs listed in the nvidia-smi output, it looks like all the executors create a context on GPU0, then later move on to the next GPUs, create a new context there and continue with their execution. So the context that was created on GPU0 is left until that PID exits. We recently made a few changes to the relevant portion of the code which should address this issue. To verify with the latest patch, kindly apply PR #62 (which is under review) and rerun your tests (a checkout sketch follows this comment). Please note that all the latest patches are going into the branch "GPUDataset", which has support for Dataset. W.r.t. performance numbers, kindly use the Dataset-based Logistic Regression example as follows,
For Cluster Mode:
(Note: ignore PID 87819 in the output, which is the driver PID.) Thanks,
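For reference, a generic way to check out an unmerged pull request such as #62 locally is sketched below; it assumes the GPUEnabler repository is cloned with the default remote named origin, and the local branch name pr-62 is just a label:

```bash
# Fetch the head of pull request #62 into a local branch and switch to it.
git fetch origin pull/62/head:pr-62
git checkout pr-62

# Rebuild the plugin with your usual build command (e.g. sbt package) and rerun the tests.
```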
Hi there,
Actually this is not a real issue, just a general question about how to maintain performance on multi-GPU machines when using the GPUEnabler plug-in.
I would like to run a Spark-based application on a cluster composed of dual-socket Intel Xeon machines.
Each of these machines is also equipped with 2 NVIDIA GPUs (1 GPU directly connected to each socket). No NVLink connection is available between these GPUs.
Given the NUMA factor that penalizes any process accessing the farthest GPU, I am wondering whether there is any parameter in Spark, in CUDA, or in any other tool that would make all the Spark tasks use the closest GPU for a given computation, or a means to prevent any process running on socket 1 from using GPU 0 (considering that GPU 0 is connected to socket 0 and GPU 1 to socket 1).
Thank you for any help, François
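For reference, one workaround commonly used outside of Spark itself is to combine CUDA_VISIBLE_DEVICES with numactl, so that a process pinned to one socket only sees the GPU attached to that socket. The sketch below is not a GPUEnabler feature and assumes GPU 0 is attached to socket 0 and GPU 1 to socket 1; `<command>` stands for whatever process you launch (e.g. a Spark worker):

```bash
# Pin a process to NUMA node 0 and expose only GPU 0 to it.
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 <command>

# The mirror case: pin to NUMA node 1 and expose only GPU 1.
# Note: CUDA renumbers visible devices, so inside this process GPU 1 appears as device 0.
CUDA_VISIBLE_DEVICES=1 numactl --cpunodebind=1 --membind=1 <command>
```

Launching one worker per socket this way gives that worker's executors an unambiguous, NUMA-local GPU. Spark's spark.executorEnv.* settings can also export CUDA_VISIBLE_DEVICES to executors, but they apply the same value to every executor of an application, so that only helps when running one application per GPU.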