Robust qv_scope_split_at implementation #9
You can reuse the guided-mode implementation from Hsplit if you wish, because the functionality seems rather similar to me (this does not answer @eleon's comment).
Please re-test to see if 2703e88 fixes this issue.
Thank you, @samuelkgutierrez. Unfortunately, this is still an issue. Here's an example.
Perhaps this will be solved by the affinity-preserving CPU/GPU algorithms for split.
Thank you for testing, @eleon. Yes, an affinity-preserving algorithm should fix this issue.
Greetings, @samuelkgutierrez. There are still issues with the latest build. Same test machine as above, same command as above:
Can you please try again by modifying the test to use …?
No, you will have to modify the test code to use …
Thank you, @samuelkgutierrez! Using …
The issues are with the following calls:
Thank you, @eleon. Can you push the changes you made so I can see what's going on? Regarding the second issue, are both GPUs attached to the package containing cores 0-17?
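One way to answer the attachment question is to walk the hwloc topology and report the closest non-I/O ancestor of each GPU OS device. Below is a standalone hwloc 2.x sketch (plain hwloc, not quo-vadis internals); the file name and output format are just for illustration.

```c
/* gpu-attach.c: report which package/cpuset each GPU hangs off.
 * Standalone hwloc 2.x sketch; not part of the quo-vadis code base. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* Keep I/O objects (PCI/OS devices) so GPUs show up in the topology. */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    hwloc_obj_t dev = NULL;
    while ((dev = hwloc_get_next_osdev(topo, dev)) != NULL) {
        const hwloc_obj_osdev_type_t t = dev->attr->osdev.type;
        if (t != HWLOC_OBJ_OSDEV_GPU && t != HWLOC_OBJ_OSDEV_COPROC) continue;
        /* The closest non-I/O ancestor tells us which package (and hence
         * which cores, e.g., 0-17) the device is attached to. */
        hwloc_obj_t near = hwloc_get_non_io_ancestor_obj(topo, dev);
        char cpus[256];
        hwloc_bitmap_snprintf(cpus, sizeof(cpus), near->cpuset);
        printf("%s is attached near %s (cpuset %s)\n",
               dev->name ? dev->name : "gpu",
               hwloc_obj_type_string(near->type), cpus);
    }
    hwloc_topology_destroy(topo);
    return 0;
}
```

Building with `cc gpu-attach.c -lhwloc` and running on the test machine should show whether both GPUs really sit under the package that holds cores 0-17.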
Greetings, @samuelkgutierrez. I pushed the changes to …
Yikes! I didn't notice. My apologies. If both GPUs are attached to the same socket, then …
Sounds good, @samuelkgutierrez. Thank you!
Adding another thought before I forget. The way I see the regular …
Maybe …
Possibly, @samuelkgutierrez. It's just that I think about it the opposite way: …
Good morning, @samuelkgutierrez. Progress! But some issues too. Case 1: Not using …
The GPUs are split correctly among the MPI workers. However, the assigned CPUs are not local to the assigned GPUs. Case 2: Using …
Two main issues here:
Thanks!
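To make the "assigned CPUs are not local to the assigned GPUs" symptom easy to check mechanically, the test could compare the process's CPU binding after the split against the cpuset that is local to its GPU. Here is a minimal hwloc-only sketch of that check; it is independent of the quo-vadis API, and the "assigned GPU" is simply the first GPU OS device, purely for illustration.

```c
/* locality-check.c: does my current CPU binding overlap the cpuset that is
 * local to my assigned GPU? hwloc 2.x sketch; the "assigned GPU" here is
 * just the first GPU OS device found, purely for illustration. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    /* Find the first GPU OS device (stand-in for "my assigned GPU"). */
    hwloc_obj_t gpu = NULL, dev = NULL;
    while ((dev = hwloc_get_next_osdev(topo, dev)) != NULL) {
        if (dev->attr->osdev.type == HWLOC_OBJ_OSDEV_GPU ||
            dev->attr->osdev.type == HWLOC_OBJ_OSDEV_COPROC) { gpu = dev; break; }
    }
    if (!gpu) { fprintf(stderr, "no GPU found\n"); return 1; }

    /* CPUs local to the GPU = cpuset of its closest non-I/O ancestor. */
    hwloc_const_cpuset_t gpu_cpus =
        hwloc_get_non_io_ancestor_obj(topo, gpu)->cpuset;

    /* CPUs this process is currently bound to (e.g., after the test's bind push). */
    hwloc_cpuset_t mine = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, mine, HWLOC_CPUBIND_PROCESS);

    printf("binding is %s to the assigned GPU\n",
           hwloc_bitmap_intersects(mine, gpu_cpus) ? "local" : "NOT local");

    hwloc_bitmap_free(mine);
    hwloc_topology_destroy(topo);
    return 0;
}
```

A stricter variant could use hwloc_bitmap_isincluded() instead of hwloc_bitmap_intersects() to require that every assigned CPU be GPU-local, not just some of them.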
Another subtlety about the …
Let's say we split at GPUs. What I'm looking to answer is: can the resulting subscopes have more than one GPU? Here's my desired behavior. Let's say we have a node with 3 GPUs and a job with 2 tasks.
Let's say we have a compute node with 3 GPUs and 3 NUMA domains. The 3 GPUs hang off the first NUMA domain. When I use `split_at(..., HW_OBJ_GPUs, ...)`, I would expect the subscopes to be derived from NUMA 0, but most likely this implementation will derive one subscope per NUMA domain. f2d5cee
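To make that expectation concrete: with all three GPUs under the first NUMA domain, a GPU-based split for two tasks should carve both subscope cpusets out of the GPU-local CPUs (NUMA 0), not hand each task a different NUMA domain. The following is a rough hwloc-only sketch of that derivation, not the quo-vadis implementation; NTASKS and the use of hwloc_distrib are illustrative assumptions.

```c
/* expected-split.c: derive per-task cpusets for a GPU-based split from the
 * CPUs local to the GPUs (here, NUMA 0), instead of one NUMA domain per task.
 * hwloc 2.x sketch, not the quo-vadis implementation. */
#include <hwloc.h>
#include <limits.h>
#include <stdio.h>

#define NTASKS 2  /* hypothetical: 2 tasks sharing the node's 3 GPUs */

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
    hwloc_topology_load(topo);

    /* Union of the cpusets local to the GPUs. With all 3 GPUs hanging off
     * the first NUMA domain, this is (roughly) NUMA 0's cpuset. */
    hwloc_bitmap_t pool = hwloc_bitmap_alloc();
    hwloc_obj_t dev = NULL;
    while ((dev = hwloc_get_next_osdev(topo, dev)) != NULL) {
        if (dev->attr->osdev.type != HWLOC_OBJ_OSDEV_GPU &&
            dev->attr->osdev.type != HWLOC_OBJ_OSDEV_COPROC) continue;
        hwloc_bitmap_or(pool, pool,
                        hwloc_get_non_io_ancestor_obj(topo, dev)->cpuset);
    }
    if (hwloc_bitmap_iszero(pool)) { fprintf(stderr, "no GPU found\n"); return 1; }

    /* Partition that GPU-local pool among the tasks; hwloc_distrib fills the
     * array with newly allocated cpusets carved from under the root object. */
    hwloc_obj_t root = hwloc_get_obj_covering_cpuset(topo, pool);
    hwloc_cpuset_t piece[NTASKS];
    hwloc_distrib(topo, &root, 1, piece, NTASKS, INT_MAX, 0);

    for (int i = 0; i < NTASKS; i++) {
        char buf[256];
        hwloc_bitmap_snprintf(buf, sizeof(buf), piece[i]);
        printf("task %d expected cpuset: %s\n", i, buf);
        hwloc_bitmap_free(piece[i]);
    }
    hwloc_bitmap_free(pool);
    hwloc_topology_destroy(topo);
    return 0;
}
```

On the machine described above, both printed cpusets should be subsets of NUMA 0's CPUs; an implementation that instead derives one subscope per NUMA domain would place the second task on CPUs far from every GPU.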