qv_scope_split issue when called from a thread? #283

Open
GuillaumeMercier opened this issue Oct 17, 2024 · 9 comments

@GuillaumeMercier
Collaborator

In the test-pthread-split.c test program, the function called by qv_pthread_create
calls qv_scope_split to split the thread scope into two parts and then frees the
resulting subscope immediately:

    fprintf(stdout,"[%d] Thread %d splitting in two pieces\n", tid, rank);
    qv_scope_t *pthread_subscope = NULL;
    rc = qv_scope_split(scope, 2, rank, &pthread_subscope);
    if (rc != QV_SUCCESS) {
        ers = "qv_scope_split failed";
        qvi_test_panic("%s (rc=%s)", ers, qv_strerr(rc));
    }
    rc = qv_scope_free(pthread_subscope);
    if (rc != QV_SUCCESS) {
        ers = "qv_scope_free failed";
        qvi_test_panic("%s (rc=%s)", ers, qv_strerr(rc));
    }

Adding calls that print information about the subscope
seems to make the test fail:

    fprintf(stdout,"[%d] Thread %d splitting in two pieces\n", tid, rank);
    qv_scope_t *pthread_subscope = NULL;
    rc = qv_scope_split(scope, 2, rank, &pthread_subscope);
    if (rc != QV_SUCCESS) {
        ers = "qv_scope_split failed";
        qvi_test_panic("%s (rc=%s)", ers, qv_strerr(rc));
    }

    qvi_test_scope_report(pthread_subscope, "thread_subscope");
    qvi_test_emit_task_bind(pthread_subscope);

    rc = qv_scope_free(pthread_subscope);
    if (rc != QV_SUCCESS) {
        ers = "qv_scope_free failed";
        qvi_test_panic("%s (rc=%s)", ers, qv_strerr(rc));
    }

I get either this output:

# Starting Hybrid MPI + Pthreads test.
[1965245] mpi_scope sgrank is 0
[1965245] mpi_scope sgsize is 1
[1965245] cpubind=0-7
[1965245] Testing thread_scope_split (nthreads=4)
Array values :val[0]: 0 |val[1]: 1 |val[2]: 0 |val[3]: 1 |
Array values :val[0]: 0 |val[1]: 1 |val[2]: 0 |val[3]: 1 |
[1965254] thread_scope sgrank is 3
[1965254] thread_scope sgsize is 4
[1965253] thread_scope sgrank is 2
[1965253] thread_scope sgsize is 4
[1965252] thread_scope sgrank is 1
[1965252] thread_scope sgsize is 4
[1965251] thread_scope sgrank is 0
[1965251] thread_scope sgsize is 4
[1965251] cpubind=0-1,4-5
[1965251] Thread 0 splitting in two pieces
[1965253] cpubind=0-1,4-5
[1965253] Thread 2 splitting in two pieces
[1965252] cpubind=2-3,6-7
[1965252] Thread 1 splitting in two pieces
[1965254] cpubind=2-3,6-7
[1965254] Thread 3 splitting in two pieces
[quo-vadis error at (quo-vadis.cc::qv_scope_group_rank::119)] An exception occurred at map::at
[quo-vadis error at (quo-vadis.cc::qv_scope_group_rank::119)] An exception occurred at map::at
[quo-vadis error at (quo-vadis.cc::qv_scope_group_rank::119)] An exception occurred at map::at

qvi_test_scope_report@82: 
qvi_test_scope_report@82: [quo-vadis error at (quo-vadis.cc::qv_scope_group_rank::119)] An exception occurred at map::at

qvi_test_scope_report@82: qv_scope_group_rank() failed (rc=Unspecified error)
qv_scope_group_rank() failed (rc=Unspecified error)

qvi_test_scope_report@82: qv_scope_group_rank() failed (rc=Unspecified error)
qv_scope_group_rank() failed (rc=Unspecified error)

Or a plain segfault:

# Starting Hybrid MPI + Pthreads test.
[1965270] mpi_scope sgrank is 0
[1965270] mpi_scope sgsize is 1
[1965270] cpubind=0-7
[1965270] Testing thread_scope_split (nthreads=4)
Array values :val[0]: 0 |val[1]: 1 |val[2]: 0 |val[3]: 1 |
Array values :val[0]: 0 |val[1]: 1 |val[2]: 0 |val[3]: 1 |
[1965277] thread_scope sgrank is 3
[1965277] thread_scope sgsize is 4
[1965274] thread_scope sgrank is 0
[1965274] thread_scope sgsize is 4
[1965275] thread_scope sgrank is 1
[1965275] thread_scope sgsize is 4
[1965276] thread_scope sgrank is 2
[1965276] thread_scope sgsize is 4
[1965276] cpubind=0-1,4-5
[1965277] cpubind=2-3,6-7
[1965277] Thread 3 splitting in two pieces
[1965276] Thread 2 splitting in two pieces
[1965275] cpubind=2-3,6-7
[1965275] Thread 1 splitting in two pieces
[1965274] cpubind=0-1,4-5
[1965274] Thread 0 splitting in two pieces
[Palamede:1965270] *** Process received signal ***
[Palamede:1965270] Signal: Segmentation fault (11)
[Palamede:1965270] Signal code: Address not mapped (1)
[Palamede:1965270] Failing at address: (nil)
[Palamede:1965270] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x75ee43842520]
[Palamede:1965270] [ 1] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(_ZN16qvi_hwsplit_coll13gather_valuesIiEEiT_RSt6vectorIS1_SaIS1_EE+0x17d)[0x75ee44723cd5]
[Palamede:1965270] [ 2] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(_ZN16qvi_hwsplit_coll6gatherEv+0x40)[0x75ee447210ee]
[Palamede:1965270] [ 3] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(_ZN16qvi_hwsplit_coll5splitEPiPP10qvi_hwpool+0x33)[0x75ee4472162b]
[Palamede:1965270] [ 4] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(_ZN8qv_scope5splitEii16qv_hw_obj_type_tPPS_+0xc1)[0x75ee4472dda5]
[Palamede:1965270] [ 5] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(qv_scope_split+0x9f)[0x75ee4472ffb9]
[Palamede:1965270] [ 6] ./test-pthread-split(+0x1a8f)[0x61ce684c3a8f]
[Palamede:1965270] [ 7] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(+0x130821)[0x75ee44730821]
[Palamede:1965270] [ 8] /home/mercierg/Developpement/Git/QuoVadis/qv-build/src/libquo-vadis.so(_ZN17qvi_pthread_group30call_first_from_pthread_createEPv+0x34b)[0x75ee4470ba2f]
[Palamede:1965270] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x75ee43894ac3]
[Palamede:1965270] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x75ee43926850]
[Palamede:1965270] *** End of error message ***
Segmentation fault (core dumped)

This needs investigation.

@GuillaumeMercier GuillaumeMercier self-assigned this Oct 17, 2024
@GuillaumeMercier
Collaborator Author

GuillaumeMercier commented Oct 17, 2024

More information: qvi_pthread_group::rank() throws an out_of_range exception. It comes from map::at,
so the key being looked up (the calling thread's TID) is not present in the map.
My code does this:

    std::lock_guard<std::mutex> guard(m_mutex);
    fprintf(stdout,"[%i]=================== Querying thread rank: ",qvi_gettid());
    int rank = -1;
    try {
        rank = m_tid2rank.at(qvi_gettid());
    }
    catch (const std::out_of_range &ex) {
        fflush(stdout);
        std::cout << "1) out_of_range::what(): " << ex.what() << '\n';
    }
    fprintf(stdout," %i\n",rank);

And the program outputs this:

[1979082]=================== Querying thread rank: [1979080]=================== Querying thread rank: [1979083]=================== Querying thread rank: [1979081]=================== Querying thread rank: 1) out_of_range::what(): 1) out_of_range::what(): map::atmap::at1) out_of_range::what(): map::at

1) out_of_range::what(): map::at

 -1
 -1
 -1
 -1
[1979083] thread_subscope sgrank is -1
[1979083] thread_subscope sgsize is 2
[1979081] thread_subscope sgrank is -1
[1979081] thread_subscope sgsize is 2
[1979082] thread_subscope sgrank is -1
[1979082] thread_subscope sgsize is 2
[1979080] thread_subscope sgrank is -1
[1979080] thread_subscope sgsize is 2

The interesting thing is the sgrank value.
Remember that I'm splitting a scope of size 4 in two.
But unless I'm mistaken, the split should apply to the resources, not to the group itself.
Therefore the size should remain 4, not 2.
So I'm inclined to think that the split semantics implemented for pthreads are not correct.
@samuelkgutierrez: what is your take on this?
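
For reference, `map::at` in the output above is the `what()` string of the `std::out_of_range` that `std::map::at` throws when the key is absent. Below is a minimal sketch of a lookup that reports a missing TID without throwing. It assumes `m_tid2rank` is a `std::map` keyed by thread ID; the `thread_id` alias, the `lookup_rank` helper and the `missing_rank` sentinel are hypothetical names for illustration, not quo-vadis API.

```cpp
#include <map>

// Hypothetical alias for the value returned by something like qvi_gettid().
using thread_id = long;

// Hypothetical sentinel meaning "this TID is not a member of the group".
constexpr int missing_rank = -1;

// Hypothetical helper: returns the rank registered for tid, or missing_rank
// when tid is not a key of the map. Unlike std::map::at, this never throws
// std::out_of_range.
inline int
lookup_rank(const std::map<thread_id, int> &tid2rank, thread_id tid)
{
    const auto it = tid2rank.find(tid);
    return (it == tid2rank.end()) ? missing_rank : it->second;
}
```

This only makes the missing-key case explicit; it does not explain why the TIDs are absent from the subgroup's map in the first place.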

@samuelkgutierrez
Member

Interesting. When we split a scope, we split both the group and the parent resources. The interesting thing here is that during the split, the rank values on each side should be 0 and 1.
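
A minimal sketch of that expectation, assuming MPI_Comm_split-like semantics in which members that pick the same color form a subgroup and are ranked in increasing parent-rank order. The `subgroup_ranks` helper is hypothetical and only illustrates the expected result; it is not quo-vadis code.

```cpp
#include <map>
#include <vector>

// Hypothetical helper: colors[i] is the color chosen by the member whose
// parent rank is i. Returns each member's rank inside its color subgroup,
// assigned in increasing parent-rank order.
inline std::vector<int>
subgroup_ranks(const std::vector<int> &colors)
{
    std::map<int, int> next_rank; // color -> next free rank in that subgroup
    std::vector<int> ranks;
    ranks.reserve(colors.size());
    for (const int color : colors) {
        // operator[] value-initializes a new entry to 0.
        ranks.push_back(next_rank[color]++);
    }
    return ranks;
}
```

For four members with colors 0, 1, 0, 1 this yields subgroup ranks 0, 0, 1, 1, so each side of the split sees ranks 0 and 1.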

@GuillaumeMercier
Collaborator Author

GuillaumeMercier commented Oct 18, 2024

So, there are two issues:
1- qv_pthread_scope_split subscope management is not correct (sgsize should be 2 and not 4)
2- qv_scope_split does the right thing for the size but the rank management is not correct (in the case of splitting scopes created by a call to qv_pthread_scope_split(_at))

@GuillaumeMercier
Collaborator Author

GuillaumeMercier commented Oct 18, 2024

For point 1: qv_pthread_scope_split doesn't seem to call qvi_pthread_group::split. Shouldn't it?

    const uint_t group_size = k;
    // Split the hardware, get the hardware pools.
    qvi_hwpool **hwpools = nullptr;
    int rc = qvi_hwsplit::thread_split(
        this, npieces, kcolors, k, maybe_obj_type, &hwpools
    );
    if (rc != QV_SUCCESS) return rc;
    // Split off from our parent group. This call is called from a context in
    // which a process is splitting its resources across threads, so create a
    // new thread group for each child.
    qvi_group *thgroup = nullptr;
    rc = m_group->thsplit(group_size, &thgroup);
    if (rc != QV_SUCCESS) return rc;

I would expect something like rc = qvi_pthread_group::split(...) instead of rc = m_group->thsplit(group_size, &thgroup);. Here the group is created with a size equal to the total number of threads, which is not the size of the subgroups (as computed by qvi_pthread_group::m_subgroup_info).
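
To make the size mismatch concrete, here is a sketch that counts members per color. It only illustrates the subgroup sizes one would expect from a color-based split; it is not how `qvi_pthread_group::m_subgroup_info` is actually computed, and the `subgroup_sizes` helper is a hypothetical name.

```cpp
#include <map>
#include <vector>

// Hypothetical helper: colors[i] is the color chosen by thread i.
// Returns the number of threads that chose each color, i.e. the size
// each subgroup would have after a color-based split.
inline std::map<int, int>
subgroup_sizes(const std::vector<int> &colors)
{
    std::map<int, int> sizes; // color -> number of members
    for (const int color : colors) {
        ++sizes[color];
    }
    return sizes;
}
```

With four threads and colors 0, 1, 0, 1, both subgroups have size 2, whereas a group created with `group_size` equal to the total thread count reports 4.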

@samuelkgutierrez
Member

samuelkgutierrez commented Oct 18, 2024

This is my understanding, but I would double check this (and please correct me if I'm wrong).

  • qv_pthread_scope_split is called in the context of splitting off of a process and splitting resources among the threads that are spawned.
  • Further splitting of the pthread scopes once they are running would be performed by something like qv_scope_split, which should call qvi_pthread_group::split.

This raises a good point: are the names used here too confusing?

@GuillaumeMercier
Collaborator Author

GuillaumeMercier commented Oct 18, 2024

Ok, I'm definitely lost here. We need to discuss this at the next meeting. And yes, if you're right, that's very confusing on several levels.

@GuillaumeMercier
Collaborator Author

BTW, hwloc's physical numbering of PUs/cores is used here, which is not recommended.
For instance, in the previous example, instead of:

[1965276] cpubind=0-1,4-5
[1965277] cpubind=2-3,6-7

I should have:

[1965276] cpubind=0-2,1-3
[1965277] cpubind=4-6,5-7

Is there a particular reason to use physical numbering over logical numbering?

@samuelkgutierrez
Member

Physical numbering should not be used. If it is, that's a regression in behavior.

@GuillaumeMercier
Collaborator Author

* Further splitting of the pthread scopes once they are running would be performed by something like `qv_scope_split`, which should call `qvi_pthread_group::split`.

I tend to be against this approach because I don't see how it could be implemented correctly.
I'll explain why at the next meeting.
