coll/xhc not sessions friendly #13013

Open
hppritcha opened this issue Jan 2, 2025 · 6 comments
Comments

@hppritcha
Member

If I run the ompi-tests/ibm/sessions/sessions_init_twice.c test with the xhc module enabled, there's a segfault in the second initialization of MPI:

 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000000cbe35 __strlen_avx2()  :0
 2 0x000000000009f682 __GI___strdup()  :0
 3 0x00000000001ff74c mca_coll_xhc_read_op_config()  /home/hpritchard/ompi-er2/ompi/mca/coll/xhc/coll_xhc.c:390
 4 0x0000000000206ac1 mca_coll_xhc_module_enable()  /home/hpritchard/ompi-er2/ompi/mca/coll/xhc/coll_xhc_module.c:209
 5 0x0000000000170240 mca_coll_base_comm_select()  /home/hpritchard/ompi-er2/ompi/mca/coll/base/coll_base_comm_select.c:257
 6 0x00000000000757e7 ompi_comm_activate_complete()  /home/hpritchard/ompi-er2/ompi/communicator/comm_cid.c:912
 7 0x0000000000076645 ompi_comm_activate_nb_complete()  /home/hpritchard/ompi-er2/ompi/communicator/comm_cid.c:1108
 8 0x0000000000079d26 ompi_comm_request_progress()  /home/hpritchard/ompi-er2/ompi/communicator/comm_request.c:154
 9 0x0000000000027115 opal_progress()  /home/hpritchard/ompi-er2/opal/runtime/opal_progress.c:224
10 0x0000000000072d48 ompi_request_wait_completion()  /home/hpritchard/ompi-er2/ompi/../ompi/request/request.h:493
11 0x0000000000075ff7 ompi_comm_activate()  /home/hpritchard/ompi-er2/ompi/communicator/comm_cid.c:1042
12 0x000000000006f255 ompi_comm_create_from_group()  /home/hpritchard/ompi-er2/ompi/communicator/comm.c:1595
13 0x00000000000db42d PMPI_Comm_create_from_group()  /home/hpritchard/ompi-er2/ompi/mpi/c/comm_create_from_group.c:106
14 0x00000000004011d2 main()  /home/hpritchard/ompi-tests/ibm/sessions/sessions_init_twice.c:135
15 0x000000000003ad85 __libc_start_main()  ???:0
16 0x0000000000400c4e _start()  ???:0
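
For reference, the failing pattern is roughly the following (a minimal sketch of an init-twice sessions test, not the actual ibm test source; the pset name, string tag, and error handling are only illustrative):

#include <mpi.h>

/* Minimal sketch of the init/finalize-twice pattern (illustrative only;
 * the real ibm/sessions test does more). */
static void session_cycle(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    /* Communicator creation triggers coll selection, and therefore
     * mca_coll_xhc_module_enable() / read_op_config(), on every cycle. */
    MPI_Comm_create_from_group(group, "example.init.twice",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
}

int main(void)
{
    session_cycle();   /* first init/finalize */
    session_cycle();   /* segfaults here, in xhc's read_op_config() */
    return 0;
}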
@jsquyres
Member

jsquyres commented Jan 3, 2025

FYI: @gkatev

@gkatev
Contributor

gkatev commented Jan 3, 2025

Thanks for the ping, I'll check it out

@gkatev
Contributor

gkatev commented Jan 7, 2025

I'm having trouble reproducing this. Can you tell me more about your environment?

$ mpirun -n 2 --mca coll basic,libnbc,xhc --mca coll_xhc_priority 100 ./sessions_init_twice
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
Finished first finalize
Finished first finalize
Starting second init
Starting second init
Finished second init
Finished second init
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()
read_op_config()

(I'm on the latest main, 0bccfcd)

@gkatev
Contributor

gkatev commented Jan 7, 2025

Judging only from the code, a segfault at that strdup means that either something's wrong with the info cstring, or with op_mca.hierarchy (= mca_coll_xhc_component.op_mca[colltype]).

config->hierarchy_string = strdup(info_flag ?
    info_val->string : op_mca.hierarchy);

The path that involves the info stuff shouldn't be chosen by default. (I don't imagine you are setting an ompi_comm_coll_xhc_* info key somewhere??)

mca_coll_xhc_component.op_mca[].hierarchy is set in coll_xhc_component.c, either in the initialization of the global mca_coll_xhc_component, or in one of two mca_base_component_var_register() calls.
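
For readers less familiar with the MCA var machinery: registering a string component parameter generally has this shape (a hedged sketch; the variable name, default value, and flags below are illustrative and not necessarily xhc's exact registration). The detail that matters later in this thread is the last argument, the bound storage, whose string contents are owned and eventually freed by the var base.

#include "opal/mca/base/mca_base_var.h"

/* Illustrative compile-time default; during registration the var base
 * makes its own heap copy of whatever string the storage points to. */
static char *example_hierarchy = "numa,socket";

static int example_register(void)
{
    /* Bind &example_hierarchy as the storage of a string MCA variable.
     * From here on, the string held in that storage belongs to the MCA
     * var base and is freed when the var base is finalized. */
    (void) mca_base_component_var_register(
        &mca_coll_xhc_component.super.collm_version,
        "hierarchy", "Communication hierarchy to follow",
        MCA_BASE_VAR_TYPE_STRING, NULL, 0, 0,
        OPAL_INFO_LVL_5, MCA_BASE_VAR_SCOPE_READONLY,
        &example_hierarchy);

    return 0;
}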

@hppritcha
Member Author

I'll do some more debugging... I'm not setting any special MCA params for xhc.

@hppritcha
Member Author

Okay, I see what's going on. The strings in mca_coll_xhc_component are registered with the MCA system. Across a session_finalize/session_init sequence, when the app has no other sessions open (including the one generated by MPI_Init), the MCA system is torn down in the finalize, and parameters registered as MCA_BASE_VAR_TYPE_STRING are freed.

I added some debug output to xhc_register that shows this:

for op 0 hierarchy 0x61b5e0 chunk_size 0x61c020 cico_max 256
for op 1 hierarchy 0x61b5e0 chunk_size 0x61c020 cico_max 256
for op 2 hierarchy 0x61b5e0 chunk_size 0x61c020 cico_max 256
for op 0 hierarchy 0x61bd30 chunk_size 0x151d84d85f7e cico_max 0
for op 1 hierarchy 0x61bd30 chunk_size 0x151d84d85f7e cico_max 0
for op 0 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 1 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 2 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 0 hierarchy 0x61bdb0 chunk_size 0x15361fb64f7e cico_max 0
for op 1 hierarchy 0x61bdb0 chunk_size 0x15361fb64f7e cico_max 0
for op 2 hierarchy 0x61bdb0 chunk_size 0x15361fb64f7e cico_max 0
for op 0 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 1 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 2 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 0 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 0 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 1 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 2 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 0 hierarchy 0x61bdb0 chunk_size 0x14f04c08cf7e cico_max 0
for op 1 hierarchy 0x61bdb0 chunk_size 0x14f04c08cf7e cico_max 0
for op 2 hierarchy 0x61bdb0 chunk_size 0x14f04c08cf7e cico_max 0
for op 0 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 1 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 2 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 0 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 1 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 2 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 1 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 2 hierarchy 0x61b660 chunk_size 0x61c0a0 cico_max 256
for op 0 hierarchy 0x61bdb0 chunk_size 0x147d1db64f7e cico_max 0
for op 1 hierarchy 0x61bdb0 chunk_size 0x147d1db64f7e cico_max 0
for op 2 hierarchy 0x61bdb0 chunk_size 0x147d1db64f7e cico_max 0
for op 0 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 1 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 2 hierarchy 0x61c0c0 chunk_size 0x61ccd0 cico_max 4096
for op 0 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 1 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 2 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 2 hierarchy 0x61bd30 chunk_size 0x151d84d85f7e cico_max 0
for op 0 hierarchy 0x61c040 chunk_size 0x61cc50 cico_max 4096
for op 1 hierarchy 0x61c040 chunk_size 0x61cc50 cico_max 4096
for op 2 hierarchy 0x61c040 chunk_size 0x61cc50 cico_max 4096
for op 0 hierarchy 0x61c300 chunk_size 0x61ce60 cico_max 4096
for op 1 hierarchy 0x61c300 chunk_size 0x61ce60 cico_max 4096
for op 2 hierarchy 0x61c300 chunk_size 0x61ce60 cico_max 4096
for op 1 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
for op 2 hierarchy 0x61c380 chunk_size 0x61cee0 cico_max 4096
Finished first finalize
Finished first finalize
Finished first finalize
Finished first finalize
Starting second init
Starting second init
Starting second init
Starting second init
running xhc_register
running xhc_register - opal smsc thing returns -13
ran xhc_register
running xhc_register
running xhc_register - opal smsc thing returns -13
ran xhc_register
running xhc_register
running xhc_register - opal smsc thing returns -13
ran xhc_register
running xhc_register
running xhc_register - opal smsc thing returns -13
ran xhc_register
Finished second init
Finished second init
Finished second init
Finished second init
for op 0 hierarchy (nil) chunk_size (nil) cico_max 256
for op 0 hierarchy (nil) chunk_size (nil) cico_max 256

If one builds Open MPI with the --enable-mca-dso option, this problem doesn't occur, because the xhc.so module is dlopened again in the second session init, so the values used to statically initialize mca_coll_xhc_component are present again.
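
One way to make a statically built component robust to this (a sketch of the general idea only; the commit referenced just below may take a different approach, and the names and defaults here are illustrative) is to re-seed the string fields from compile-time defaults at the top of the register hook, instead of trusting whatever the previous registration left behind in the global struct:

#define EXAMPLE_COLLTYPE_MAX       3              /* illustrative */
#define EXAMPLE_HIERARCHY_DEFAULT  "numa,socket"  /* illustrative */
#define EXAMPLE_CHUNK_DEFAULT      "16K"          /* illustrative */

static int example_xhc_register(void)
{
    for (int t = 0; t < EXAMPLE_COLLTYPE_MAX; t++) {
        /* After an MCA teardown the registered strings have been freed
         * and the bound storage left NULL, so start from compile-time
         * defaults on every registration. The var base copies these
         * strings during registration, so literals are fine here. */
        mca_coll_xhc_component.op_mca[t].hierarchy  = EXAMPLE_HIERARCHY_DEFAULT;
        mca_coll_xhc_component.op_mca[t].chunk_size = EXAMPLE_CHUNK_DEFAULT;

        /* ... followed by the usual mca_base_component_var_register()
         * calls that bind these fields as string-variable storage. */
    }

    return 0;
}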

hppritcha added a commit to hppritcha/ompi that referenced this issue Jan 7, 2025
in the case of multiple session init/finalize sequences
that result in MCA framework being destructed prior to
a restart with a new session.

related to open-mpi#13013

Signed-off-by: Howard Pritchard <[email protected]>