Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/component sync #3269

Merged
merged 26 commits into from
Dec 12, 2024
Merged

Fix/component sync #3269

merged 26 commits into from
Dec 12, 2024

Conversation

andre-merzky
Copy link
Member

This PR cleans up the component startup synchronization by using a zmq pipe instead of heartbeat messages. The latter became superfluous as we have competing termination mechanisms in place: pwatcher, termination commands, and heartbeats. The latter now got removed to avoid conflicts in termination procedures.

@andre-merzky
Copy link
Member Author

Tests currently fail as this PR depends on radical-cybertools/radical.utils#428.

Copy link

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 33.87097% with 82 lines in your changes missing coverage. Please review.

Project coverage is 42.79%. Comparing base (c764c00) to head (0230d1e).
Report is 27 commits behind head on devel.

Files with missing lines Patch % Lines
src/radical/pilot/utils/component_manager.py 7.69% 24 Missing ⚠️
src/radical/pilot/utils/component.py 9.52% 19 Missing ⚠️
src/radical/pilot/session.py 37.93% 18 Missing ⚠️
src/radical/pilot/agent/agent_0.py 44.44% 5 Missing ⚠️
src/radical/pilot/pilot.py 50.00% 5 Missing ⚠️
src/radical/pilot/proxy.py 0.00% 4 Missing ⚠️
src/radical/pilot/task_manager.py 0.00% 3 Missing ⚠️
src/radical/pilot/utils/staging_helper.py 0.00% 2 Missing ⚠️
src/radical/pilot/pilot_manager.py 0.00% 1 Missing ⚠️
src/radical/pilot/states.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3269      +/-   ##
==========================================
- Coverage   42.83%   42.79%   -0.05%     
==========================================
  Files          97       97              
  Lines       11282    11308      +26     
==========================================
+ Hits         4833     4839       +6     
- Misses       6449     6469      +20     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andre-merzky
Copy link
Member Author

@AymenFJA , @mtitov : this is ready for review now, tests are passing after the latest RU merge.

src/radical/pilot/agent/executing/base.py Show resolved Hide resolved
self._log.error("[Callback]: pilot '%s' failed (exit)", uid)
self._sub.stop()
self._sub = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we setting self._sub = None after failure? I am assuming it will not affect anything because the pilot failed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right, that's not needed anymore - fixed!

@@ -71,7 +71,7 @@ def _pilot_state_progress(pid, current, target):
# allow to transition from FAILED to DONE (done gets picked up from DB,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done gets picked up from DB -->no more DB!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, also fixed now.

@andre-merzky
Copy link
Member Author

TODO AM: CI examples should only be done in release

@andre-merzky
Copy link
Member Author

@mtitov , @AymenFJA : tests are passing, this is ready for review again. Thanks!

@AymenFJA
Copy link
Contributor

Thanks @andre-merzky this looks good to me. I also just ran it Ben's test:

import pytest
import threading
import parsl
from parsl.tests.configs.local_radical_mpi import fresh_config as local_config


@parsl.python_app
def some_mpi_func(msg, sleep, comm=None, parsl_resource_specification={}):
    import time
    msg = 'hello %d/%d: %s' % (comm.rank, comm.size, msg)
    time.sleep(sleep)
    print(msg)
    return comm.size


apps = []


@pytest.mark.local
@pytest.mark.radical
def test_radical_mpi(n=7):
    # rank size should be > 1 for the
    # radical runtime system to run this function in MPI env
    for i in range(2, n):
        spec = {'ranks': i}
        t = some_mpi_func(msg='mpi.func.%06d' % i, sleep=1, comm=None, parsl_resource_specification=spec)
        apps.append(t)
    assert [len(app.result()) for app in apps] == list(range(2, n))

assert threading.active_count() == 1, "test left threads running: " + repr(threading.enumerate())

and it passed:

(parsl_rp) aymen@alienware:~/RADICAL/PARSL/parsl$ pytest parsl/tests/test_radical/test_mpi_funcs.py --config local
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.12.3, pytest-7.4.4, pluggy-1.5.0
Test order randomisation NOT enabled. Enable with --random-order or --random-order-bucket=<bucket_type>
rootdir: /home/aymen/RADICAL/PARSL/parsl/parsl/tests
configfile: pytest.ini
plugins: cov-6.0.0, typeguard-4.4.1, random-order-1.1.1
collected 1 item                                                                                                                                                           

parsl/tests/test_radical/test_mpi_funcs.py .                                                                                                                         [100%]

============================================================================= warnings summary =============================================================================
parsl/tests/test_radical/test_mpi_funcs.py:20
  /home/aymen/RADICAL/PARSL/parsl/parsl/tests/test_radical/test_mpi_funcs.py:20: PytestUnknownMarkWarning: Unknown pytest.mark.radical - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    @pytest.mark.radical

test_radical/test_mpi_funcs.py::test_radical_mpi
  /home/aymen/ve/parsl_rp/lib/python3.12/site-packages/radical/utils/atfork/stdlib_fixer.py:61: UserWarning: Import `radical` modules before `logging` to avoid the application to deadlock on `fork()`!
    warnings.warn(msg)

test_radical/test_mpi_funcs.py::test_radical_mpi
  /home/aymen/ve/parsl_rp/lib/python3.12/site-packages/radical/utils/atfork/atfork.py:217: DeprecationWarning: This process (pid=2653458) is multi-threaded, use of fork() may lead to deadlocks in the child.
    pid = _orig_os_fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 1 passed, 3 warnings in 25.43s ======================================================================
(parsl_rp) aymen@alienware:~/RADICAL/PARSL/parsl$ 

Copy link
Contributor

@AymenFJA AymenFJA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @andre-merzky

@andre-merzky andre-merzky merged commit c23c288 into devel Dec 12, 2024
31 checks passed
@andre-merzky andre-merzky deleted the fix/component_sync branch December 12, 2024 15:15
github-merge-queue bot pushed a commit to Parsl/parsl that referenced this pull request Dec 14, 2024
In addition to the work happening
[here](radical-cybertools/radical.pilot#3269) to
address this issue #3708. This PR
ensures the cleanup of threads generated by RPEX.
- Bug fix (partially #3708)

---------

Co-authored-by: Ben Clifford <[email protected]>
github-merge-queue bot pushed a commit to Parsl/parsl that referenced this pull request Dec 14, 2024
In addition to the work happening
[here](radical-cybertools/radical.pilot#3269) to
address this issue #3708. This PR
ensures the cleanup of threads generated by RPEX.
- Bug fix (partially #3708)

---------

Co-authored-by: Ben Clifford <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants