Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monitoring can report to somebody else's monitoring hub #3287

Open
benclifford opened this issue Mar 24, 2024 · 2 comments
Open

Monitoring can report to somebody else's monitoring hub #3287

benclifford opened this issue Mar 24, 2024 · 2 comments

Comments

@benclifford
Copy link
Collaborator

Describe the bug

I saw this in a superficially-unrelated PR #3286

2024-03-24T10:16:18.0960186Z parsl/tests/test_monitoring/test_htex_init_blocks_vs_monitoring.py .F
2024-03-24T10:16:18.0960831Z 
2024-03-24T10:16:18.0961086Z =================================== FAILURES ===================================
2024-03-24T10:16:18.1023586Z ____________________________ test_row_counts[none] _____________________________
2024-03-24T10:16:18.1024233Z 
2024-03-24T10:16:18.1025500Z tmpd_cwd = PosixPath('/home/runner/work/parsl/parsl/.pytest/parsltest-current/test_row_counts_none_-bqmyez8f')
2024-03-24T10:16:18.1026682Z strategy = 'none'
2024-03-24T10:16:18.1026952Z 
2024-03-24T10:16:18.1027125Z     @pytest.mark.local
2024-03-24T10:16:18.1027956Z     @pytest.mark.parametrize("strategy", ('none', 'simple', 'htex_auto_scale'))
2024-03-24T10:16:18.1028832Z     def test_row_counts(tmpd_cwd, strategy):
2024-03-24T10:16:18.1029561Z         # this is imported here rather than at module level because
2024-03-24T10:16:18.1030543Z         # it isn't available in a plain parsl install, so this module
2024-03-24T10:16:18.1031425Z         # would otherwise fail to import and break even a basic test
2024-03-24T10:16:18.1032096Z         # run.
2024-03-24T10:16:18.1032462Z         import sqlalchemy
2024-03-24T10:16:18.1033307Z         from sqlalchemy import text
2024-03-24T10:16:18.1033992Z     
2024-03-24T10:16:18.1034479Z         db_url = f"sqlite:///{tmpd_cwd}/monitoring.db"
2024-03-24T10:16:18.1035221Z         parsl.load(fresh_config(tmpd_cwd, strategy, db_url))
2024-03-24T10:16:18.1035839Z     
2024-03-24T10:16:18.1036168Z         this_app().result()
2024-03-24T10:16:18.1036583Z     
2024-03-24T10:16:18.1036918Z         parsl.dfk().cleanup()
2024-03-24T10:16:18.1037370Z         parsl.clear()
2024-03-24T10:16:18.1037758Z     
2024-03-24T10:16:18.1038153Z         engine = sqlalchemy.create_engine(db_url)
2024-03-24T10:16:18.1038775Z         with engine.begin() as connection:
2024-03-24T10:16:18.1039302Z     
2024-03-24T10:16:18.1039945Z             result = connection.execute(text("SELECT COUNT(DISTINCT block_id) FROM block"))
2024-03-24T10:16:18.1040796Z             (c, ) = result.first()
2024-03-24T10:16:18.1041504Z             assert c == 1, "We should see a single block in this database"
2024-03-24T10:16:18.1042205Z     
2024-03-24T10:16:18.1043226Z             result = connection.execute(text("SELECT COUNT(*) FROM block WHERE block_id = 0 AND status = 'PENDING'"))
2024-03-24T10:16:18.1045240Z             (c, ) = result.first()
2024-03-24T10:16:18.1045913Z >           assert c == 1, "There should be a single pending status"
2024-03-24T10:16:18.1046767Z E           AssertionError: There should be a single pending status
2024-03-24T10:16:18.1047448Z E           assert 2 == 1
2024-03-24T10:16:18.1047742Z 
2024-03-24T10:16:18.1048245Z parsl/tests/test_monitoring/test_htex_init_blocks_vs_monitoring.py:78: AssertionError
@benclifford benclifford changed the title failure in htex init blocks vs monitoring test in CI Failure in htex init blocks vs monitoring test in CI Mar 24, 2024
@benclifford
Copy link
Collaborator Author

Inspecting the artifacts (monitoring.db and log files), it looks like there is a PENDING status recorded in the database for two different run IDs. In database_manager.log:

2024-03-24 10:16:12.037 database_manager:619 [INFO] [Monitoring-migrate-block 140008366049024] Will put [{'run_id': '3fde70cb-ce68-4473-af86-874f65ae156f', 'status': 'PENDING', 'timestamp': datetime.datetime(2024, 3, 24, 10, 10, 56, 544680), 'executor_label': 'htex_Local', 'job_id': '18046', 'block_id': '0'}] to pending block queue
...
2024-03-24 10:16:12.164 database_manager:619 [INFO] [Monitoring-migrate-block 140008366049024] Will put [{'run_id': '72783ebc-417e-4890-903d-683150696422', 'status': 'PENDING', 'timestamp': datetime.datetime(2024, 3, 24, 10, 16, 12, 162436), 'executor_label': 'htex_local', 'job_id': '27303', 'block_id': '0'}] to pending block queue

The run for this test is 72783ebc-417e-4890-903d-683150696422 so the first PENDING message (with run ID 3fde70cb-ce68-4473-af86-874f65ae156f) is for the wrong run

sqlite> select * from block;
3fde70cb-ce68-4473-af86-874f65ae156f|htex_Local|0|18046|2024-03-24 10:10:56.544680|PENDING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:12.162436|PENDING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:12.219174|RUNNING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:14.789575|CANCELLED

That bad pending message is from a run around 6 seconds before.

None of the other monitoring.db databases in this test run contain any other reference to that bad 3fde7... workflow run, and in the relevant run for that bad run, it looks like the monitoring router hung on startup:

$ cat runinfo/008/monitoring_router.log 
1711275051.469708 2024-03-24 10:10:51 Monitoring-Router-Process-18007 MainThread-140017148053312 monitoring_router:85 __init__ INFO: Initialized the UDP socket on 0.0.0.0:55055
1711275051.470886 2024-03-24 10:10:51 Monitoring-Router-Process-18007 MainThread-140017148053312 monitoring_router:203 router_starter INFO: Starting MonitoringRouter in router_starter
$

so then the ZMQ layer would likely have sat around in the pytest processing continuing to try to asynchronously sent a monitoring message - and eventually it would have found one.

So two pieces to this issue:

i) monitoring startup hung without failing the test - this is, in some part, a (mis?)feature of the original monitoring design, based I think on lack of confidence in monitoring code turning into "we can't let broken monitoring break 'real' parsl"

ii) this is the monitoring version of htex issue #2199 - components from one run can connect to the components from another run

@benclifford
Copy link
Collaborator Author

The reason that monitoring router did not continue to log stuff is not because of a hang, but because this 3fde7... run is a test of killing monitoring components - so it killed the monitoring router deliberately. So this test is successfully detecting (although presenting it in a very awkward way) some misbehaviour around there.

@benclifford benclifford changed the title Failure in htex init blocks vs monitoring test in CI Monitoring can report to somebody else's monitoring hub Mar 25, 2024
benclifford added a commit that referenced this issue Mar 26, 2024
A more principled fix of #3287 looks to be much more deeply invasive, and
involves a big rework of how ZMQ is used.
benclifford added a commit that referenced this issue Mar 26, 2024
…3293)

A more principled fix of #3287 looks to be much more deeply invasive, and
involves a big rework of how ZMQ is used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant