Monitoring can report to somebody else's monitoring hub #3287

benclifford · 2024-03-24T10:36:35Z

Describe the bug

I saw this in a superficially-unrelated PR #3286

2024-03-24T10:16:18.0960186Z parsl/tests/test_monitoring/test_htex_init_blocks_vs_monitoring.py .F
2024-03-24T10:16:18.0960831Z 
2024-03-24T10:16:18.0961086Z =================================== FAILURES ===================================
2024-03-24T10:16:18.1023586Z ____________________________ test_row_counts[none] _____________________________
2024-03-24T10:16:18.1024233Z 
2024-03-24T10:16:18.1025500Z tmpd_cwd = PosixPath('/home/runner/work/parsl/parsl/.pytest/parsltest-current/test_row_counts_none_-bqmyez8f')
2024-03-24T10:16:18.1026682Z strategy = 'none'
2024-03-24T10:16:18.1026952Z 
2024-03-24T10:16:18.1027125Z     @pytest.mark.local
2024-03-24T10:16:18.1027956Z     @pytest.mark.parametrize("strategy", ('none', 'simple', 'htex_auto_scale'))
2024-03-24T10:16:18.1028832Z     def test_row_counts(tmpd_cwd, strategy):
2024-03-24T10:16:18.1029561Z         # this is imported here rather than at module level because
2024-03-24T10:16:18.1030543Z         # it isn't available in a plain parsl install, so this module
2024-03-24T10:16:18.1031425Z         # would otherwise fail to import and break even a basic test
2024-03-24T10:16:18.1032096Z         # run.
2024-03-24T10:16:18.1032462Z         import sqlalchemy
2024-03-24T10:16:18.1033307Z         from sqlalchemy import text
2024-03-24T10:16:18.1033992Z     
2024-03-24T10:16:18.1034479Z         db_url = f"sqlite:///{tmpd_cwd}/monitoring.db"
2024-03-24T10:16:18.1035221Z         parsl.load(fresh_config(tmpd_cwd, strategy, db_url))
2024-03-24T10:16:18.1035839Z     
2024-03-24T10:16:18.1036168Z         this_app().result()
2024-03-24T10:16:18.1036583Z     
2024-03-24T10:16:18.1036918Z         parsl.dfk().cleanup()
2024-03-24T10:16:18.1037370Z         parsl.clear()
2024-03-24T10:16:18.1037758Z     
2024-03-24T10:16:18.1038153Z         engine = sqlalchemy.create_engine(db_url)
2024-03-24T10:16:18.1038775Z         with engine.begin() as connection:
2024-03-24T10:16:18.1039302Z     
2024-03-24T10:16:18.1039945Z             result = connection.execute(text("SELECT COUNT(DISTINCT block_id) FROM block"))
2024-03-24T10:16:18.1040796Z             (c, ) = result.first()
2024-03-24T10:16:18.1041504Z             assert c == 1, "We should see a single block in this database"
2024-03-24T10:16:18.1042205Z     
2024-03-24T10:16:18.1043226Z             result = connection.execute(text("SELECT COUNT(*) FROM block WHERE block_id = 0 AND status = 'PENDING'"))
2024-03-24T10:16:18.1045240Z             (c, ) = result.first()
2024-03-24T10:16:18.1045913Z >           assert c == 1, "There should be a single pending status"
2024-03-24T10:16:18.1046767Z E           AssertionError: There should be a single pending status
2024-03-24T10:16:18.1047448Z E           assert 2 == 1
2024-03-24T10:16:18.1047742Z 
2024-03-24T10:16:18.1048245Z parsl/tests/test_monitoring/test_htex_init_blocks_vs_monitoring.py:78: AssertionError

The text was updated successfully, but these errors were encountered:

benclifford · 2024-03-24T11:13:58Z

Inspecting the artifacts (monitoring.db and log files), it looks like there is a PENDING status recorded in the database for two different run IDs. In database_manager.log:

2024-03-24 10:16:12.037 database_manager:619 [INFO] [Monitoring-migrate-block 140008366049024] Will put [{'run_id': '3fde70cb-ce68-4473-af86-874f65ae156f', 'status': 'PENDING', 'timestamp': datetime.datetime(2024, 3, 24, 10, 10, 56, 544680), 'executor_label': 'htex_Local', 'job_id': '18046', 'block_id': '0'}] to pending block queue
...
2024-03-24 10:16:12.164 database_manager:619 [INFO] [Monitoring-migrate-block 140008366049024] Will put [{'run_id': '72783ebc-417e-4890-903d-683150696422', 'status': 'PENDING', 'timestamp': datetime.datetime(2024, 3, 24, 10, 16, 12, 162436), 'executor_label': 'htex_local', 'job_id': '27303', 'block_id': '0'}] to pending block queue

The run for this test is 72783ebc-417e-4890-903d-683150696422 so the first PENDING message (with run ID 3fde70cb-ce68-4473-af86-874f65ae156f) is for the wrong run

sqlite> select * from block;
3fde70cb-ce68-4473-af86-874f65ae156f|htex_Local|0|18046|2024-03-24 10:10:56.544680|PENDING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:12.162436|PENDING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:12.219174|RUNNING
72783ebc-417e-4890-903d-683150696422|htex_local|0|27303|2024-03-24 10:16:14.789575|CANCELLED

That bad pending message is from a run around 6 seconds before.

None of the other monitoring.db databases in this test run contain any other reference to that bad 3fde7... workflow run, and in the relevant run for that bad run, it looks like the monitoring router hung on startup:

$ cat runinfo/008/monitoring_router.log 
1711275051.469708 2024-03-24 10:10:51 Monitoring-Router-Process-18007 MainThread-140017148053312 monitoring_router:85 __init__ INFO: Initialized the UDP socket on 0.0.0.0:55055
1711275051.470886 2024-03-24 10:10:51 Monitoring-Router-Process-18007 MainThread-140017148053312 monitoring_router:203 router_starter INFO: Starting MonitoringRouter in router_starter
$

so then the ZMQ layer would likely have sat around in the pytest processing continuing to try to asynchronously sent a monitoring message - and eventually it would have found one.

So two pieces to this issue:

i) monitoring startup hung without failing the test - this is, in some part, a (mis?)feature of the original monitoring design, based I think on lack of confidence in monitoring code turning into "we can't let broken monitoring break 'real' parsl"

ii) this is the monitoring version of htex issue #2199 - components from one run can connect to the components from another run

benclifford · 2024-03-24T11:16:07Z

The reason that monitoring router did not continue to log stuff is not because of a hang, but because this 3fde7... run is a test of killing monitoring components - so it killed the monitoring router deliberately. So this test is successfully detecting (although presenting it in a very awkward way) some misbehaviour around there.

A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.

…3293) A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.

benclifford added the bug label Mar 24, 2024

benclifford changed the title ~~failure in htex init blocks vs monitoring test in CI~~ Failure in htex init blocks vs monitoring test in CI Mar 24, 2024

benclifford added the monitoring label Mar 25, 2024

benclifford changed the title ~~Failure in htex init blocks vs monitoring test in CI~~ Monitoring can report to somebody else's monitoring hub Mar 25, 2024

benclifford added a commit that referenced this issue Mar 26, 2024

Tighten database access in htex/monitoring test to avoid issue #3287

696c9fb

A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.

benclifford mentioned this issue Mar 26, 2024

Tighten database access in htex/monitoring test to avoid issue #3287 #3293

Merged

benclifford added a commit that referenced this issue Mar 26, 2024

Tighten database access in htex/monitoring test to avoid issue #3287 (#…

20c7bf3

…3293) A more principled fix of #3287 looks to be much more deeply invasive, and involves a big rework of how ZMQ is used.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monitoring can report to somebody else's monitoring hub #3287

Monitoring can report to somebody else's monitoring hub #3287

benclifford commented Mar 24, 2024

benclifford commented Mar 24, 2024

benclifford commented Mar 24, 2024

Monitoring can report to somebody else's monitoring hub #3287

Monitoring can report to somebody else's monitoring hub #3287

Comments

benclifford commented Mar 24, 2024

benclifford commented Mar 24, 2024

benclifford commented Mar 24, 2024