
ChainUpdates: model bad peer behavior #3856

Closed
wants to merge 14 commits into master from amesgen/CAD-4314-model-bad-peers

Conversation

@amesgen amesgen commented Jul 4, 2022

Description

Closes CAD-4314

Previously, we did not generate any ChainUpdate behavior modeling a malicious or ill-configured peer. This meant that existing tests could only check that our disconnection routines are not overly restrictive, but not that they are strict enough.

This PR adds that possibility, namely by randomly toggling some blocks to be invalid. As the resulting updates do not necessarily model bad behavior, we also add a function to "classify" a sequence of chain updates after the fact.
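
For intuition, here is a minimal, self-contained sketch of what such an after-the-fact classification could look like; all names are hypothetical, the real (and richer) code lives in Test.Util.ChainUpdates:

```haskell
-- Hypothetical sketch, not the PR's actual API: after generating a sequence
-- of chain updates (some blocks randomly toggled to be invalid), classify it
-- as modeling valid or invalid peer behavior.
data Validity = Valid | Invalid
  deriving (Show, Eq)

data ChainUpdate blk
  = AddBlock blk          -- extend the current chain by one block
  | SwitchFork Int [blk]  -- roll back a number of blocks, then add new ones

-- A sequence models bad behavior iff it ever serves a block marked invalid.
classifyUpdates :: (blk -> Bool) -> [ChainUpdate blk] -> Validity
classifyUpdates isInvalidBlock updates
    | any hasInvalid updates = Invalid
    | otherwise              = Valid
  where
    hasInvalid (AddBlock b)      = isInvalidBlock b
    hasInvalid (SwitchFork _ bs) = any isInvalidBlock bs
```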

Concerning the BlockFetch client test: as certain invalid behavior is only caught by the ChainSync client, we now also run a ChainSync client in the background.

Checklist

  • Branch
    • Commit sequence broadly makes sense
    • Commits have useful messages
    • New tests are added if needed and existing tests are updated
    • If this branch changes Consensus and has any consequences for downstream repositories or end users, said changes must be documented in interface-CHANGELOG.md
    • If this branch changes Network and has any consequences for downstream repositories or end users, said changes must be documented in interface-CHANGELOG.md
    • If serialization changes, user-facing consequences (e.g. replay from genesis) are confirmed to be intentional.
  • Pull Request
    • Self-reviewed the diff
    • Useful pull request description at least containing the following information:
      • What does this PR change?
      • Why were these changes needed?
      • How does this affect downstream repositories and/or end-users?
      • Which ticket does this PR close (if any)? If it does, is it linked?
    • Reviewer requested

@amesgen amesgen requested review from nfrisby and dnadales as code owners July 4, 2022 07:48
@amesgen amesgen added the consensus (issues related to ouroboros-consensus) and testing labels Jul 4, 2022
@amesgen amesgen force-pushed the amesgen/CAD-4314-model-bad-peers branch from 10579e5 to 644413a on July 6, 2022 12:47
@iohk-bors iohk-bors bot deleted the branch master July 6, 2022 15:30
@iohk-bors iohk-bors bot closed this Jul 6, 2022
@amesgen amesgen reopened this Jul 6, 2022
@amesgen amesgen changed the base branch from amesgen/CAD-4193-BlockFetch-client-test to master July 6, 2022 16:01
@amesgen amesgen force-pushed the amesgen/CAD-4314-model-bad-peers branch from 644413a to 205193d on July 14, 2022 16:30
@amesgen amesgen requested a review from a team as a code owner July 14, 2022 16:30
@amesgen amesgen force-pushed the amesgen/CAD-4314-model-bad-peers branch from 205193d to 77852a5 on July 14, 2022 16:37
@amesgen amesgen removed the request for review from a team July 14, 2022 16:37
@amesgen amesgen force-pushed the amesgen/CAD-4314-model-bad-peers branch 2 times, most recently from c542272 to f0aa7d0 on July 21, 2022 08:36
The BFT protocol uses block numbers as its `SelectView`, so the induced order on
blocks is not total. When we generate a rollback, we usually end up with the
same chain length, which then does not improve upon the previous chain.

Right now, the peer disconnection logic relies on the interplay of the
BlockFetch and the ChainSync client to catch invalid behavior, so this commit
adds an actual ChainSync client for every peer.

Concretely, consider the case when a peer wants to extend an invalid block. In
that case, the ChainSync client will disconnect, either when the extending
header is received, or via the invalid block rejector in a background thread.

In contrast, when we simply add a block (together with a punishment) to the
ChainDB, this punishment will *not* be enacted: the block is never validated,
as it is not reachable via any (valid) block in the VolDB.
@amesgen amesgen force-pushed the amesgen/CAD-4314-model-bad-peers branch from f0aa7d0 to cae6ffd on October 18, 2022 16:22
@nfrisby (Contributor) left a comment

Nothing major. Lots of small things and a couple of open-ended questions.

4 resolved review threads (outdated) on ouroboros-consensus-test/src/Test/Util/ChainUpdates.hs
@nfrisby (Contributor) left a comment

Finished a full pass. Similar kinds of comments as in the review above.

Invalid ->
  counterexample "Invalid behavior not caught" $
    tabulateFailureMode $
      property (isLeft blockFetchRes || isLeft chainSyncRes)
Contributor:

Please write a short comment here explaining why either might catch it and why it's OK whichever does.

@@ -181,8 +221,6 @@ runBlockFetchTest BlockFetchClientTestSetup{..} = withRegistry \registry -> do
       bfClient
       bfServer

-      -- On every tick, we schedule updates to the shared chain fragment
-      -- (mocking ChainSync).
       forkTicking peerId =
Contributor:

You could preserve the comment, saying that it's the ChainSync server that's being mocked.

varChains <- uncheckedNewTVarM Map.empty
varControlMessage <- uncheckedNewTVarM Mux.Continue
varFetchedBlocks <- uncheckedNewTVarM (0 <$ peerUpdates)
varCandidates <- uncheckedNewTVarM Map.empty
Contributor:

What are the invariants among these?

E.g., TickWatcher updates both varChains and producerStateVars. It seems like those two are two views onto "the peer's current chain". Is that right? Maybe combine them into a single map of pairs?
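
For illustration, a minimal sketch of the "single map of pairs" idea, with placeholder types and hypothetical names rather than the test's actual state:

```haskell
import           Control.Concurrent.STM (TVar, newTVarIO)
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type PeerId = Int                   -- placeholder
data Chain = Chain                  -- placeholder for the peer's current chain
data ProducerState = ProducerState  -- placeholder for the mocked server state

-- One TVar holding a map of pairs, so the two views onto "the peer's current
-- chain" cannot get out of sync by construction.
newPeerState :: IO (TVar (Map PeerId (Chain, ProducerState)))
newPeerState = newTVarIO Map.empty
```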


forkChainSync peerId =
-  forkLinkedThread registry ("BracketSync" <> condense peerId) $
+  forkThread registry ("BracketSync" <> condense peerId) $
Contributor:

Hmm. Would it be equivalent to keep it a linked thread, but catch ChainSyncClientException, record them, and then terminate successfully. That way any (unexpected) other kind of exception would still be propagated via the link?
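
A minimal sketch of this pattern, with a stand-in exception type instead of the real ChainSyncClientException:

```haskell
import Control.Concurrent.STM (TVar, atomically, modifyTVar)
import Control.Exception (Exception, SomeException, fromException, throwIO, try)

-- Stand-in for ChainSyncClientException (the real type lives in
-- Ouroboros.Consensus.MiniProtocol.ChainSync.Client).
newtype ExpectedClientException = ExpectedClientException String
  deriving (Show)

instance Exception ExpectedClientException

-- Run the client action such that the *expected* exception is recorded and
-- the thread terminates successfully, while any other exception is rethrown
-- and hence still propagated via the thread link.
recordExpected :: TVar [ExpectedClientException] -> IO () -> IO ()
recordExpected var client = do
    res <- try client
    case res of
      Right () -> pure ()
      Left e
        | Just ce <- fromException e -> atomically $ modifyTVar var (ce :)
        | otherwise                  -> throwIO (e :: SomeException)
```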

Contributor:

Actually, I have the same question about the BlockFetch clients too, I suppose.

Member Author (@amesgen):

That also sounds like a possible approach; I am not necessarily opposed. I think the main thing here is that we would then no longer be able to see independently whether ChainSync or BlockFetch threw an exception (i.e. we would only have Either ChainSyncEx BlockFetchEx instead of These ChainSyncEx BlockFetchEx). Right now, we use this information e.g. for labelling, see tabulateFailureMode, which results in output like

      Expected failure due to (3388 in total):
      64.14% ChainSync
      34.65% BlockFetch,ChainSync
       1.21% BlockFetch
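
For concreteness, a toy sketch of why keeping both results matters for this labelling (hypothetical names; not the test's actual tabulateFailureMode):

```haskell
-- With both results available we can label *which* client(s) failed; a single
-- linked thread would only ever surface the first exception.
data FailureMode
  = ChainSyncOnly
  | BlockFetchOnly
  | Both
  deriving (Show, Eq)

failureMode :: Either csEx () -> Either bfEx () -> Maybe FailureMode
failureMode chainSyncRes blockFetchRes =
  case (chainSyncRes, blockFetchRes) of
    (Left _,  Left _ ) -> Just Both
    (Left _,  Right _) -> Just ChainSyncOnly
    (Right _, Left _ ) -> Just BlockFetchOnly
    (Right _, Right _) -> Nothing
```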

topLevelConfig
chainDbView
ntnVersion
(pure Mux.Continue)
Contributor:

Does this mean the ChainSync client never terminates itself successfully?

... I would guess that the bracketSyncWithFetchClient terminates the ChainSync client (successfully?) when the BlockFetch client terminates itself successfully?

Member Author (@amesgen):

I agree that "graceful" termination would be somewhat nicer, but the straightforward approach (reading from a TVar here, which we set to Terminate at the end) does not work. Maybe one just needs to adapt the ChainSync client implementation in a few places to better respect these control messages; right now, we only do that here:

https://github.com/input-output-hk/ouroboros-network/blob/2793b6993c8f6ed158f432055fa4ef581acdb661/ouroboros-consensus/src/Ouroboros/Consensus/MiniProtocol/ChainSync/Client.hs#L631

In any case, the same thing is present in the actual ChainSync client test:

https://github.com/input-output-hk/ouroboros-network/blob/2793b6993c8f6ed158f432055fa4ef581acdb661/ouroboros-consensus-test/test-consensus/Test/Consensus/MiniProtocol/ChainSync/Client.hs#L296

@nfrisby (Contributor) commented Nov 28, 2022

We chatted. It's awkward: the ChainSync client ignores the control message until it returns to the top of its loop. But in this case, at the end of the test, we won't be sending another message, so it will ignore the control message indefinitely.

Also, other tests do this. So it seems like a fine "over"-simplification to keep.

Edit: ooof, but I do want #3856 (comment) , so Esgen will try to see how much extra work it is to get this graceful termination to happen.
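
A minimal sketch of the graceful-termination idea under discussion, using a stand-in ControlMessage type rather than the real Mux API:

```haskell
import Control.Concurrent.STM (TVar, atomically, readTVar)

-- Stand-in for Mux.ControlMessage.
data ControlMessage = Continue | Terminate

-- The client re-reads the control message at the top of every iteration of
-- its loop and stops cleanly once the test finally sets it to Terminate.
clientLoop :: TVar ControlMessage -> IO () -> IO ()
clientLoop varControlMessage step = go
  where
    go = do
      cm <- atomically (readTVar varControlMessage)
      case cm of
        Terminate -> pure ()        -- terminate successfully
        Continue  -> step >> go     -- process one more message, then re-check
```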

bfcoPeerOutcomes <-
  flip Map.traverseWithKey blockFetchThreads \peerId threadId -> do
    blockFetchRes <- waitThread threadId
    chainSyncRes  <- (Map.! peerId) <$> readTVarIO varChainSyncRes
Contributor:

Hmm. Instead of this explicit mutvar stuff, could you also capture the ChainSync ThreadId, also send Mux.Terminate to the ChainSync client control vars, and read those threads' results the same as you're doing for BlockFetch? (Maybe you'd need to shutdown a peer's ChainSync before shutting down its BlockFetch, to appease bracketSyncWithFetchClient?)

Member Author (@amesgen):

Related to #3856 (comment)

    , InvalidChainBehavior
    ]
  numPeers <- case behaviorValidity behavior of
    -- Interaction of multiple clients can hide invalid behavior.
Contributor:

Oof, this is a disappointing gap. Let's chat on a call.

    genChainUpdates behavior maxRollback 20 >>= genSchedule strat
  where
    strat = case behaviorValidity behavior of
      -- Multiple updates per tick can hide invalid behavior.
Contributor:

Same

@amesgen (Member Author) commented Nov 28, 2022

Same point as in #3856 (comment)
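
To make the restriction being discussed concrete, a hypothetical sketch of the generator shape (names are illustrative, not the test's actual API):

```haskell
import Test.QuickCheck (Gen, choose)

data BehaviorValidity = ValidBehavior | InvalidBehavior

-- When the generated updates model invalid behavior, restrict to a single
-- peer (and, analogously, one update per tick), since interleaving with
-- other peers can mask the invalid behavior.
genNumPeers :: BehaviorValidity -> Gen Int
genNumPeers InvalidBehavior = pure 1
genNumPeers ValidBehavior   = choose (1, 3)
```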

The `serverUpdates` only depend on the `securityParam`, and the `clientUpdates`
additionally depend on `serverUpdates` due to `removeLateClientUpdates`.
@amesgen (Member Author) commented Nov 6, 2023

Superseded by IntersectMBO/ouroboros-consensus#492

@amesgen amesgen closed this Nov 6, 2023
@amesgen amesgen deleted the amesgen/CAD-4314-model-bad-peers branch November 6, 2023 10:02