Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sudo_network_unstable_watch #91

Merged
merged 4 commits into from
Jan 23, 2024

Conversation

tomaka
Copy link
Contributor

@tomaka tomaka commented Sep 12, 2023

This is a revival of paritytech/smoldot#2245

This new function lets the JSON-RPC client subscribe to the networking state of the server, which allows visualizing what is happening, which in turn should be useful for debugging and teaching purposes.

One opinionated decision I took is that the JSON-RPC server might omit some events if the events are generated too quickly. For example, if a peer-to-peer connection opens then closes repeatedly 1000 times per second, we don't have to send 2000 events per second to the JSON-RPC client and instead we're allowed to not send anything.
This means that the JSON-RPC client might not be aware of everything that's happening. There's an event to notify it when that happens so that a warning or something can be shown to the user.

An alternative would be to report everything and generate a stop event if the queue is too large, but that's not much better, because the JSON-RPC client wouldn't be able to know what happens between the moment when the stop happens and the moment when it resubscribes. In other words, this wouldn't solve the problem and is more complicated.

There's no possible way (CAP theorem) to guarantee that the JSON-RPC client will receive events about everything that happens, unless you slow down the networking of the node to the speed of the JSON-RPC client. This isn't something that we want to happen, normally. Maybe we could imagine a privileged mode where that is the case, but I haven't included this in this PR at the moment because it seems overkill.

"event": "missedEvents"
}
```

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it also be beneficial to have an optional even reported for cases of miss-behaving / reputation banning?

Copy link
Contributor Author

@tomaka tomaka Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The networking in general has a shit-ton of concepts. Beyond reputations, you could also think of reporting ping times, handshake and request sizes, k-buckets states, grandpa neighbor packets, block announces, etc.

This PR focuses on just open connections and substreams because 1) these concepts are easy to understand 2) we know for sure that Polkadot will never not have connections and substreams.

lexnv
lexnv previously approved these changes Sep 12, 2023
{
"event": "substreamState",
"connectionId": "...",
"substreamId": "...",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question to make sure I got this right: we want to report all open-streams (ie /ipfs/ping, /ipfs/indentify, /[hash]/kad) together with notification protocols, not just notification-protocols (/[hash]/tx/1 etc)?

For notification-protocols, we could easily generate a substreamID when we establish a connection just before the handshake is done. As for getting all substreams, I believe we'd need to interface substrate somehow with yamux implementation directly, and we'd most probably end-up either:

And since we could change the muxer in the future, for example changing the transport to quic, wouldn't the substreamID feel like an implementation detail? (Since we'd need to find a workaround to expose the substreamID from the transport level in that case) 🤔

Let me know if I got this right, or if we could achieve the same behavior easier 🙏

// cc @tomaka @altonen @jsdw

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you just need a wrapper (a "middleware") of the StreamMuxer trait, no need to touch Yamux itself. Every type of connection implements StreamMuxer and is only ever accessed through this trait by the rest of the stack.

However what I didn't realize is that getting the protocol name is kind of tricky, because it is done by sending messages within the substream.
Substreams themselves don't inherently have protocol names, it's just that in libp2p every single substream starts with a handshake that negotiates its protocol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substreams themselves don't inherently have protocol names, it's just that in libp2p every single substream starts with a handshake that negotiates its protocol.

Which also means that this RFC needs to clarify what to do with substreams whose protocol hasn't been negotiated yet.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these meant to be implemented for libp2p protocols in the first place? Substrate may have no knowledge that a substream for Kademlia or PING has been opened so there would have to some tracking implemented in libp2p that would expose this information to Substrate.

I don't see the problem with protocol names though. Once the protocol has been negotiated, the name is known and can be reported. If there was a separate "opening" state, then it would be more challenging.

Copy link
Contributor Author

@tomaka tomaka Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these meant to be implemented for libp2p protocols in the first place? Substrate may have no knowledge that a substream for Kademlia or PING has been opened so there would have to some tracking implemented in libp2p that would expose this information to Substrate.

Substrate entirely wraps around all the libp2p protocols: https://github.com/paritytech/polkadot-sdk/blob/a56fd32e0affbe9cb61abb5ee3ec0a0bd15f5bf0/substrate/client/network/src/behaviour.rs#L45-L47

I think that you can do it by:

In your NetworkBehaviour wrapper, you can modify the ToSwarm type to include events about substreams being opened/closed, and then intercept these events in the main network events handler.

I haven't touched these traits in years now, so I might be missing something, but I think it should work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can do it
Adding a wrapper around the ConnectionHandler trait that intercepts on_connection_event.
Adding a wrapper around the NetworkBehaviour trait so that you can wrap around the ConnectionHandler

From looking at the code, my findings also point in this direction. Thanks!

I started to implement a BehaviorWrapper a few days ago, which looked like this:

pub struct BehaviorWrapper<B: BlockT> {
  behavior: Behavior<B>,
}

impl NetworkBehavior for ... { }

This was the first step to expose the ConnectionID since the SwarmEvents from our libp2p version does not expose it at the moment.

I gave up on going that path because of two reasons:

However, I cannot think of a better way than what Pierre suggested:

pub struct BehaviorWrapper<B: BlockT> {
  behavior: Behavior<B>,
}

/// Must propagate the connection ID for proper identification of
/// connection IDs + streamIDs
pub struct HandlerWrapper {
  connectionID: ...,
  ..
}

impl NetworkBehavior for BehaviorWrapper {
  type ConnectionHandler = HandlerWrapper;

fn handle_established_inbound_connection(connectionId, ..) {
 HandlerWrapper::new(connectionId, is_inbound=true, ...)
}

fn handle_established_outbound_connection(connectionId, ..) {
 HandlerWrapper::new(connectionId, is_inbound=false, ...)
}

...
}


impl ConnectionHandler for HandlerWrapper {
// Might have missed something here 
// types = tbd / maybe other wrappers here

fn on_connection_event(..) {
   // Generate a stream ID
   stream_id = AtomicIndex::increment();
   
   // Propagate up to the swarm
   MetricsEventWrapper::CustomProtocolOpen { connectionId, streamId, protocolName, direction, .. }
}

This feels indeed a bit involved and I'd resume implementation once the substrate-p2p refactor is complete:

Since I expect most of this work will generate conflicts with those PRs

Copy link
Collaborator

@jsdw jsdw Sep 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prefix of the function in this PR is sudo_network, so it's only the sudo_network that aren't implemented.

Ah ok, I see that now; the other groups all have a singe word like chainHead_ or chainSpec_ or archive_, but we have sudo_, sudo_sessionKeys_ and now sudo_network_ as three more groups, which I was confused by since it's a bit inconsistent in my mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit inconsistent in my mind.

Yeah it is, because ultimately the sudo functions should be renamed.

Copy link

@altonen altonen Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexnv @jsdw

If the PoC/prototype for this code turns out to add non-trivial complexity to sc-network, I hope you reconsider adding this feature.

The libp2p upgrades are really painful to deal with and I would hope to reduce the surface area of libp2p in sc-network to leave us more time to work on things other than dealing with libp2p and its upgrades. That looks to be at odds with the proposed wrapper implementation and my concern is that if this wrapper adds more code that has to be maintained during each libp2p upgrade, it just leaves us with less time to work on actually improving the Polkadot networking code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't answer that, but I didn't intend this to be implemented in Substrate quickly.

The reason for this function comes from a very pragmatic problem: most of the time, when smoldot isn't working, it's because of a connectivity issue.

When you're running a Substrate node of some kind, you're either a developer or a node operator. You can read logs, run Prometheus, open GitHub issues, etc.

When you're running smoldot, you're an end user, and you shouldn't be reading logs. The only reasonable way to help end users figure out why smoldot isn't working is put some kind of UI on top of it. This JSON-RPC function is what would power this UI.

@tomaka
Copy link
Contributor Author

tomaka commented Nov 16, 2023

As a heads up, I'm going to implement this in smoldot (merged or not). It's one of the milestones that I've put in the smoldot financing treasury proposal.

Copy link
Collaborator

@jsdw jsdw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see this merge; it's unstable anyway, and probably mostly useful in Smoldot and perhaps less so in Substrate if I understand right!

@josepot
Copy link
Contributor

josepot commented Jan 23, 2024

I have reviewed this PR and I think that it's ready to be merged. Also, I can't wait to leverage this from polkadot-api and it seems that smoldot has already made a lot of progress.

As @tomaka already explained: this API is mostly meant for light-clients and it lives in its own group of functions. Meaning that a simple query to rpc_methods would let the consumer know whether this API is supported. Perhaps, it could make sense to have a separate/complementary API for full-nodes/substrate?

@josepot josepot merged commit 607b71b into paritytech:main Jan 23, 2024
3 checks passed
@tomaka tomaka deleted the sudo_network_unstable_watch branch January 23, 2024 11:38
@tomaka
Copy link
Contributor Author

tomaka commented Jan 25, 2024

I've encountered an issue while implementing this in smoldot due to a corner case.

Let's say that you start an instance of smoldot and add the Polkadot and Kusama chains, and add the same IP address and PeerId as bootnode at the same time to both Polkadot and Kusama.

When I wrote this RFC, smoldot would open networking connections for Polkadot and networking connections for Kusama. That's what is reflected in this RFC. In that example scenario, smoldot would open two connections to the same IP address and PeerId, and thus would report through the JSON-RPC function only events that concern the connection of the chain corresponding to that JSON-RPC subscription.

However, smoldot now supports sharing connections. In that example scenario, smoldot would only open one connection and use the same connection for both Polkadot and Kusama.
So let's imagine that smoldot connects to that IP address, reports a JSON-RPC event about the connection being open to the active subscriptions, but then realizes that the node is only capable of serving Polkadot and not Kusama.
Previously, smoldot would close the Kusama connection and keep the Polkadot one, but now smoldot would keep the connection alive anyway and simply ignore that node when it comes to Kusama.

This leads to a tricky question: should smoldot, in that case, generate an event to the Kusama JSON-RPC subscription that the connection has been closed, even though it hasn't actually been closed? Doing so would be the logical thing to do, but would be misleading. The objective of this JSON-RPC function is to show to the user what smoldot is doing, and reporting events that don't match the reality (even when logically coherent) goes against this objective.

An alternative would be to report every single connection to every subscription, no matter if this connection is going to be used for the chain the subscription corresponds to. This is also misleading and spammy.

I don't really know right now how to solve that problem.

@josepot
Copy link
Contributor

josepot commented Jan 25, 2024

This leads to a tricky question: should smoldot, in that case, generate an event to the Kusama JSON-RPC subscription that the connection has been closed, even though it hasn't actually been closed?

I'm inclined to think that smoldot should generate an event for the Kusama JSON-RPC subscription indicating the connection is closed, focusing on the "logical connection".

From the API consumer's perspective, the connection is effectively "closed" once Smoldot ceases communication with the Kusama node, regardless of the physical connection status. This approach would deem the persistence of the physical connection as an implementation detail.

Given the API is primary intended to be implemented by light-clients, it would be beneficial to update the specification to clearly differentiate between "logical" and "physical" connections, emphasizing that the API's notifications pertain to the "logical connection".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants