
Content that is available in the network "not found" depending on network key / node id #1596

Open
kdeme opened this issue Dec 3, 2024 · 6 comments

@kdeme
Contributor

kdeme commented Dec 3, 2024

This is an issue that I first encountered when running a Fluffy node, but the same is reproducible with a Trin node.

The issue:

For some content key(s) there are deterministic failures when doing a RecursiveFindContent from a node with a specific NodeId / network key. This traces back to the node(s) that store the content failing to set up the uTP connection over which the data is transferred.
The node that is looking for the content does end up requesting it from the node that has the content.

When the node that has the content is then targeted directly with a FindContent request, that request also fails.
Changing the network key and running the same call will most likely work (unless you happen to hit another NodeId or NodeId range that does not work? I'm not sure about this exactly, as I don't understand what causes this in the first place).
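
As an illustration of why the local NodeId matters at all (my own sketch, not actual Trin or Fluffy code): which peers a recursive lookup ends up contacting is driven by the local routing table, which is organised by XOR distance from the local NodeId, so a different network key gives a different lookup path:

    /// Kademlia-style XOR distance between two 256-bit ids. The routing table
    /// buckets peers by this distance from the local NodeId, which is why a
    /// different network key can route the same lookup to different nodes.
    fn xor_distance(node_id: &[u8; 32], content_id: &[u8; 32]) -> [u8; 32] {
        let mut distance = [0u8; 32];
        for (i, byte) in distance.iter_mut().enumerate() {
            *byte = node_id[i] ^ content_id[i];
        }
        distance
    }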

How to reproduce the issue

Here is a way to reproduce this for one specific content key / content id combo.

  1. Build Trin: cargo build
  2. Run trin with specific network key: cargo run -- --web3-transport http --unsafe-private-key "0x24b336eb9522dbe3afa8f31a2ec277333538ff211480d32633c45c29477bebe8"
  3. Request specific block with FindContent method: curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"1","method":"portal_historyFindContent","params":["enr:-LS4QJi4_7PnH3BsKPej9J8--A1AWczIYojJIU_NWdWCIgFbflGJ4f4x35InkIMyQuBdz_XCaNx44V2X1O-dw_zprFKEZyImyGOqdCA2ZmQ0NzRmYWQ2MjNhNWY2MzVmOTFiOWVmZGE2MjI3NmJiYjE2MmZkgmlkgnY0gmlwhI_0qIWJc2VjcDI1NmsxoQIIu0viHReNIKnjhuSCXDMcC6TLKEDF5oMmqogiPISN4oN1ZHCCIzE", "0x018a24f51c42f5c1e216351c6c2ab29d2ae25fc4f366ea690a4e13c640844412e7"]}' http://localhost:8545 | jq

For Trin this results in the following, not very helpful, error message:

    "message": "Failed to parse error message from history subnetwork: Error(\"expected value\", line: 1, column: 1)"

But if you look at the Trin node logs, there is also this error:

2024-12-03T19:48:43.965343Z ERROR utp_rs::socket: failed to open connection with ConnectionId { send: 8865, recv: 8864, peer: UtpEnr { enr: Enr { id: Some("v4"), seq: 1730291400, NodeId: 0x3514da5e6fae802b62dbd381813a4fdd24d208f78506e10e7d94eaac0045354f, signature: "98b8ffb3e71f706c28f7a3f49f3ef80d4059ccc86288c9214fcd59d58222015b7e5189e1fe31df922790833242e05dcff5c268dc78e15d97d4ef9dc3fce9ac52", IpV4 UDP Socket: Some(143.244.168.133:9009), IpV6 UDP Socket: None, IpV4 TCP Socket: None, IpV6 TCP Socket: None, Other Pairs: [("c", "aa742036666434373466616436323361356636333566393162396566646136323237366262623136326664"), ("secp256k1", "a10208bb4be21d178d20a9e386e4825c331c0ba4cb2840c5e68326aa88223c848de2")] }, Peer Client Type: "t 6fd474fad623a5f635f91b9efda62276bbb162fd" } }

  4. Stop the Trin node and run it again with another key: cargo run -- --web3-transport http
  5. Do the same request: curl -s -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":"1","method":"portal_historyFindContent","params":["enr:-LS4QJi4_7PnH3BsKPej9J8--A1AWczIYojJIU_NWdWCIgFbflGJ4f4x35InkIMyQuBdz_XCaNx44V2X1O-dw_zprFKEZyImyGOqdCA2ZmQ0NzRmYWQ2MjNhNWY2MzVmOTFiOWVmZGE2MjI3NmJiYjE2MmZkgmlkgnY0gmlwhI_0qIWJc2VjcDI1NmsxoQIIu0viHReNIKnjhuSCXDMcC6TLKEDF5oMmqogiPISN4oN1ZHCCIzE", "0x018a24f51c42f5c1e216351c6c2ab29d2ae25fc4f366ea690a4e13c640844412e7"]}' http://localhost:8545 | jq

It will properly download and show the block body.
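
For scripting the same check across many content keys, something along these lines works too (a sketch assuming the same JSON-RPC endpoint and parameters as the curl calls above; the reqwest/serde_json usage is mine, not Trin's):

    use serde_json::{json, Value};

    /// Send the same portal_historyFindContent call as the curl examples above
    /// and return the raw JSON-RPC response for inspection.
    fn find_content(enr: &str, content_key: &str) -> Result<Value, reqwest::Error> {
        let body = json!({
            "jsonrpc": "2.0",
            "id": "1",
            "method": "portal_historyFindContent",
            "params": [enr, content_key],
        });
        reqwest::blocking::Client::new()
            .post("http://localhost:8545")
            .json(&body)
            .send()?
            .json()
    }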

As mentioned, the issue seems to lie rather on the other end, the node that has the content. The reason I'm creating an issue here on the Trin repo is that so far the nodes that I've seen fail on this were Trin nodes, and on the requesting side it is reproducible with both Trin and Fluffy nodes.

FYI: I built Trin from tag v0.1.0.

Linking to the fluffy issue that I created here first: status-im/nimbus-eth1#2901

@kdeme
Contributor Author

kdeme commented Dec 3, 2024

Here is another one:

  • content key: 0x01c90aa1c74e87bcd507614914482ab9cf0c397a5f0732796b5df0b47f0cbe2f33
  • priv netkey: 0xfa1c8e77ec6bee3a56d50d3bca5591a2b9c7336676cb3d69ad583946e1e4df82

ENRs that fail but have the content:
enr:-LS4QBtfNW6Fp3g4RBNS6xZMADUX_Nf2G0CMXYN_7RR1Me2ZBjiYmXoMm9ubuDv84G2LjamGr8gZG3M3nESWn8mvjLqEZyIm9mOqdCA2ZmQ0NzRmYWQ2MjNhNWY2MzVmOTFiOWVmZGE2MjI3NmJiYjE2MmZkgmlkgnY0gmlwhES3YZSJc2VjcDI1NmsxoQKOZAvGZpYdJubfPmvLmD1UCTingT9LgT9UzQjsmVnYmoN1ZHCCIzE

enr:-LS4QCFhklF9rvvRaA_5n2PT79XA3rCsCOBIHxve00G2FyO_P4BmNnnLbUVNN54zfB7k4tor1MpVZ2vwl2yah_HYXM2EZyInqWOqdCA2ZmQ0NzRmYWQ2MjNhNWY2MzVmOTFiOWVmZGE2MjI3NmJiYjE2MmZkgmlkgnY0gmlwhJgq0dSJc2VjcDI1NmsxoQKVsXnP5bf1IFLWjJ0B2IdvMh8UDTrIBxyW-DUutgTwAYN1ZHCCIzE

@morph-dev
Collaborator

Initially, I wasn't able to reproduce the issue, but now I am. And I think this is an important fact for this issue.

After some debugging, I think I know what the problem is.

The receiving client has its kbuckets full and can't store the requesting ENR in them. While processing the request, at some later point we only have the NodeId and we try to find the ENR. We look for it in various places (the first one being the kbuckets).

From what I observed, it's possible that we find a stale ENR (e.g. with a wrong ip_address or port) and try to send data to it.

We had similar problems in the past and it seems they are not fully fixed (one related PR: #935).
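
A simplified picture of the lookup chain described above (illustrative only, not the actual Trin code; plain HashMaps stand in for the real routing table and caches, and the Enr type is the one from the sigp/enr crate):

    use std::collections::HashMap;
    use enr::{Enr, EnrKey};

    type NodeId = [u8; 32];

    /// When the requester's ENR never made it into the (full) kbuckets, the
    /// NodeId -> ENR lookup falls through to other caches, and whatever is
    /// found there for that NodeId may be a stale record with an old IP/port.
    fn lookup_enr<'a, K: EnrKey>(
        node_id: &NodeId,
        kbuckets: &'a HashMap<NodeId, Enr<K>>,
        other_caches: &'a HashMap<NodeId, Enr<K>>,
    ) -> Option<&'a Enr<K>> {
        kbuckets
            .get(node_id)
            .or_else(|| other_caches.get(node_id)) // may hand back a stale ENR
    }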

I'm going to create a PR that adds extra debug logs to confirm my suspicion.

@morph-dev morph-dev self-assigned this Dec 7, 2024
@morph-dev morph-dev added the bug Something isn't working label Dec 7, 2024
@kdeme
Contributor Author

kdeme commented Dec 12, 2024

I don't really understand why you need the ENR.

Is that some specific discv5 module or utp-over-discv5 module interface thing? Because in theory you don't really need that ENR to be able to respond: as long as you have the NodeId and the IP address + port, it should be possible?

@morph-dev
Collaborator

I believe that we get IP address+port from Enr.

@kdeme
Contributor Author

kdeme commented Dec 16, 2024

> I believe that we get IP address+port from Enr.

Yes, but you will have those from the request and thus could in theory pass them along for the response & uTP setup. So I assume it is a module-specific interface design that makes you require this.
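
Something like the following is what I have in mind, purely as an illustration (the names are made up, not an existing Trin or utp-rs interface):

    use std::net::SocketAddr;

    /// Hypothetical shape of an inbound request that already carries everything
    /// needed to answer: the peer's NodeId plus the socket address the request
    /// was actually received from.
    struct InboundRequest {
        node_id: [u8; 32],
        observed_socket: SocketAddr,
    }

    /// Target the uTP transfer at where the request came from, instead of
    /// whatever a (possibly stale) ENR lookup claims for that NodeId.
    fn utp_target(req: &InboundRequest) -> ([u8; 32], SocketAddr) {
        (req.node_id, req.observed_socket)
    }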

@kdeme
Contributor Author

kdeme commented Dec 18, 2024

@morph-dev pointed out that the issue seems to occur when, for the same NodeId, there are multiple ENRs lingering around, some of which are stale and carry old IP/port information.

After looking back at my original tests I did notice the following:

  • There was a local network setup issue with the Nimbus EL with the Portal node integrated (doing a full sync), which caused the Portal node not to be reachable (behind NAT with no port forwarded / external IP available) and thus not to put an IP/port in its ENR.
  • When rerunning the same blocks with a standalone Fluffy node, I did however have my network setup in order and thus an IP in the ENR, making the node reachable.

So this definitely looks like a plausible cause of the failure. With the current design in Trin, which needs the ENR data instead of reusing the IP/port from the original request, this would indeed occasionally trigger the issue. Note that this is a valid scenario and will occasionally cause NAT'ed peers without a proper network setup to hit it.

I did some new tests:

  • I made sure that the Nimbus EL with the Portal node integrated had its network setup in order and was reachable (= IP address in the ENR). I was no longer hitting blocks that could not be found only from this NodeId. (Side note: I am still hitting blocks that are not found, but those cannot be found from any NodeId, and I have basically been re-injecting them into the network.)
  • I reran it while forcing the ENR to not have an IP/port, and I hit the original issue fairly quickly (a small check for that condition is sketched below).
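
For reference, here is a quick way to check whether a given ENR advertises any IP/port at all, i.e. whether the responder even has something to dial back for the uTP transfer (a sketch assuming the sigp/enr crate, its feature-gated CombinedKey type, and its udp4_socket/udp6_socket accessors; not client code):

    use enr::{CombinedKey, Enr};

    /// Returns true if the ENR advertises a UDP socket (IPv4 or IPv6). A NAT'ed
    /// node without a reachable external address publishes neither, which is
    /// the condition that reproduces the issue.
    fn has_advertised_socket(enr_str: &str) -> bool {
        match enr_str.parse::<Enr<CombinedKey>>() {
            Ok(enr) => enr.udp4_socket().is_some() || enr.udp6_socket().is_some(),
            Err(_) => false,
        }
    }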
