Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continue the RecursiveFindContent query until the uTP transfer is complete and the content is validated #1395

Merged
merged 1 commit into from
Sep 20, 2024

Conversation

carver
Copy link
Collaborator

@carver carver commented Aug 24, 2024

What was wrong?

Work toward #1383

How was it fixed?

This PR will delay completion of RFC until data is validated, but will not continue the search with other peers. I decided it was getting too big. I will do the follow-up work in another PR.

There are quite a few moving parts here, so the goal was to minimize the
number of functional changes in this refactor.

Overall: add a new query state for while the content is being validated.
Don't exit the query until validation is complete.

Some other changes along the way:

Moving the bigendian conversion of connection ID

Note that when handling the FindContentQueryResponse::ConnectionId in
findcontent.rs, we don't do the bigendian conversion anymore. That has
been moved to earlier in the process, when handling the Content message,
to reduce the number of places that the conversion needs to be handled.

The query trace has slightly different end-stage behavior

The received_from peer and the cancelled peers are not set in the query
until the content gets validated.

Naturally, tests need to reflect the new query structure

For example:

  • the query trace test needs to have the content be marked validated to
    progress
  • update overlay service tests to use new Validating query step
  • immediately mark content as verified, after marking it as received,
    during the test

bugfix: the content search continues even if many peers don't have it

Formerly, the content query would exit out if too many peers replied
successfully (even if the responses were just more nodes to look up).

Best we can tell, the limit on the number of maximum results is
copy-pasta from the recursive-find-nodes logic. (Confirmation from
Mike)

Now, the query continues until the content is found. It doesn't track
the number of successful responses at all.

(Perhaps worth noting that tests fail without this last change. I didn't look that hard into why they weren't failing before)

Refactor

At first, I was trying to change as little as possible. That meant that I kind of shoe-horned the pending result into the Query::Result enum. It works, but is ugly, because of having to handle the pending results in some logic, and the final results in other logic.

So e0e5437 goes a little further and adds a new Pending result enum. It still leaves us with some assertions that certain code paths should be unreachable, but I don't see any better alternatives. I explored another Query trait type, but it required duplicating too much of query_event_poll logic. So I landed on this approach, largely implemented by @morph-dev

To-Do

  • trin compiles locally
  • Manual gut-check that RFC still works on the happy path (when this PR was first opened, it did not yet)
    • always delay Finished until validation succeeds (previously it was exploring all nodes, then exiting immediately)
    • Fix uTP transfer. I seem to have broken it, because it never connects successfully now. Last I checked, it was still working fine on master
  • Constrain active number of downloads/validations to 1
    • hacky version that only allows one concurrent
    • stop querying peers for content when there are active validations
    • move the follow-ups to the tracking issue
  • Tests
    • ethportal-api
    • overlay service
    • findcontent
    • findnodes
    • peertest
  • The FindContent queries shouldn't have a maximum number of results, it should just run till completion. In findcontent.rs, see if *count >= self.config.num_results {
  • Rebase on Credit first peer to return content #1447
    • after 1447 tests pass
    • after merge
  • After addressing PR feedback, clean up commit history and use conventional commits.

@carver carver changed the title {WIP] Continue the RecursiveFindContent query until the uTP transfer is complete and the content is validated [WIP] Continue the RecursiveFindContent query until the uTP transfer is complete and the content is validated Aug 24, 2024
@carver carver self-assigned this Aug 24, 2024
@carver carver force-pushed the rfc-run-till-validation branch 2 times, most recently from b3bf363 to 03ebb02 Compare September 9, 2024 03:53
@carver carver mentioned this pull request Sep 13, 2024
1 task
@carver carver force-pushed the rfc-run-till-validation branch 3 times, most recently from e618e5c to 3ea0dbf Compare September 14, 2024 02:17
@carver carver changed the title [WIP] Continue the RecursiveFindContent query until the uTP transfer is complete and the content is validated Continue the RecursiveFindContent query until the uTP transfer is complete and the content is validated Sep 14, 2024
@carver carver marked this pull request as ready for review September 14, 2024 02:28
@carver carver force-pushed the rfc-run-till-validation branch 3 times, most recently from 876de63 to 74fb7ae Compare September 19, 2024 01:33
Copy link
Collaborator

@mrferris mrferris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, I see the vision. Good find on the unnecessary num_results abort.

Tested on a local glados instance and saw virtually same results as with trin deployed to main glados instance.

There are quite a few moving parts here, so the goal was to minimize the
number of functional changes in this refactor.

Overall: add a new query state for while the content is being validated.
Don't exit the query until validation is complete.

Some other changes along the way:

* Moving the bigendian conversion of connection ID

Note that when handling the FindContentQueryResponse::ConnectionId in
findcontent.rs, we don't do the bigendian conversion anymore. That has
been moved to earlier in the process, when handling the Content message,
to reduce the number of places that the conversion needs to be handled.

* The query trace has slightly different end-stage behavior

The received_from peer and the cancelled peers are not set in the query
until the content gets validated.

* Naturally, tests need to reflect the new query structure

For example:
- the query trace test needs to have the content be marked validated to
  progress
- update overlay service tests to use new Validating query step
- immediately mark content as verified, after marking it as received,
  during the test

* bugfix: the content search continues even if many peers don't have it

Formerly, the content query would exit out if too many peers replied
successfully (even if the responses were just more nodes to look up).

Best we can tell, the limit on the number of maximum results is
copy-pasta from the recursive-find-nodes logic. (Confirmation from
Mike)

Now, the query continues until the content is found. It doesn't track
the number of successful responses at all.

Finally, this includes a big simplification of validation logic, by
Milos (morph-dev).
@carver carver force-pushed the rfc-run-till-validation branch from 74fb7ae to a45e272 Compare September 20, 2024 21:53
@carver carver merged commit a45e272 into ethereum:master Sep 20, 2024
9 checks passed
@carver carver deleted the rfc-run-till-validation branch September 20, 2024 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants