You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We currently have two layers of goroutines hiding under filecoin.Retriever#Request(cid.Cid) that's called for each CID coming in from Bitswap that we don't have.
If we have candidates from the indexer for that CID, we:
Spawn a goroutine for filecoin.Retriever#retrieveFromBestCandidate()
Within that execution we call filecoin.Retriever#queryCandidates() which in turns spawns an additional goroutine for each candidate so we can call Filclient#RetrievalQueryToPeer() for it.
After we've finished the query phase, those query goroutines are ended and if we have one or more viable query responses we continue in the parent goroutine, calling Filclient#RetrieveContentFromPeerWithProgressCallback() in sequence for each candidate.
Among other things, these calls come back to Retriever#OnRetrievalEvent(). If it blocks because the EventManager is blocked, then that can hold up further Filclient retrieval events, which will block the parent Filclient calls, which in turn block our goroutines.
Our new async EventRecorder has a queuedReports that it slurps in from Retriever and will buffer in a slice. So if that's working well then the OnRetrievalEvent calls shouldn't block. But that just means we don't have any practical form of backpressure that applies through the system now.
As the number of CIDs with a positive result from the indexer increases, an Autoretrieve instance is going to have an equivalent increase in retrieval attempts—even if it's just failed query attempts, that's still costly. We're already hitting that with our instances with the added event reporting load (Ref: #112) but it'll get worse.
Ideal
Bitswap requests for CIDs should be rejected if we are currently "at capacity". Capacity can be defined as either actual maximum capacity of the instance (dictated by network, memory, and other resources), or by the owner in configuration options.
EventRecorder should block when it's at some pre-defined capacity which can be configured, the default could be ~100 or more (assuming we're able to batch-report, otherwise it should probably be lower).
Implement a retrieval event execution goroutine pool: the executors can either execute queries or full retrievals. Queries or retrievals are "actions" that can be queued, the executors pick off queued actions in order and runs them.
Action queue depth can be configured, the default could be in the order of ~100 or more.
Number of executors can be configured, the default could be in the order of ~20.
Calls to filecoin.Retriever#Request(cid.Cid) where the action queue is full should return with an appropriate error. Adding retrieval actions to the queue should not block or return an error - i.e. the queue could grow beyond full, but retrieval actions are a normal follow-up to query actions so they should be allowed through. The queue max enforcement should only be applied to entirely new retrievals.
The Bitswap provider should not continue to wait for the block if the call to Request() errored on a full queue (this is the current actions - if error from Request() return a dont_have and maybe log).
The text was updated successfully, but these errors were encountered:
Summary of current state
We currently have two layers of goroutines hiding under
filecoin.Retriever#Request(cid.Cid)
that's called for each CID coming in from Bitswap that we don't have.If we have candidates from the indexer for that CID, we:
filecoin.Retriever#retrieveFromBestCandidate()
filecoin.Retriever#queryCandidates()
which in turns spawns an additional goroutine for each candidate so we can callFilclient#RetrievalQueryToPeer()
for it.Filclient#RetrieveContentFromPeerWithProgressCallback()
in sequence for each candidate.Among other things, these calls come back to
Retriever#OnRetrievalEvent()
. If it blocks because theEventManager
is blocked, then that can hold up further Filclient retrieval events, which will block the parent Filclient calls, which in turn block our goroutines.Our new async
EventRecorder
has aqueuedReports
that it slurps in fromRetriever
and will buffer in a slice. So if that's working well then theOnRetrievalEvent
calls shouldn't block. But that just means we don't have any practical form of backpressure that applies through the system now.As the number of CIDs with a positive result from the indexer increases, an Autoretrieve instance is going to have an equivalent increase in retrieval attempts—even if it's just failed query attempts, that's still costly. We're already hitting that with our instances with the added event reporting load (Ref: #112) but it'll get worse.
Ideal
Bitswap requests for CIDs should be rejected if we are currently "at capacity". Capacity can be defined as either actual maximum capacity of the instance (dictated by network, memory, and other resources), or by the owner in configuration options.
EventRecorder
should block when it's at some pre-defined capacity which can be configured, the default could be ~100 or more (assuming we're able to batch-report, otherwise it should probably be lower).filecoin.Retriever#Request(cid.Cid)
where the action queue is full should return with an appropriate error. Adding retrieval actions to the queue should not block or return an error - i.e. the queue could grow beyond full, but retrieval actions are a normal follow-up to query actions so they should be allowed through. The queue max enforcement should only be applied to entirely new retrievals.Request()
errored on a full queue (this is the current actions - if error fromRequest()
return a dont_have and maybe log).The text was updated successfully, but these errors were encountered: