State resets are hard to recover from #412
This has been encountered in the wild.
This can cause UTDs (unable-to-decrypt errors) if it happens in an E2EE room.
This is ultimately a Synapse bug, but the proxy may be able to mitigate the worst of this. There are a few heuristics that can be applied:
For potentially state-reset rooms:
This means there are two tasks:
I've mentioned a detection threshold of 24h. If this is too high then we won't catch all state resets. If this is too low then we'll catch clock-skewed HSes and cause additional traffic on the HS to re-query state. We should begin tracking every state update that breaks temporal causality (that is, the update has a lower timestamp than the state being replaced), so we can monitor what value would be appropriate; see the sketch below.

We need to protect against race conditions on the …

Finally, we need to ensure that a persistently clock-skewed HS (malicious or not) cannot cause a DoS on the proxy. This may involve dumping updates for that room into a temporary holding area. A true state reset will be sent to all pollers, meaning we MUST de-duplicate the work.
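Below is a minimal Go sketch of what the temporal-causality check could look like. Everything here (the StateEvent type, breaksTemporalCausality, and the 24h constant) is an illustrative assumption rather than the proxy's actual code; the idea is simply to compare the origin_server_ts of the replacement state event against the event it replaces.

```go
package main

import (
	"fmt"
	"time"
)

// Detection threshold discussed above. Too high and we miss state resets;
// too low and we flag merely clock-skewed homeservers. Monitoring real
// causality violations should inform the final value.
const stateResetThreshold = 24 * time.Hour

// StateEvent is a minimal stand-in for a Matrix state event (an assumption,
// not the proxy's real data model).
type StateEvent struct {
	Type           string
	StateKey       string
	EventID        string
	OriginServerTS time.Time
}

// breaksTemporalCausality reports whether replacing `current` with `incoming`
// moves that (type, state_key) pair backwards in time beyond the threshold.
func breaksTemporalCausality(current, incoming StateEvent) bool {
	return current.OriginServerTS.Sub(incoming.OriginServerTS) > stateResetThreshold
}

func main() {
	now := time.Now()
	current := StateEvent{Type: "m.room.member", StateKey: "@alice:example.org", EventID: "$new", OriginServerTS: now}
	// Incoming state is two days older than what it replaces: suspicious.
	incoming := StateEvent{Type: "m.room.member", StateKey: "@alice:example.org", EventID: "$old", OriginServerTS: now.Add(-48 * time.Hour)}

	if breaksTemporalCausality(current, incoming) {
		// In the proxy, this is where the update could be logged and/or
		// parked in a holding area instead of being applied, and the room's
		// state re-queried from the HS.
		fmt.Printf("possible state reset: %s is %s older than %s\n",
			incoming.EventID, current.OriginServerTS.Sub(incoming.OriginServerTS), current.EventID)
	}
}
```

On the de-duplication point: since a true state reset reaches every poller, keying the "re-query state for this room" work on the room ID (for example via something like golang.org/x/sync/singleflight) could collapse the concurrent checks into a single request, though that is only one possible approach.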
This issue is using the term "state reset" to refer to the situation where Synapse recalculates room state incorrectly and sends very old room state down /sync. The proxy is a Matrix client like any other (e.g. Element Web), which means it is vulnerable to state resets like any other client. Unlike those clients, however, which can just "clear cache and reload", there is no mechanism for the proxy to recover from a state-reset room.
This can manifest as rooms spontaneously appearing or disappearing based on historical state. This is made worse because you can't just do an initial sync on the affected clients and have them self-heal, because this is the typical failure mode:
A native implementation would not have this problem because it does not rely on /sync v2's state calculations.