feat (sync-service): Prevent shape consumer errors from affecting other shapes #2009
Conversation
Force-pushed from 0539459 to 1d1b5e0
Good work, and good tests, but I did have 2 concerns
```elixir
end

defp is_error?(reason) do
  reason not in [:normal, :shutdown, {:shutdown, :truncate}, :killed]
```
Why is `{:shutdown, :truncate}` special-cased here? I think we should just handle `{:shutdown, _}` as a "normal" exit code, as per OTP convention, no?
Good idea. Done.
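(For reference, a minimal sketch of the suggested change, treating any `{:shutdown, _}` tuple as a normal exit; the exact final form is in the PR diff, this is just the idea:)

```elixir
# Treat the conventional OTP "normal" exit reasons as non-errors,
# including any {:shutdown, term} tuple; everything else is an error.
defp is_error?(:normal), do: false
defp is_error?(:shutdown), do: false
defp is_error?({:shutdown, _}), do: false
defp is_error?(:killed), do: false
defp is_error?(_reason), do: true
```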
```elixir
reply_to_snapshot_waiters({:error, "Shape terminated before snapshot was ready"}, state)

if is_error?(reason) do
  cleanup(state)
```
I wonder, is there a world where this call is slow? Calling this inside the `terminate` callback implies a time limit on the operation before the process is hard-killed, and that's assuming it isn't hard-killed in the first place.
The Elixir docs say:
> terminate/2 is useful for cleanup that requires access to the GenServer's state. However, it is not guaranteed that terminate/2 is called when a GenServer exits. Therefore, important cleanup should be done using process links and/or monitors. A monitoring process will receive the same exit reason that would be passed to terminate/2.
So I guess my question is: is this good enough, or should we have a monitoring process that actually does the cleanup, to guarantee it gets done? Especially with the outlook that our storage may not be on local disk at some point.
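(For illustration, a rough sketch of the monitor-based alternative being suggested; module and function names here are made up for the example, not taken from this PR:)

```elixir
defmodule ShapeCleaner do
  use GenServer

  # Illustrative only: a process that monitors shape consumers and performs
  # cleanup when one of them exits abnormally.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def watch(consumer_pid, shape_id),
    do: GenServer.cast(__MODULE__, {:watch, consumer_pid, shape_id})

  def init(_opts), do: {:ok, %{}}

  def handle_cast({:watch, pid, shape_id}, refs) do
    ref = Process.monitor(pid)
    {:noreply, Map.put(refs, ref, shape_id)}
  end

  def handle_info({:DOWN, ref, :process, _pid, reason}, refs) do
    {shape_id, refs} = Map.pop(refs, ref)

    # The monitoring process receives the same exit reason that terminate/2
    # would have been given, so the cleanup runs even if terminate/2 never does.
    case {shape_id, reason} do
      {nil, _} -> :ok
      {_, :normal} -> :ok
      {_, :shutdown} -> :ok
      {_, {:shutdown, _}} -> :ok
      {id, _abnormal} -> cleanup_shape(id)
    end

    {:noreply, refs}
  end

  # Stub standing in for the real storage/shape_status cleanup.
  defp cleanup_shape(_shape_id), do: :ok
end
```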
It's a good point. I'm wondering if this is good enough for now, because:
- The first and most important part of cleanup is deleting the shape_status which should be quick. With this deleted the other shapes can carry on happily.
- The second part of cleanup is deleting the shape from storage, which may be slow, but if this doesn't happen it'll just use more disk space than it needs to, which seems fine especially as...
- Cleanup on terminate will only occur if we have a bug
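To make that ordering concrete, a rough sketch of a `cleanup/1` structured along those lines (module and function names here are assumptions for illustration, not the actual Electric API):

```elixir
defp cleanup(state) do
  # 1. Remove the shape from shape_status first: this is the quick part, and
  #    once it is gone the other shapes can carry on unaffected.
  ShapeStatus.remove_shape(state.shape_status, state.shape_handle)

  # 2. Delete the shape's data from storage: potentially slow, but if it is
  #    skipped the only cost is some extra disk usage.
  Storage.cleanup!(state.storage)
end
```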
Force-pushed from bae5608 to 49bec3a
…#2019) PR by @icehaunter and me:
- makes the `StackSupervisor` accept a stack event registry that it uses to dispatch status events about the state of the stack. This was preliminary work for multitenancy, and it also fixes #1922, since we now hold connections when the stack is not ready and release them when we receive a "ready" event, or time them out with a 503
- avoids crashing the ETS inspector, which was trying to use a DB connection from an uninitialised pool

Integration test is broken from #2009.
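A hedged sketch of the event-registry pattern that commit describes, using Elixir's built-in `Registry` for dispatch (the registry name, event key, and timeout are assumptions for illustration, not the actual Electric API):

```elixir
# Interested processes subscribe to stack status events under a well-known key.
{:ok, _} = Registry.start_link(keys: :duplicate, name: StackEvents)
Registry.register(StackEvents, :stack_status, nil)

# The stack supervisor (or connection manager) dispatches a ready event to all subscribers.
Registry.dispatch(StackEvents, :stack_status, fn subscribers ->
  for {pid, _value} <- subscribers, do: send(pid, {:stack_status, :ready})
end)

# A request handler can then hold the connection until the stack is ready,
# or give up with a 503 after a timeout.
receive do
  {:stack_status, :ready} -> :proceed
after
  5_000 -> :service_unavailable
end
```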
```diff
-defp selector(%Transaction{changes: changes}, shape) do
+defp selector(event, shape) do
+  process_event?(event, shape)
+rescue
```
It's almost always a bad idea to swallow errors like this.
If we know what kinds of errors may be raised, we should match on them explicitly.
If we don't know what can be raised, we should either log the error explicitly before black-holing it or, when possible, let the process die naturally and let Elixir's built-in logging machinery kick in.
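A minimal sketch of the second option (log before returning a fallback), assuming `require Logger` at the module level; the fallback value and message text are illustrative:

```elixir
defp selector(event, shape) do
  process_event?(event, shape)
rescue
  error ->
    # Log the error and stacktrace instead of silently swallowing it, then
    # still return `true` so the consumer sees the event and fails visibly.
    Logger.error("selector failed: " <> Exception.format(:error, error, __STACKTRACE__))
    true
end
```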
```elixir
# Return `true` so the event is processed, which will then error
# for the same reason and cleanup the shape.
```
Could you elaborate on how this part happens? Does "error for the same reason" assume that the same code runs when the consumer processes the event as in this `process_event?()` function? If that's the case, can we make it a single private function that is called in both places? Or at least explain it in a code comment, to avoid the confusion I've just found myself in?
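For reference, a rough sketch of the "single private function" suggestion (the `Shape.affects?/2` call is an illustrative stand-in for whatever the matching logic actually is):

```elixir
# The one place that decides whether an event is relevant to a shape. Both the
# registry selector and the consumer call it, so an exception raised while
# matching surfaces "for the same reason" in both code paths.
defp process_event?(event, shape), do: Shape.affects?(shape, event)

defp selector(event, shape) do
  process_event?(event, shape)
rescue
  # If matching raises here, return `true` so the consumer re-runs the same
  # check, hits the same error, and cleans up the shape.
  _error -> true
end
```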
Fixes #1925. Errors that occur while consuming the replication stream and that relate to a specific shape now cause that shape's consumer to remove the shape and shut down, leaving the other shapes unaffected.
Errors can occur:
This PR addresses both types of error. Prior to this PR, these errors would send the sync service into an infinite crash loop, and it would stop responding to HTTP requests.
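As an illustration of that behaviour (a hedged sketch, not the actual consumer code; `apply_event/2` is a made-up stand-in), a per-shape consumer can turn a shape-specific failure into a local stop, so only its own shape is removed while sibling consumers keep running:

```elixir
def handle_info({:handle_event, event}, state) do
  case apply_event(event, state) do
    {:ok, state} ->
      {:noreply, state}

    {:error, reason} ->
      # Remove only this shape and stop this consumer. Sibling shape
      # consumers under the same supervisor are untouched.
      cleanup(state)
      {:stop, {:shutdown, {:shape_error, reason}}, state}
  end
end
```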