Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat (sync-service): Prevent shape consumer errors from affecting other shapes #2009

Merged
merged 11 commits into from
Nov 21, 2024

Conversation

robacourt
Copy link
Contributor

@robacourt robacourt commented Nov 20, 2024

Fixes #1925 . Errors that occur while consuming the replication stream that are to do with a specific shape, cause that shape's consumer to remove the shape and shut down, leaving the other shapes unaffected.

Errors can occur:

  1. In the selector which happens on a process common to all shapes.
  2. On the consumer process.

This PR addresses both types of error. Previous to this PR, these errors would cause the sync service to get into an infinite crash loop and would stop responding to HTTP requests.

Copy link

netlify bot commented Nov 20, 2024

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 49bec3a
🔍 Latest deploy log https://app.netlify.com/sites/electric-next/deploys/673f0d937ea7790008c41914
😎 Deploy Preview https://deploy-preview-2009--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@robacourt robacourt force-pushed the rob/shape-cleanup-on-error branch from 0539459 to 1d1b5e0 Compare November 20, 2024 16:59
Copy link
Contributor

@icehaunter icehaunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, and good tests, but I did have 2 concerns

end

defp is_error?(reason) do
reason not in [:normal, :shutdown, {:shutdown, :truncate}, :killed]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is {:shutdown, :truncate} special-cased here? I think we should just handler {:shutdown, _} as "normal" exit code as per OTP convention, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Done.

reply_to_snapshot_waiters({:error, "Shape terminated before snapshot was ready"}, state)

if is_error?(reason) do
cleanup(state)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder, is there a world where this call is slow? Calling this inside the terminate callback implies a time limit on this operation before the process is hard-killed, and that's if it's not hard-killed in the first place.

Elixir docs say:

terminate/2 is useful for cleanup that requires access to the GenServer's state. However, it is not guaranteed that terminate/2 is called when a GenServer exits. Therefore, important cleanup should be done using process links and/or monitors. A monitoring process will receive the same exit reason that would be passed to terminate/2.

so I guess my question is: is this good enough, or should we have a monitoring process that actually does the cleanup to guarantee it being done? With the outlook that it's possible that our storage may not be on local disk at some point

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point. I'm wondering if this is good enough for now, because:

  1. The first and most important part of cleanup is deleting the shape_status which should be quick. With this deleted the other shapes can carry on happily.
  2. The second part of cleanup is deleting the shape from storage, which may be slow, but if this doesn't happen it'll just use more disk space than it needs to, which seems fine especially as...
  3. Cleanup on terminate will only occur if we have a bug

@robacourt robacourt force-pushed the rob/shape-cleanup-on-error branch from bae5608 to 49bec3a Compare November 21, 2024 10:38
@robacourt robacourt merged commit 598aa28 into main Nov 21, 2024
25 of 26 checks passed
@robacourt robacourt deleted the rob/shape-cleanup-on-error branch November 21, 2024 10:56
msfstef added a commit that referenced this pull request Nov 21, 2024
…#2019)

PR by @icehaunter and me - makes the `StackSupervisor` accept a stack
event registry that it uses to dispatch status events about the state of
the stack.

This was preliminary work for multitenancy, and also fixes
#1922 since now we hold
connections when the stack is not ready, and release them when we
receive a "ready event" or time them out with a 503 - avoids crashing
the ETS inspector which was trying to use a DB connection from an
uninitialised pool.


Integration test is broken from
#2009
defp selector(%Transaction{changes: changes}, shape) do
defp selector(event, shape) do
process_event?(event, shape)
rescue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's almost always a bad idea to swallow errors like this.

If we know that kinds of errors that may be raised, we should match on them explicitly.

If we don't know what can be raised, we should either log the error explicitly before black-holing it or (when possible) let the process die naturally and let Elixir's builtin logging machinery kick in.

Comment on lines +98 to +99
# Return `true` so the event is processed, which will then error
# for the same reason and cleanup the shape.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on how this part happens? Does "error for the same reason" assume that similar code is executed when the event is processed by the consumer as in this process_event?() function? If that's the case, can we make it a single private function that is called in both places? Or at least explain it in a code comment to avoid the confusion I've just found myself in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove a shape if anything goes wrong with it
3 participants