feat (sync-service): Prevent shape consumer errors from affecting other shapes #2009

robacourt · 2024-11-20T14:50:40Z

Fixes #1925. Errors that occur while consuming the replication stream and that relate to a specific shape now cause that shape's consumer to remove the shape and shut down, leaving the other shapes unaffected.

Errors can occur:

- In the selector, which runs on a process common to all shapes.
- On the consumer process.

This PR addresses both types of error. Before this PR, these errors would send the sync service into an infinite crash loop, and it would stop responding to HTTP requests.

netlify · 2024-11-20T14:51:45Z

✅ Deploy Preview for electric-next ready!

Name	Link
🔨 Latest commit	`49bec3a`
🔍 Latest deploy log	https://app.netlify.com/sites/electric-next/deploys/673f0d937ea7790008c41914
😎 Deploy Preview	https://deploy-preview-2009--electric-next.netlify.app

icehaunter

Good work, and good tests, but I did have 2 concerns.

icehaunter · 2024-11-21T08:06:40Z

packages/sync-service/lib/electric/shapes/consumer.ex

+  end
+
+  defp is_error?(reason) do
+    reason not in [:normal, :shutdown, {:shutdown, :truncate}, :killed]


Why is {:shutdown, :truncate} special-cased here? I think we should just handle {:shutdown, _} as a "normal" exit reason, as per OTP convention, no?

robacourt · 2024-11-21

Good idea. Done.
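For illustration, the classification agreed on in this thread could look roughly like the sketch below. The module name `ExitReasonSketch` and the public `error?/1` function are hypothetical; the PR's actual helper is the private `is_error?/1` in consumer.ex, and its final form is not shown in this conversation.

```elixir
defmodule ExitReasonSketch do
  # Hypothetical stand-in for the consumer's private helper.
  # Every {:shutdown, _} tuple is treated as a non-error exit,
  # following the OTP convention mentioned in the review.
  def error?(:normal), do: false
  def error?(:shutdown), do: false
  def error?({:shutdown, _detail}), do: false
  # :killed was already in the non-error list of the original diff.
  def error?(:killed), do: false
  # Anything else (exceptions, bad matches, custom reasons) counts as an error.
  def error?(_other), do: true
end
```

With this shape, `{:shutdown, :truncate}` needs no special case, and any future `{:shutdown, term}` reason is covered by the same clause.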

icehaunter · 2024-11-21T08:15:16Z

packages/sync-service/lib/electric/shapes/consumer.ex

+      reply_to_snapshot_waiters({:error, "Shape terminated before snapshot was ready"}, state)
+
+    if is_error?(reason) do
+      cleanup(state)


I wonder, is there a world where this call is slow? Calling this inside the terminate callback implies a time limit on this operation before the process is hard-killed, and that's if it's not hard-killed in the first place.

Elixir docs say:

terminate/2 is useful for cleanup that requires access to the GenServer's state. However, it is not guaranteed that terminate/2 is called when a GenServer exits. Therefore, important cleanup should be done using process links and/or monitors. A monitoring process will receive the same exit reason that would be passed to terminate/2.

So I guess my question is: is this good enough, or should we have a monitoring process that actually does the cleanup, to guarantee it gets done? I'm asking with the outlook that our storage may not be on local disk at some point.

robacourt · 2024-11-21

It's a good point. I'm wondering if this is good enough for now, because:

- The first and most important part of cleanup is deleting the shape_status, which should be quick. With this deleted, the other shapes can carry on happily.
- The second part of cleanup is deleting the shape from storage, which may be slow, but if this doesn't happen it'll just use more disk space than it needs to, which seems fine, especially as...
- Cleanup on terminate will only occur if we have a bug.
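The monitor-based alternative raised in this thread was not adopted in the PR, but a minimal sketch of it, under stated assumptions, might look like the following. `CleanupMonitorSketch`, `watch/3`, and the cleanup callback are all hypothetical names; only `GenServer` and `Process.monitor/1` are real APIs. The exit-reason classification mirrors the one discussed earlier and is likewise an assumption.

```elixir
defmodule CleanupMonitorSketch do
  use GenServer

  # Hypothetical API: start the monitor process.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Hypothetical API: watch `pid` and run `cleanup_fun` if it exits abnormally.
  def watch(monitor, pid, cleanup_fun)
      when is_pid(pid) and is_function(cleanup_fun, 1) do
    GenServer.call(monitor, {:watch, pid, cleanup_fun})
  end

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_call({:watch, pid, cleanup_fun}, _from, watched) do
    # The monitor reference is the key used to find the cleanup function later.
    ref = Process.monitor(pid)
    {:reply, :ok, Map.put(watched, ref, cleanup_fun)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, reason}, watched) do
    {cleanup_fun, watched} = Map.pop(watched, ref)
    # Cleanup runs in this process, so it is not bound by the
    # watched process's shutdown timeout.
    if cleanup_fun && error?(reason), do: cleanup_fun.(reason)
    {:noreply, watched}
  end

  # Assumed classification, mirroring the one discussed in the review.
  defp error?(:normal), do: false
  defp error?(:shutdown), do: false
  defp error?({:shutdown, _}), do: false
  defp error?(:killed), do: false
  defp error?(_), do: true
end
```

The trade-off is an extra process and a small window between the consumer exiting and the cleanup running, in exchange for cleanup that still happens when `terminate/2` is never called.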


…#2019) PR by @icehaunter and me - makes the `StackSupervisor` accept a stack event registry that it uses to dispatch status events about the state of the stack. This was preliminary work for multitenancy, and also fixes #1922 since now we hold connections when the stack is not ready, and release them when we receive a "ready event" or time them out with a 503 - avoids crashing the ETS inspector which was trying to use a DB connection from an uninitialised pool. Integration test is broken from #2009

alco · 2024-11-29T12:50:37Z

packages/sync-service/lib/electric/shapes/consumer.ex

-  defp selector(%Transaction{changes: changes}, shape) do
+  defp selector(event, shape) do
+    process_event?(event, shape)
+  rescue


It's almost always a bad idea to swallow errors like this.

If we know the kinds of errors that may be raised, we should match on them explicitly.

If we don't know what can be raised, we should either log the error explicitly before black-holing it or (when possible) let the process die naturally and let Elixir's built-in logging machinery kick in.
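As a sketch of the "log before black-holing" option, assuming the selector must not crash the shared process: the module name `SelectorSketch` and the `predicate` argument are hypothetical stand-ins for the PR's private `selector/2` and `process_event?/2`. Returning `true` on error follows the code comment quoted in this PR.

```elixir
defmodule SelectorSketch do
  require Logger

  # `predicate` stands in for the real process_event?/2 (an assumption).
  def selector(event, shape, predicate) when is_function(predicate, 2) do
    predicate.(event, shape)
  rescue
    error ->
      # Log the full exception and stacktrace instead of silently dropping it.
      Logger.error(
        "Selector failed for shape: " <>
          Exception.format(:error, error, __STACKTRACE__)
      )

      # Forward the event anyway so the shape's own consumer handles the failure.
      true
  end
end
```

If the set of possible exceptions were known, the `rescue` clause could instead match on those exception modules only, which is the first option suggested above.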

alco · 2024-11-29T12:59:07Z

packages/sync-service/lib/electric/shapes/consumer.ex

+    # Return `true` so the event is processed, which will then error
+    # for the same reason and cleanup the shape.


Could you elaborate on how this part happens? Does "error for the same reason" assume that similar code is executed when the event is processed by the consumer as in this process_event?() function? If that's the case, can we make it a single private function that is called in both places? Or at least explain it in a code comment to avoid the confusion I've just found myself in?
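One way to read the "single private function called in both places" suggestion is sketched below. Everything here is hypothetical: the module, the map-based event and shape, and the `where` function are illustrative stand-ins, not the sync service's real data structures.

```elixir
defmodule SharedPredicateSketch do
  require Logger

  # Hypothetical shared predicate. It raises if the shape's `where`
  # function raises, for example on a malformed filter.
  def affects_shape?(event, shape) do
    event.table == shape.table and shape.where.(event.value)
  end

  # Runs in the process shared by all shapes, so it must not crash.
  def selector(event, shape) do
    affects_shape?(event, shape)
  rescue
    error ->
      Logger.error("Selector error: " <> Exception.message(error))
      # Forward the event; the consumer will hit the same error below.
      true
  end

  # Runs in the shape's own consumer process, which is allowed to crash.
  def handle_event(event, shape) do
    if affects_shape?(event, shape), do: {:apply, event}, else: :skip
  end
end
```

Because both paths call `affects_shape?/2`, a shape that makes the selector raise also makes its own consumer raise on the same event, which is what triggers cleanup for that shape only.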

robacourt force-pushed the rob/shape-cleanup-on-error branch from 0539459 to 1d1b5e0 Compare November 20, 2024 16:59

icehaunter approved these changes Nov 21, 2024


robacourt added 11 commits November 21, 2024 10:34

Error in handle_events will clean up shape

006ed70

Simplify test

86e9a97

Remove unnecessary assignment

727a439

Swallow selector errors

2bb7b2d

Refactor to simplify do_handle_events to handle_event

91ac52d

Make consumer crashing stops the affected consumer and clean the affected shape

60a65ac

Remove unnecessary try rescue block

11ae029

Add changeset

d3a4762

Revert unnecessary change

6a7a519

Add test to make sure normal shutdown doesn't clear shape

615a2ca

Make all {:shutdown, _} reasons mean non-error

49bec3a

robacourt force-pushed the rob/shape-cleanup-on-error branch from bae5608 to 49bec3a Compare November 21, 2024 10:38

robacourt merged commit 598aa28 into main Nov 21, 2024
25 of 26 checks passed

robacourt deleted the rob/shape-cleanup-on-error branch November 21, 2024 10:56

msfstef mentioned this pull request Nov 21, 2024

feat: Add global stack event registry and block requests before ready #2019

Merged

This was referenced Nov 24, 2024

Shapes with malformed WHERE clauses should be removed #1926

Closed

electric crashing/restarting - possibly due to consumer with an array where filter #1911

Closed

alco reviewed Nov 29, 2024

