electric crashing/restarting - possibly due to consumer with an array where filter #1911
Comments
I was helping @ericallam look at this. I found it odd too that all the consumers failed when it seems just one of them errored. This bug is stopping all shapes from updating. Even if there's a bug w/ where clause matching, ideally it'd just affect the one shape.
Their array where clause looks like this: WHERE "runTags" @> ARRAY['foobar']
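For reference, in Postgres itself `NULL @> ARRAY['foobar']` evaluates to NULL, and a WHERE clause only keeps rows where the predicate is exactly TRUE, so a row with a null "runTags" simply doesn't match. A minimal Python sketch of that three-valued behaviour follows. It is only an illustration: `array_contains` and `row_matches` are hypothetical names, not Electric's code, and NULL elements inside the array are not modelled.

```python
from typing import Optional, Sequence


def array_contains(column: Optional[Sequence[str]],
                   required: Sequence[str]) -> Optional[bool]:
    """Mimic SQL `column @> required` with three-valued logic.

    Returns None (SQL NULL, i.e. "unknown") when the column itself is null,
    instead of raising.
    """
    if column is None:
        return None
    # Containment: every required element must appear in the column.
    return set(required).issubset(column)


def row_matches(row: dict, required: Sequence[str]) -> bool:
    # A WHERE clause keeps a row only when the predicate is exactly TRUE,
    # so both FALSE and NULL mean "row is not part of the shape".
    return array_contains(row.get("runTags"), required) is True


print(row_matches({"runTags": ["foobar", "other"]}, ["foobar"]))
print(row_matches({"runTags": None}, ["foobar"]))
```

The first call prints `True` and the second prints `False`; the point is that a null column is a non-match, not an error.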
Just tried to recreate this issue locally and I'm not able to. Looking back at the logs above, I wonder if the following line is the cause, or just a consequence of the issue:
If one consumer fails, we don't move forward with the other consumers, to ensure the transaction is applied across all shape logs. It may not be necessary to restart all consumers, but we don't allow the replication stream to advance. From the logs, it seems that the consumers have shut down and are trying to reestablish their connection. It also looks like they are not succeeding because the database connection is gone. We'll try simulating some crash scenarios to see if we can reproduce the error.
Why would we want that? I'm struggling to think of a scenario where this would be a good thing... If a shape consumer fails, my assumption is there's a bug of some sort. If we have 99 good shapes and 1 bad shape, why not just kill the bad shape and let the others continue to update? The problem here is there's nothing Eric or any other Electric user can do to resolve an issue with a shape consumer. There are no alerts, and there's no way to remediate it. Eric only heard about the problem because a customer complained. One bug broke every other unrelated user on the system. And then there was nothing he could do other than kill the server. How I imagine this would work is that a failing shape consumer would spit out a big warning with diagnostic info and then get removed. If the same shape definition is then recreated and fails, say, 2 more times, we'd permanently block it from being created again until the bug is fixed and the server has been upgraded. We still don't know what happened here, of course, so I'm not saying the above is the fix for this exact issue. But in general, we want graceful degradation, where we keep as many things going as fast as possible, rather than halt-the-world approaches, which are far more disruptive.
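A rough Python sketch of the policy described in this comment, purely illustrative: `ShapeRegistry`, `MAX_FAILURES`, and the plain-callable consumers are hypothetical and are not Electric's API or its implementation. A failing consumer is logged and removed while the rest keep receiving transactions, and a shape definition that has failed three times in total is refused on re-registration.

```python
from collections import defaultdict
from typing import Callable, Dict

# Assumption: the first failure plus "say 2 more times" gives 3 strikes in total.
MAX_FAILURES = 3


class ShapeRegistry:
    """Hypothetical registry that isolates failing shape consumers."""

    def __init__(self) -> None:
        self.consumers: Dict[str, Callable[[dict], None]] = {}
        self.failures: Dict[str, int] = defaultdict(int)

    def register(self, shape_def: str, consumer: Callable[[dict], None]) -> bool:
        # Refuse shape definitions that have already failed too many times.
        if self.failures[shape_def] >= MAX_FAILURES:
            return False
        self.consumers[shape_def] = consumer
        return True

    def dispatch(self, txn: dict) -> None:
        # Iterate over a copy so failed consumers can be removed mid-loop.
        for shape_def, consumer in list(self.consumers.items()):
            try:
                consumer(txn)
            except Exception as exc:
                # Degrade gracefully: warn loudly, drop only the bad shape,
                # and keep delivering the transaction to the healthy ones.
                self.failures[shape_def] += 1
                del self.consumers[shape_def]
                print(f"WARNING: removed shape {shape_def!r} after error: {exc!r}")
```

In this sketch the replication stream would keep advancing for the healthy shapes; whether the removed shape's log can be safely rebuilt afterwards is a separate question that the sketch does not address.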
@KyleAMathews @balegas One other consideration is that from our perspective, pretty much the worst thing that the electric server could do is fail to keep up with the WAL and let the replication slot get behind, because eventually (in a few hours) our entire database will blow up because we run a sequential scan fairly frequently (yay queuing in postgres!) and sequential scan + xmin horizon lag + dead tuples not being vacuumed = database go 💥
Removing the problematic shapes is a good idea; that way we preserve integrity across the healthy shapes and have a better chance of not failing across the board. Absolutely, Electric needs to be able to keep up with the load from Postgres and recover from errors without lagging for too long. Note that you can safeguard against unexpected WAL growth by tuning the Postgres WAL size limits.
Closing this issue, as we now prevent issues in a single shape from propagating up the tree (#2009).
Our 0.7.5 electric server is crashing periodically, and I think it's related to a consumer with an array where clause, when that consumer encounters a change with the column being nil. Here are some logs to show what's happening: