Track replication lag in otel metrics and log warning when it gets too high #2031

Closed
KyleAMathews opened this issue Nov 22, 2024 · 3 comments · Fixed by #2043

Comments

@KyleAMathews
Contributor

We really need visibility into this.

@balegas balegas added this to the Production Readiness milestone Nov 22, 2024
@alco
Member

alco commented Nov 25, 2024

@KyleAMathews

@kevin-dp and I have been discussing the best way to measure the "replication lag". There are multiple options:

  • if we're interested in the amount of WAL that's kept around between the replication slot's starting LSN and the latest LSN in Postgres, we can track that by periodically querying Postgres for its current LSN and comparing that to the slot's LSN
  • alternatively, we could measure the difference between the commit time of the transaction that's being processed and the time it gets written to the shape log or served to a client. But here we'd have to also take into account the clock difference between Postgres and the Electric instance
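As a rough illustration of the first option, the byte lag is the distance between two LSNs. Postgres prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit WAL position). The sketch below assumes the two values have already been fetched, for example with `pg_current_wal_lsn()` and a slot's `confirmed_flush_lsn` from `pg_replication_slots`; the helper names `parse_lsn` and `replication_lag_bytes` are hypothetical and are not taken from Electric's code.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(current_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL between the slot's position and the server's current position."""
    return parse_lsn(current_lsn) - parse_lsn(slot_lsn)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(replication_lag_bytes("16/B374D848", "16/B374D000"))
```

Postgres can also do this subtraction server-side with `pg_wal_lsn_diff`, which avoids parsing LSNs in the client at all.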

Regarding the warning, can that be set up in Honeycomb? Or do we want a configuration option for Electric to specify when it needs to log? I imagine the threshold will differ between clients depending on their database write rate and other parameters.

@balegas
Contributor

balegas commented Nov 25, 2024

Why not do both? I think it is useful to know both the number of bytes pending in the WAL and the latency for Electric to write a transaction into the log.

@KyleAMathews
Contributor Author

Yeah both sound great. I didn't know we could get the actual time diff so didn't mention it but that'd be great to have as well.

Yeah, warnings from your otel collector are a lot more flexible; people can already set warnings from Postgres data directly. We can potentially add warnings directly in the Electric logs as well, but a good sequencing of work is to first gather the data and then later decide exactly how to communicate it.

kevin-dp added a commit that referenced this issue Nov 26, 2024
Fixes #2031.

- exports the replication lag in bytes as a metric to Prometheus
- also creates a span including the replication lag in milliseconds for
every transaction

### Note on clock drift

The replication lag in milliseconds may be affected by clock drift
between Electric and Postgres. This may occur because Electric and
Postgres may be running on different machines and we compare the
transaction's commit timestamp (generated by PG) to Electric's timestamp
at the time of writing the transaction to the shape log.
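
To make the clock drift caveat concrete, here is a minimal sketch of the time-based measurement. It is not the code from the referenced commit; the function name `replication_lag_ms` and the `clock_offset` parameter are assumptions for illustration. The offset stands for an estimate of how far the Electric host's clock runs ahead of the Postgres host's clock, and defaults to zero when no estimate is available.

```python
from datetime import datetime, timedelta, timezone


def replication_lag_ms(
    commit_time: datetime,
    processed_time: datetime,
    clock_offset: timedelta = timedelta(0),
) -> float:
    """Milliseconds between a transaction's commit and its processing.

    commit_time    -- commit timestamp reported by Postgres
    processed_time -- local timestamp when the transaction was written out
    clock_offset   -- assumed estimate of (local clock - Postgres clock)
    """
    lag = processed_time - commit_time - clock_offset
    # Dividing two timedeltas yields a float number of milliseconds.
    return lag / timedelta(milliseconds=1)


if __name__ == "__main__":
    commit = datetime(2024, 11, 26, 12, 0, 0, tzinfo=timezone.utc)
    processed = commit + timedelta(milliseconds=250)
    print(replication_lag_ms(commit, processed))
```

If the local clock is behind the Postgres clock and no offset is supplied, the result can be negative, so a consumer of this value may want to treat small negative readings as drift rather than as an error.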