
Expand metrics on pg_stat_replication to include lag expressed as time. #115

Open · wants to merge 10 commits into base: main
Conversation

@ahjmorton commented on Apr 28, 2022:

Expands the metrics exposed from pg_stat_replication to include lag as reported from the wal sender perspective. Also adds a collector for pg_stat_wal_receiver for monitoring from the standby side.

The following examples are from a locally running streaming-replication setup, collected by calling postgres_exporter:

Primary:

# HELP postgres_stat_replication_flush_lag_seconds flush_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_flush_lag_seconds gauge
postgres_stat_replication_flush_lag_seconds{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0.002844
# HELP postgres_stat_replication_lag_bytes delay in bytes pg_wal_lsn_diff(pg_current_wal_lsn(), replay_location)
# TYPE postgres_stat_replication_lag_bytes gauge
postgres_stat_replication_lag_bytes{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0
# HELP postgres_stat_replication_replay_lag_seconds replay_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_replay_lag_seconds gauge
postgres_stat_replication_replay_lag_seconds{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0.003178
# HELP postgres_stat_replication_write_lag_seconds write_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_write_lag_seconds gauge

Standby:

# HELP postgres_wal_receiver_replay_lag_bytes delay in standby wal replay bytes pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())::float
# TYPE postgres_wal_receiver_replay_lag_bytes gauge
postgres_wal_receiver_replay_lag_bytes{status="streaming"} 0
# HELP postgres_wal_receiver_replay_lag_seconds delay in standby wal replay seconds EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
# TYPE postgres_wal_receiver_replay_lag_seconds gauge
postgres_wal_receiver_replay_lag_seconds{status="streaming"} 0

References

@ahjmorton force-pushed the add-replication-lag-time branch 2 times, most recently from 09b98a9 to 44495fa on April 28, 2022 08:19
@ahjmorton force-pushed the add-replication-lag-time branch from 44495fa to 103ac73 on April 28, 2022 08:40
@rnaveiras rnaveiras self-requested a review April 28, 2022 09:22
@rnaveiras rnaveiras self-assigned this Apr 28, 2022
@rnaveiras (Owner) commented:
hey @ahjmorton

Firstly, thank you so much for this. I will review the PR in detail in the next few days.

I think you can drop support for Postgres versions older than 10, as 9.6 has reached end of life. See https://www.postgresql.org/support/versioning/

That will make the changes a bit easier, I think

@ahjmorton force-pushed the add-replication-lag-time branch from 5c3f6cf to 4aaf535 on April 28, 2022 15:40
@ahjmorton ahjmorton marked this pull request as ready for review April 28, 2022 15:55
@@ -83,17 +89,24 @@ func (c *statReplicationScraper) Scrape(ctx context.Context, conn *pgx.Conn, ver
var applicationName, state, syncState string
var clientAddr net.IP
var pgXlogLocationDiff float64
/* When querying pg_stat_replication it may be that we don't have
@rnaveiras (Owner) commented on the diff:
@ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?

Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with

https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

@ahjmorton (Author) replied:

> @ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?

If the WAL sender process has had no activity for some period of time (not sure how long; might need to look at the source), Postgres nulls out those "*_lag" fields.

> Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with

I was trying to avoid reporting metrics for nodes that don't have a WAL sender process; however, those are already eliminated by the query, as they wouldn't have a row in pg_stat_replication.
Going to change this to default to zero in case of null.

@ttamimi left a comment:

I tested this branch on my local machine with a streaming replication setup, and I loaded it with tens of thousands of records of test data. I can confirm that the five new metrics introduced in this PR all appeared for me with non-zero values.

Metrics from the primary PostgreSQL instance

postgres_stat_replication_flush_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}
postgres_stat_replication_replay_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}
postgres_stat_replication_write_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}

Metrics from the replica

postgres_wal_receiver_replay_lag_bytes{
  instance="localhost:9188",
  job="postgresql",
  status="streaming"
}
postgres_wal_receiver_replay_lag_seconds{
  instance="localhost:9188",
  job="postgresql",
  status="streaming"
}

/* When pg_basebackup is running in stream mode, it opens a second connection
to the server and starts streaming the transaction log in parallel while
running the backup. In both connections (state=backup and state=streaming) the
pg_log_location_diff is null and needs to be excluded */

Where is pg_log_location_diff referenced? We aren't using it anywhere in this code, are we? Maybe a third party library? It is definitely getting queried though. I tested the exporter with a streaming replication setup on my machine and I saw the following errors in the logs of the primary PostgreSQL instance:

2022-06-14 10:32:45.934 BST [94070] STATEMENT:  select * from pg_log_location_diff;
2022-06-14 10:32:47.403 BST [94070] ERROR:  relation "pg_log_location_diff" does not exist at character 15

Even Google has no clue what pg_log_location_diff is!!

@ahjmorton (Author) commented:
Hey @rnaveiras and @ttamimi. I'm happy to leave this PR open and work on getting the metrics in there. Reckon it's worth me taking a look? Realise it's been a long time.


3 participants