
Expand metrics on pg_stat_replication to include lag expressed as time. #115

Open · wants to merge 10 commits into base: main
Conversation

@ahjmorton commented on Apr 28, 2022:

Expands the metrics exposed from pg_stat_replication to include lag as reported from the wal sender perspective. Also adds a collector for pg_stat_wal_receiver for monitoring from the standby side.

The following examples are from a locally running streaming-replication setup, collected by calling postgres_exporter:

Primary:

# HELP postgres_stat_replication_flush_lag_seconds flush_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_flush_lag_seconds gauge
postgres_stat_replication_flush_lag_seconds{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0.002844
# HELP postgres_stat_replication_lag_bytes delay in bytes pg_wal_lsn_diff(pg_current_wal_lsn(), replay_location)
# TYPE postgres_stat_replication_lag_bytes gauge
postgres_stat_replication_lag_bytes{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0
# HELP postgres_stat_replication_replay_lag_seconds replay_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_replay_lag_seconds gauge
postgres_stat_replication_replay_lag_seconds{application_name="walreceiver",client_addr="172.28.0.3",state="streaming",sync_state="sync"} 0.003178
# HELP postgres_stat_replication_write_lag_seconds write_lag as reported by the pg_stat_replication view converted to seconds
# TYPE postgres_stat_replication_write_lag_seconds gauge

Standby:

# HELP postgres_wal_receiver_replay_lag_bytes delay in standby wal replay bytes pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())::float
# TYPE postgres_wal_receiver_replay_lag_bytes gauge
postgres_wal_receiver_replay_lag_bytes{status="streaming"} 0
# HELP postgres_wal_receiver_replay_lag_seconds delay in standby wal replay seconds EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
# TYPE postgres_wal_receiver_replay_lag_seconds gauge
postgres_wal_receiver_replay_lag_seconds{status="streaming"} 0

References

@ahjmorton force-pushed the add-replication-lag-time branch 2 times, most recently from 09b98a9 to 44495fa on April 28, 2022 08:19
@ahjmorton force-pushed the add-replication-lag-time branch from 44495fa to 103ac73 on April 28, 2022 08:40
@rnaveiras rnaveiras self-requested a review April 28, 2022 09:22
@rnaveiras rnaveiras self-assigned this Apr 28, 2022
@rnaveiras (Owner) commented:
hey @ahjmorton

Firstly, thank you so much for this. I will review the PR in detail in the next few days.

I think you can drop support for Postgres versions older than 10, as 9.6 has reached end of life. See https://www.postgresql.org/support/versioning/

That will make the changes a bit easier, I think

@ahjmorton force-pushed the add-replication-lag-time branch from 5c3f6cf to 4aaf535 on April 28, 2022 15:40
@ahjmorton ahjmorton marked this pull request as ready for review April 28, 2022 15:55
@@ -83,17 +89,24 @@ func (c *statReplicationScraper) Scrape(ctx context.Context, conn *pgx.Conn, ver
var applicationName, state, syncState string
var clientAddr net.IP
var pgXlogLocationDiff float64
/* When querying pg_stat_replication it may be that we don't have
@rnaveiras (Owner) commented on the diff:
@ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?

Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with

https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics

@ahjmorton (Author) replied:

> @ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?

If the WAL sender process has had no activity for some period of time (not sure how long; might need to look at the source), Postgres nulls out those "*_lag" fields.

> Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with

I was trying to avoid reporting metrics for nodes that don't have a WAL sender process; however, those are already eliminated by the query, as they wouldn't have a row in pg_stat_replication.
Going to change this to default to zero in case of null.

@ttamimi left a comment:

I tested this branch on my local machine with a streaming replication setup, and I loaded it with tens of thousands of records of test data. I can confirm that the five new metrics introduced in this PR all appeared for me with non-zero values.

Metrics from the primary PostgreSQL instance

postgres_stat_replication_flush_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}
postgres_stat_replication_replay_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}
postgres_stat_replication_write_lag_seconds{
  application_name="walreceiver",
  client_addr="<nil>",
  instance="localhost:9187",
  job="postgresql",
  state="streaming",
  sync_state="async"
}

Metrics from the replica

postgres_wal_receiver_replay_lag_bytes{
  instance="localhost:9188",
  job="postgresql",
  status="streaming"
}
postgres_wal_receiver_replay_lag_seconds{
  instance="localhost:9188",
  job="postgresql",
  status="streaming"
}

/* When pg_basebackup is running in stream mode, it opens a second connection
to the server and starts streaming the transaction log in parallel while
running the backup. In both connections (state=backup and state=streaming) the
pg_log_location_diff is null and needs to be excluded */

Where is pg_log_location_diff referenced? We aren't using it anywhere in this code, are we? Maybe a third party library? It is definitely getting queried though. I tested the exporter with a streaming replication setup on my machine and I saw the following errors in the logs of the primary PostgreSQL instance:

2022-06-14 10:32:45.934 BST [94070] STATEMENT:  select * from pg_log_location_diff;
2022-06-14 10:32:47.403 BST [94070] ERROR:  relation "pg_log_location_diff" does not exist at character 15

Even Google has no clue what pg_log_location_diff is!!

@ahjmorton (Author) commented:
Hey @rnaveiras and @ttamimi. I'm happy to leave this PR open and work on getting the metrics in there. Reckon it's worth me taking a look? Realise it's been a long time.


3 participants