-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand metrics on pg_stat_replication to include lag expressed as time. #115
base: main
Are you sure you want to change the base?
Conversation
09b98a9
to
44495fa
Compare
44495fa
to
103ac73
Compare
hey @ahjmorton Firstly, thank you so much for this. I will review the PR in detail in the next few days. I think you can drop support for Postgres older than 10, as 9.6 reached the end of life. See https://www.postgresql.org/support/versioning/ That will make the changes a bit easier, I think |
5c3f6cf
to
4aaf535
Compare
collector/stat_replication.go
Outdated
@@ -83,17 +89,24 @@ func (c *statReplicationScraper) Scrape(ctx context.Context, conn *pgx.Conn, ver | |||
var applicationName, state, syncState string | |||
var clientAddr net.IP | |||
var pgXlogLocationDiff float64 | |||
/* When querying pg_stat_replication it may be that we don't have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?
Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with
https://prometheus.io/docs/practices/instrumentation/#avoid-missing-metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahjmorton I didn't have time to go into details here, but could you share more details about these metrics disappearing?
If the WAL sender process didn't have an activity for some period of time (not sure how long, might need to look at source) then Postgres would null out those "*_lag" fields.
Later you make a conditional have a value to report these metrics. We should avoid that, as metrics that disappear from Prometheus are complex to deal with
I was trying to avoid reporting metrics for nodes that didn't have a WAL sender process however those would be eliminated by the query as they wouldn't have a row in pg_stat_replication
.
Going to change this to default to zero in case of null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this branch on my local machine with a streaming replication setup, and I loaded it with tens of thousands of records of test data. I can confirm that the five new metrics introduced in this PR all appeared for me with non-zero values.
Metrics from the primary PostgreSQL instance
postgres_stat_replication_flush_lag_seconds{
application_name="walreceiver",
client_addr="<nil>",
instance="localhost:9187",
job="postgresql",
state="streaming",
sync_state="async"
}
postgres_stat_replication_replay_lag_seconds{
application_name="walreceiver",
client_addr="<nil>",
instance="localhost:9187",
job="postgresql",
state="streaming",
sync_state="async"
}
postgres_stat_replication_write_lag_seconds{
application_name="walreceiver",
client_addr="<nil>",
instance="localhost:9187",
job="postgresql",
state="streaming",
sync_state="async"
}
Metrics from the replica
postgres_wal_receiver_replay_lag_bytes{
instance="localhost:9188",
job="postgresql",
status="streaming"
}
postgres_wal_receiver_replay_lag_seconds{
instance="localhost:9188",
job="postgresql",
status="streaming"
}
/* When pg_basebackup is running in stream mode, it opens a second connection | ||
to the server and starts streaming the transaction log in parallel while | ||
running the backup. In both connections (state=backup and state=streaming) the | ||
pg_log_location_diff is null and it requires to be excluded */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is pg_log_location_diff
referenced? We aren't using it anywhere in this code, are we? Maybe a third party library? It is definitely getting queried though. I tested the exporter with a streaming replication setup on my machine and I saw the following errors in the logs of the primary PostgreSQL instance:
2022-06-14 10:32:45.934 BST [94070] STATEMENT: select * from pg_log_location_diff;
2022-06-14 10:32:47.403 BST [94070] ERROR: relation "pg_log_location_diff" does not exist at character 15
Even Google has no clue what pg_log_location_diff
is!!
Hey @rnaveiras and @ttamimi . I'm happy to leave this PR open and work on getting the metrics in there. Reckon it's worth me taking a look? Realise it's been a long time |
Expands the metrics exposed from
pg_stat_replication
to include lag as reported from the wal sender perspective. Also adds a collector forpg_stat_wal_receiver
for monitoring from the standby side.The following examples are from a locally running streaming replica setup and calling
postgres_exporter
Primary:
Standby:
References