Fix VaultAutopilotHealthy alert name and time window in alert descriptions #1157

Closed
wants to merge 4 commits into from
2 changes: 1 addition & 1 deletion charts/vault-monitoring/Chart.yaml
@@ -2,7 +2,7 @@ apiVersion: v2
name: vault-monitoring
description: monitor your vault server from within Kubernetes' prometheus
type: application
-version: 0.3.1
+version: 0.4.0
home: https://github.com/adfinis/helm-charts/tree/main/charts/vault-monitoring
sources:
- https://github.com/adfinis/helm-charts/tree/main/charts/vault-monitoring
2 changes: 1 addition & 1 deletion charts/vault-monitoring/README.md

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 5 additions & 7 deletions charts/vault-monitoring/values.yaml
@@ -54,9 +54,7 @@ prometheusRules:
# -- set of prometheus alerts to define
# @default -- list of predefined alerts
rules:
-- alert: VaultAutopilotNodeHealthy
-# Set to 1 if Autopilot considers all nodes healthy
-# https://www.vaultproject.io/docs/internals/telemetry#integrated-storage-raft-autopilot
+- alert: VaultAutopilotHealthy
expr: vault_autopilot_healthy < 1
for: 1m
labels:
@@ -95,28 +93,28 @@ prometheusRules:
severity: critical
annotations:
summary: High frequency of failed Vault requests
-description: There has been an increased number of failed Vault requests in the last 15 minutes
+description: There has been an increased number of failed Vault requests in the last 20 minutes
- alert: VaultResponseFailures
expr: increase(vault_audit_log_response_failure[5m]) > 0
for: 15m
labels:
severity: critical
annotations:
summary: High frequency of failed Vault responses
-description: There has been an increased number of failed Vault responses in the last 15 minutes
+description: There has been an increased number of failed Vault responses in the last 20 minutes
Member

How has this changed? It's still for: 15m

Contributor

might be because 15m on the check + 5m in the query add up to 20m?
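
The arithmetic in the reply above can be checked with a short sketch. It assumes the description is meant to cover the whole span of samples that can contribute to a firing alert: the `[5m]` range window of the first evaluation plus the `for: 15m` pending period. The helper names `parse_duration` and `covered_window_minutes` are hypothetical and handle only simple Prometheus-style durations; this is an illustration, not part of the chart.

```python
import re

# Seconds per unit for simple Prometheus-style durations (e.g. "5m", "1h30m").
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text: str) -> float:
    """Convert a duration string such as '15m' or '1h30m' to seconds."""
    parts = re.findall(r"(\d+)(ms|[smhdwy])", text)
    # Reject input that contains anything other than number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def covered_window_minutes(range_window: str, for_duration: str) -> float:
    """Span, in minutes, between the oldest sample the first evaluation
    could have seen and the moment the alert fires."""
    return (parse_duration(range_window) + parse_duration(for_duration)) / 60


# Values taken from the rules in this diff: increase(...[5m]) with for: 15m.
print(covered_window_minutes("5m", "15m"))  # 20.0
```

Under that reading, "the last 20 minutes" describes the oldest data that could have triggered the alert, while `for: 15m` alone describes how long the condition has held.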

- alert: VaultTokenCreate
expr: increase(vault_token_create_count[5m]) > 100
for: 15m
labels:
severity: critical
annotations:
summary: High frequency of created Vault token
-description: There has been an increased number of Vault token creation in the last 15 minutes
+description: There has been an increased number of Vault token creation in the last 20 minutes
- alert: VaultTokenStore
expr: increase(vault_token_store_count[5m]) > 100
for: 15m
labels:
severity: critical
annotations:
summary: High frequency of stored Vault token
-description: There has been an increased number of Vault token storing in the last 15 minutes
+description: There has been an increased number of Vault token storing in the last 20 minutes