Updating the prod monitoring alerts and update ref app node selector to schedule on specific node pool #633

Merged 22 commits on Oct 16, 2023
Commits
538ec92
Removing duplicate alerts from ci recommended alerts
Sohamdg081992 Jun 30, 2023
52afdc9
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Jul 8, 2023
667fa62
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Jul 14, 2023
5ed84bc
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Jul 18, 2023
4280706
Remove test branch
Sohamdg081992 Jul 18, 2023
5244d82
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Jul 26, 2023
c0e61b9
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Jul 26, 2023
def76a5
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 2, 2023
94a94f8
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 10, 2023
12ad6c4
Remove preview keyword from policy readme
Sohamdg081992 Aug 10, 2023
d882600
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 11, 2023
2d85282
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 16, 2023
61cd57b
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 19, 2023
431e85d
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 22, 2023
fad0d47
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 28, 2023
aa2b1ec
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Aug 30, 2023
ad196bd
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Sep 8, 2023
2dbe39d
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Sep 28, 2023
05c5435
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Oct 3, 2023
f657b5e
Merge branch 'main' of https://github.com/Azure/prometheus-collector
Sohamdg081992 Oct 13, 2023
cdf0894
Updating the prod monitoring alerts and update ref app node selector …
Sohamdg081992 Oct 14, 2023
42eaf0e
.
Sohamdg081992 Oct 14, 2023
58 changes: 29 additions & 29 deletions internal/alerts/example-alert-template.json
@@ -21,7 +21,7 @@
{
"alert": "Amd64 metric missing in cluster ci-dev-aks-mac-eus",
"expression": "absent(node_uname_info{machine=\"x86_64\"}) == 1 or node_uname_info{machine=\"x86_64\"} == 0",
- "for": "PT3M",
+ "for": "PT30M",
"annotations": {
"description": "Amd64 metric missing in cluster ci-dev-aks-mac-eus"
},
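The `for` values in this template are written as ISO 8601 durations, so this hunk lengthens the pending period from 3 minutes to 30 minutes. A minimal Python sketch of that reading, assuming only the `PT<n>H<n>M<n>S` subset is needed; `parse_pt_duration` is a hypothetical helper for illustration, not part of the repository:

```python
import re
from datetime import timedelta

# Hypothetical helper: handles only the time part of an ISO 8601 duration
# (hours, minutes, seconds), which covers values such as "PT3M" and "PT30M".
_PT = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")

def parse_pt_duration(value: str) -> timedelta:
    match = _PT.match(value)
    # Reject strings that do not match, and a bare "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

old = parse_pt_duration("PT3M")
new = parse_pt_duration("PT30M")
print(old, new, new / old)
```

Under this reading, the condition has to hold ten times longer before the alert fires.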
@@ -38,8 +38,8 @@
},
{
"alert": "up metric missing for target = node in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"node\"}) == 1 or up{job=\"node\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"node\"}[30m]) == 1 or count(up{job=\"node\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = node in cluster ci-dev-aks-mac-eus"
},
@@ -56,8 +56,8 @@
},
{
"alert": "up metric missing for target = kubelet in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kubelet\"}) == 1 or up{job=\"kubelet\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kubelet\"}[30m]) == 1 or count(up{job=\"kubelet\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kubelet in cluster ci-dev-aks-mac-eus"
},
@@ -74,8 +74,8 @@
},
{
"alert": "up metric missing for target = windows-exporter in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"windows-exporter\"}) == 1 or up{job=\"windows-exporter\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"windows-exporter\"}[30m]) == 1 or count(up{job=\"windows-exporter\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = windows-exporter in cluster ci-dev-aks-mac-eus"
},
@@ -92,8 +92,8 @@
},
{
"alert": "up metric missing for target = kube-proxy in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kube-proxy\"}) == 1 or up{job=\"kube-proxy\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kube-proxy\"}[30m]) == 1 or count(up{job=\"kube-proxy\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kube-proxy in cluster ci-dev-aks-mac-eus"
},
@@ -110,8 +110,8 @@
},
{
"alert": "up metric missing for target = kube-apiserver in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kube-apiserver\"}) == 1 or up{job=\"kube-apiserver\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kube-apiserver\"}[30m]) == 1 or count(up{job=\"kube-apiserver\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kube-apiserver in cluster ci-dev-aks-mac-eus"
},
@@ -128,8 +128,8 @@
},
{
"alert": "up metric missing for target = kube-proxy-windows in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kube-proxy-windows\"}) == 1 or up{job=\"kube-proxy-windows\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kube-proxy-windows\"}[30m]) == 1 or count(up{job=\"kube-proxy-windows\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kube-proxy-windows in cluster ci-dev-aks-mac-eus"
},
@@ -146,8 +146,8 @@
},
{
"alert": "up metric missing for target = kube-state-metrics in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kube-state-metrics\"}) == 1 or up{job=\"kube-state-metrics\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kube-state-metrics\"}[30m]) == 1 or count(up{job=\"kube-state-metrics\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kube-state-metrics in cluster ci-dev-aks-mac-eus"
},
@@ -164,8 +164,8 @@
},
{
"alert": "up metric missing for target = cadvisor in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"cadvisor\"}) == 1 or up{job=\"cadvisor\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"cadvisor\"}[30m]) == 1 or count(up{job=\"cadvisor\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = cadvisor in cluster ci-dev-aks-mac-eus"
},
@@ -182,8 +182,8 @@
},
{
"alert": "up metric missing for target = kube-dns in cluster ci-dev-aks-mac-eus",
- "expression": "absent(up{job=\"kube-dns\"}) == 1 or up{job=\"kube-dns\"} == 0",
- "for": "PT3M",
+ "expression": "absent_over_time(up{job=\"kube-dns\"}[30m]) == 1 or count(up{job=\"kube-dns\"} == 1) == 0",
+ "for": "PT30M",
"annotations": {
"description": "up metric is not flowing for target = kube-dns in cluster ci-dev-aks-mac-eus"
},
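The `up` alerts above all move to the same two-branch shape: `absent_over_time(up{job=...}[30m]) == 1 or count(up{job=...} == 1) == 0`. A rough Python model of how those two branches evaluate, assuming standard PromQL semantics (an aggregation over an empty vector returns no result rather than 0). The function names and the dict-based vectors are hypothetical simplifications, not the rule evaluator itself:

```python
from typing import Dict, List, Optional

# Simplified model: an instant vector maps a series name to its current value,
# and a range vector maps a series name to the samples inside the window.
InstantVector = Dict[str, float]
RangeVector = Dict[str, List[float]]

def absent_over_time(window: RangeVector) -> Optional[float]:
    """Return 1 when the window holds no samples at all, else no result."""
    has_samples = any(len(samples) > 0 for samples in window.values())
    return None if has_samples else 1.0

def count_equal(vector: InstantVector, wanted: float) -> Optional[float]:
    """Model count(v == wanted): counting an empty vector gives no result."""
    matching = [v for v in vector.values() if v == wanted]
    return float(len(matching)) if matching else None

def alert_condition(window: RangeVector, now: InstantVector) -> bool:
    """Model: absent_over_time(up[30m]) == 1 or count(up == 1) == 0."""
    left = absent_over_time(window) == 1.0
    right = count_equal(now, 1.0) == 0.0
    return left or right

healthy = alert_condition({"node-a": [1.0, 1.0]}, {"node-a": 1.0})
missing = alert_condition({}, {})
all_down = alert_condition({"node-a": [0.0, 0.0]}, {"node-a": 0.0})
print(healthy, missing, all_down)
```

In this model the `count(...) == 0` branch never produces a result, because the count is either at least 1 or absent. The condition therefore rests on the `absent_over_time` branch, and a job whose targets all report `up == 0` does not satisfy it. That is worth checking against the actual rule evaluator before relying on these alerts for down targets.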
@@ -199,11 +199,11 @@
]
},
{
- "alert": "CPU usage % greater than 90 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
- "expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) > 0.9",
+ "alert": "CPU usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
+ "expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) *100 > 75",
"for": "PT3M",
"annotations": {
- "description": "CPU usage greater than 90% for prometheus-collector on cluster ci-dev-aks-mac-eus"
+ "description": "CPU usage greater than 75% for prometheus-collector on cluster ci-dev-aks-mac-eus"
},
"severity": 4,
"resolveConfiguration": {
@@ -217,11 +217,11 @@
]
},
{
- "alert": "CPU usage % greater than 50 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
- "expression": "sum(sum by (cluster, namespace, pod, container) ( rate(container_cpu_usage_seconds_total{job=\"cadvisor\", image!=\"\", namespace=\"kube-system\", container=\"prometheus-collector\"}[5m]) ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=\"\", namespace=\"kube-system\"}) )) by (container, pod) > 0.5",
+ "alert": "Memory usage % greater than 75 for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
+ "expression": "(sum(container_memory_working_set_bytes{namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_limits{namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) > 75",
"for": "PT3M",
"annotations": {
- "description": "CPU usage greater than 5% for prometheus-collector on cluster ci-dev-aks-mac-eus"
+ "description": "Memory usage greater than 75% for prometheus-collector containers on cluster ci-dev-aks-mac-eus"
},
"severity": 4,
"resolveConfiguration": {
@@ -235,11 +235,11 @@
]
},
{
- "alert": "Memory usage is high for prometheus-collector containers on cluster ci-dev-aks-mac-eus",
- "expression": "(sum(container_memory_working_set_bytes{namespace=\"kube-system\", container=\"prometheus-collector\", image!=\"\"}) by (container, pod) / sum(kube_pod_container_resource_requests{namespace=\"kube-system\", container=\"prometheus-collector\", resource=\"memory\"}) by (container, pod)) > 1.9",
- "for": "PT3M",
+ "alert": "Custom job metric missing for target = prometheus_ref_app in cluster ci-dev-aks-mac-eus",
+ "expression": "absent_over_time(myapp_rainfall_histogram_sum[30m]) == 1 or count(myapp_rainfall_histogram_sum == 1) == 0",
+ "for": "PT30M",
"annotations": {
- "description": "Memory usage is high for prometheus-collector containers on cluster ci-dev-aks-mac-eus"
+ "description": "Custom job metric missing for target = prometheus_ref_app in cluster ci-dev-aks-mac-eus"
},
"severity": 4,
"resolveConfiguration": {
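The new CPU expression multiplies by 100 before comparing against 75, while the new memory expression compares the raw working-set-to-limit ratio against 75. A small Python sketch of that arithmetic, with made-up numbers; `cpu_percent` and `memory_ratio` are hypothetical stand-ins for the PromQL sub-expressions, not code from the repository:

```python
def cpu_percent(cores_used: float) -> float:
    """Model of the CPU expression: a per-second CPU rate (in cores) times 100."""
    return cores_used * 100.0

def memory_ratio(working_set_bytes: float, limit_bytes: float) -> float:
    """Model of the memory expression as written: working set divided by limit."""
    return working_set_bytes / limit_bytes

GIB = 1024 ** 3

# Hypothetical container using 0.8 cores and 90% of a 1 GiB memory limit.
cpu_fires = cpu_percent(0.8) > 75
ratio = memory_ratio(0.9 * GIB, 1.0 * GIB)
memory_fires_as_written = ratio > 75
memory_fires_if_scaled = ratio * 100 > 75

print(cpu_fires, memory_fires_as_written, memory_fires_if_scaled)
```

Under these assumptions the ratio stays near 1 or below, so a threshold of 75 is only reached if the ratio is scaled to percent first. In the CPU expression, scaling by 100 treats one full core as 100%.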
1 change: 1 addition & 0 deletions internal/referenceapp/prometheus-reference-app.yaml
@@ -34,6 +34,7 @@ spec:
protocol: TCP
nodeSelector:
kubernetes.io/os: linux
+ architecture: amd64
---
apiVersion: v1
kind: Service
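A `nodeSelector` only admits nodes that carry every listed label with the same value, so the added `architecture: amd64` entry narrows scheduling to nodes that have that exact label. A minimal Python sketch of the matching rule; the node label sets below are hypothetical, and treating `architecture` as a custom label applied to the target node pool is an assumption, since the diff does not show where it is set:

```python
from typing import Dict

def node_matches(selector: Dict[str, str], node_labels: Dict[str, str]) -> bool:
    """A node is eligible only if it has every selector label with an equal value."""
    return all(node_labels.get(key) == value for key, value in selector.items())

# Selector from the reference app manifest after this change.
selector = {"kubernetes.io/os": "linux", "architecture": "amd64"}

# Hypothetical node label sets for illustration.
labelled_pool = {
    "kubernetes.io/os": "linux",
    "kubernetes.io/arch": "amd64",
    "architecture": "amd64",
}
unlabelled_pool = {
    "kubernetes.io/os": "linux",
    "kubernetes.io/arch": "amd64",
}

print(node_matches(selector, labelled_pool))
print(node_matches(selector, unlabelled_pool))
```

In this sketch a node that only carries the built-in `kubernetes.io/arch` label does not match, so the pod would stay unscheduled unless some node pool has the `architecture` label.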