Commit 1cd42a9

Add monitoring setup example

michaeljguarino committed Mar 30, 2024
1 parent 41b7893 commit 1cd42a9

Showing 27 changed files with 1,153 additions and 1 deletion.
1 change: 0 additions & 1 deletion .gitignore
@@ -33,7 +33,6 @@ override.tf.json
.terraformrc
terraform.rc

- helm-values
test/helm-values

# IDE

26 changes: 26 additions & 0 deletions resources/monitoring/README.md
@@ -0,0 +1,26 @@
# Prometheus Monitoring Setup

This gives an overview of a production-ready observability setup, using Prometheus for timeseries metrics collection and Loki for log aggregation. It sets up central instances of Prometheus and Loki (on your management cluster in this case, though they could live elsewhere), plus promtail and prometheus agent installs on each workload cluster to collect logs and metrics and ship them back centrally.

A quick overview of repository structure:

* `/terraform` - example terraform you can rework to set up cloud resources. In this case it creates the s3 bucket loki needs to persist log data, and provisions the initial services that kick off the service-of-services process to set up everything else
* `/helm-values` - values files for all the charts needed. Note the `.liquid` variants support templating both configuration values and other contextual information, which is especially useful in global service contexts
* `/services` - the service-of-services that sets up all the main components; in order these are (a sketch of one such service definition follows this list):
  - prometheus agent, replicated across clusters as a global service
  - prometheus itself, deployed via kube-prometheus-stack
  - loki, deployed on the mgmt cluster
  - promtail, replicated as a global service
* `/helm-repositories` - the Flux HelmRepository CRDs needed to create helm repository services for the various charts
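
For orientation, here is a hedged sketch of what one of these service definitions might look like, assuming the Plural Console `ServiceDeployment` CRD shape; names, namespaces, and the chart version are placeholders, and the real definitions live under `/services`:

```yaml
# Hedged sketch only -- the real definitions live under /services.
# Names, namespaces, and the chart version here are placeholders.
apiVersion: deployments.plural.sh/v1alpha1
kind: ServiceDeployment
metadata:
  name: kps
  namespace: infra
spec:
  namespace: monitoring
  helm:
    chart: kube-prometheus-stack
    version: x.x.x                 # pin to the version the repo actually uses
    repository:
      name: prometheus-community   # matches the Flux HelmRepository in /helm-repositories
      namespace: infra
    valuesFiles:
      - kps-mgmt.liquid
  clusterRef:
    kind: Cluster
    name: mgmt                     # placeholder cluster name
    namespace: infra
```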

## Adopting this setup

We'd recommend copy-pasting this into a repo you own to ease customization. There are a few points you'll need to customize:

* URLs for your prometheus/loki domains, in the `kps-*` and `loki-*` helm values. They'll usually be in ingress configuration, but can appear elsewhere too. Our defaults were `loki.boot-aws.onplural.sh`, `prometheus.boot-aws.onplural.sh`, etc.
* cluster names in `servicedeployment.yaml` files, e.g. `services/kps-agent-fleet/servicedeployment.yaml` is wired to our default cluster name, `boot-staging`
* you currently need to manually set `basicAuthUser` and `basicAuthPassword` in your root service-of-services' secrets to configure basic auth for both loki and prometheus (the sketch below shows where these typically land)
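
Here's a minimal sketch of that basic-auth wiring, assuming an ingress-nginx controller fronts the Prometheus ingress; the authoritative settings are in the `kps-*` and `loki-*` values files, and the secret name below is hypothetical (it would hold an htpasswd entry built from `basicAuthUser`/`basicAuthPassword`):

```yaml
# Minimal sketch, assuming ingress-nginx; see the kps-* / loki-* values files
# for the real configuration. The auth-secret name is hypothetical.
prometheus:
  ingress:
    enabled: true
    annotations:
      nginx.ingress.kubernetes.io/auth-type: basic
      nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
      nginx.ingress.kubernetes.io/auth-realm: Authentication Required
    hosts:
      - prometheus.boot-aws.onplural.sh
```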

## Configure Prometheus and Loki for your Console UI

The console can take prometheus and loki connection information and surface log aggregation and metrics views in useful places in the UI. The configuration is nested under the deployment settings tab, at `/cd/settings/observability`. Be sure to use the same values as in the basic auth configuration above.
8 changes: 8 additions & 0 deletions resources/monitoring/helm-repositories/grafana.yaml
@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: grafana
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://grafana.github.io/helm-charts
8 changes: 8 additions & 0 deletions resources/monitoring/helm-repositories/opencost.yaml
@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: opencost
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://opencost.github.io/opencost-helm-chart
@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://prometheus-community.github.io/helm-charts
115 changes: 115 additions & 0 deletions resources/monitoring/helm-values/kps-agent.yaml
@@ -0,0 +1,115 @@
fullnameOverride: monitoring

defaultRules:
  create: false
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8sContainerCpuUsageSecondsTotal: true
    k8sContainerMemoryCache: true
    k8sContainerMemoryRss: true
    k8sContainerMemorySwap: true
    k8sContainerResource: true
    k8sContainerMemoryWorkingSetBytes: true
    k8sPodOwner: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true
    windows: true

alertmanager:
  enabled: false
  fullnameOverride: kps-alertmanager

prometheusOperator:
  tls:
    enabled: false
  admissionWebhooks:
    enabled: false
  prometheusConfigReloader:
    resources:
      requests:
        cpu: 200m
        memory: 50Mi
      limits:
        memory: 100Mi

grafana:
  enabled: false

# monitored k8s components
kubeApiServer:
  enabled: true

kubelet:
  enabled: true

coreDns:
  enabled: true

# already monitored with coreDns
kubeDns:
  enabled: false

kubeProxy:
  enabled: true

kubeStateMetrics:
  enabled: true

kube-state-metrics:
  fullnameOverride: kps-kube-state-metrics
  selfMonitor:
    enabled: true

nodeExporter:
  enabled: true

prometheus-node-exporter:
  fullnameOverride: kps-node-exporter
  prometheus:
    monitor:
      enabled: true
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2048Mi

# EKS hides metrics for controller manager, scheduler, and etcd
# https://github.com/aws/eks-anywhere/issues/4405
# disable kube controller manager scraping
kubeControllerManager:
  enabled: false

# disable kube scheduler scraping
kubeScheduler:
  enabled: false

kubeEtcd:
  enabled: false

24 changes: 24 additions & 0 deletions resources/monitoring/helm-values/kps-agent.yaml.liquid
@@ -0,0 +1,24 @@
prometheus:
  enabled: true
  agentMode: true
  extraSecret:
    name: basic-auth-remote
    data:
      user: {{ configuration.basicAuthUser }}
      password: {{ configuration.basicAuthPassword }}
  prometheusSpec:
    remoteWrite:
      - url: https://prometheus.boot-aws.onplural.sh/api/v1/write
        name: mgmt-cluster-prometheus
        basicAuth:
          username:
            name: basic-auth-remote
            key: user
          password:
            name: basic-auth-remote
            key: password
        writeRelabelConfigs:
          - sourceLabels: []
            targetLabel: 'cluster'
            replacement: {{ cluster.Handle }}

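Note the `writeRelabelConfigs` entry above: an empty `sourceLabels` list matches every sample, so the rule unconditionally stamps a constant `cluster` label onto everything shipped to the central Prometheus. On a workload cluster whose handle is `boot-staging` (the default name used elsewhere in this example), the templated rule would render as:

```yaml
writeRelabelConfigs:
  - sourceLabels: []            # empty selector matches every sample
    targetLabel: 'cluster'
    replacement: boot-staging   # the cluster handle, filled in by liquid templating
```
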
130 changes: 130 additions & 0 deletions resources/monitoring/helm-values/kps-mgmt.liquid
@@ -0,0 +1,130 @@
prometheus:
  prometheusSpec:
    prometheusExternalLabelName: {{ cluster.Handle }}
    # incoming metrics from workload clusters will have a cluster label set to the cluster handle, but that's only assigned at remote write push time
    # the mgmt cluster itself will not have a cluster label; one could use an external label, but that's also only added at push time
    # ideally we would have a scrape class that adds a cluster label to all targets scraped by prometheus; this is not supported yet, but probably will be in the next release of the prometheus operator
    # in the meantime we add a relabel config to each servicemonitor individually (see below)
    # this is what it would look like:
    # use scrape classes in the future to add a cluster label to all targets scraped by prometheus
    # this makes sure we can always identify the cluster a target belongs to, even for the mgmt cluster prometheus
    # https://github.com/prometheus-operator/prometheus-operator/pull/5978
    # https://github.com/prometheus-operator/prometheus-operator/pull/6379
    #additionalConfig:
    #  scrapeClasses:
    #    - name: cluster_label
    #      default: true
    #      relabelings:
    #        - sourceLabels: []
    #          targetLabel: cluster
    #          replacement: {{ cluster.Handle }}
  serviceMonitor:
    enabled: true
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

alertmanager:
  serviceMonitor:
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

grafana:
  serviceMonitor:
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

kubeApiServer:
  serviceMonitor:
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

kubelet:
  serviceMonitor:
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

coreDns:
  serviceMonitor:
    enabled: true
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

# already monitored with coreDns
kubeDns:
  serviceMonitor:
    enabled: true
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

kubeProxy:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

kubeStateMetrics:
  enabled: true

kube-state-metrics:
  fullnameOverride: kps-kube-state-metrics
  selfMonitor:
    enabled: true
  prometheus:
    monitor:
      enabled: true
      relabelings:
        - sourceLabels: []
          targetLabel: cluster
          replacement: {{ cluster.Handle }}

nodeExporter:
  enabled: true

prometheus-node-exporter:
  fullnameOverride: kps-node-exporter
  prometheus:
    monitor:
      enabled: true
      relabelings:
        - sourceLabels: []
          targetLabel: cluster
          replacement: {{ cluster.Handle }}
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2048Mi

# EKS hides metrics for controller manager, scheduler, and etcd
# https://github.com/aws/eks-anywhere/issues/4405
# disable kube controller manager scraping
kubeControllerManager:
  enabled: false

# disable kube scheduler scraping
kubeScheduler:
  enabled: false

kubeEtcd:
  enabled: false
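
Until scrape classes land in the prometheus operator, any extra ServiceMonitor you add for your own workloads needs the same per-monitor relabeling to pick up the `cluster` label. A minimal sketch for a hypothetical app, templated the same way as the values above:

```yaml
# Hypothetical example app; the relabelings block is the point here.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      relabelings:
        - sourceLabels: []
          targetLabel: cluster
          replacement: {{ cluster.Handle }}   # via a .liquid template, as above
```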