Commit 1cd42a9 (1 parent: 41b7893)
Showing 27 changed files with 1,153 additions and 1 deletion.

@@ -33,7 +33,6 @@ override.tf.json
.terraformrc
terraform.rc

helm-values
test/helm-values

# IDE

@@ -0,0 +1,26 @@
# Prometheus Monitoring Setup

This is an overview of a production-ready observability setup, with Prometheus for time-series metrics collection and Loki for log aggregation. It sets up central instances of Prometheus and Loki (on your management cluster in this case, though they could run elsewhere), plus Promtail and Prometheus agent installs that collect logs and metrics on each cluster and ship them to the central instances.

A quick overview of the repository structure:

* `/terraform` - example Terraform you can rework to set up cloud resources. In this case it creates the S3 bucket Loki needs to persist its data, and it creates the initial services that start the service-of-services process, which provisions everything else
* `/helm-values` - values files for all the charts needed. Note that the `.liquid` variants support templating of both configuration values and other contextual information (such as the cluster handle), which is especially useful in global service contexts
* `/services` - the service-of-services that sets up all the main components. In order, these are:
    - Prometheus agent, replicated across clusters as a global service
    - Prometheus itself, deployed via kube-prometheus-stack
    - Loki, deployed on the mgmt cluster
    - Promtail, replicated as a global service
* `/helm-repositories` - Flux `HelmRepository` resources needed to create Helm repository services for the various resources

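To make the log-shipping half of this layout concrete, here is a minimal sketch of what a Promtail `.liquid` values file could look like when it pushes to a central Loki over basic auth. It is an illustration only and not a copy of the values files in this commit: the host `loki.example.com` is a placeholder, and reusing `configuration.basicAuthUser`, `configuration.basicAuthPassword`, and `cluster.Handle` here is an assumption modelled on how the Prometheus agent values use them.

```yaml
# Hypothetical sketch of grafana/promtail chart values (liquid template, rendered before use).
config:
  clients:
    # Placeholder host: replace with your own Loki domain.
    - url: https://loki.example.com/loki/api/v1/push
      basic_auth:
        # Assumed to come from the same service secrets used for Prometheus.
        username: {{ configuration.basicAuthUser }}
        password: {{ configuration.basicAuthPassword }}
      external_labels:
        # Tags every log stream with the cluster it was shipped from.
        cluster: {{ cluster.Handle }}
```

Setting the `cluster` label at the client mirrors what the Prometheus agent does with `writeRelabelConfigs`, so logs and metrics from the same cluster can be correlated by the same label.
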
## Adopting this setup

We'd recommend copy-pasting this into a repo you own to make customization easier. There are a few things you need to know in order to customize it:

* URLs for your Prometheus/Loki domains, in the kps-* and loki-* helm values. They are usually in the ingress configuration, but they also appear elsewhere (for example, the Prometheus agent's `remoteWrite` URL). Our defaults were `loki.boot-aws.onplural.sh`, `prometheus.boot-aws.onplural.sh`, etc.
* cluster names in the `servicedeployment.yaml` files, e.g. `services/kps-agent-fleet/servicedeployment.yaml` is wired to our default cluster name, `boot-staging`
* you currently need to manually set `basicAuthUser` and `basicAuthPassword` in the secrets of your root service-of-services to configure basic auth for both Loki and Prometheus.

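As a rough illustration of the first point, the sketch below shows where a Prometheus hostname typically sits in kube-prometheus-stack values. The host `prometheus.example.com`, the ingress class `nginx`, and the TLS secret name `prometheus-tls` are placeholders chosen for this example, not values taken from this repository.

```yaml
# Hypothetical kube-prometheus-stack values fragment; all hostnames and names are placeholders.
prometheus:
  ingress:
    enabled: true
    ingressClassName: nginx          # assumption: adjust to your ingress controller
    hosts:
      - prometheus.example.com       # replace with your own domain
    tls:
      - secretName: prometheus-tls   # hypothetical TLS secret name
        hosts:
          - prometheus.example.com
```

Whatever host you choose here should also be the host in the agents' `remoteWrite` URL (`https://<host>/api/v1/write`), otherwise the workload clusters will keep pushing to the old domain.
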
## Configure Prometheus and Loki for your Console UI

The Console can take Prometheus and Loki connection information and use it to provide log aggregation and metrics views in relevant places in the UI. The configuration is nested under the deployment settings tab, at `/cd/settings/observability`. Be sure to use the same credentials as in the basic auth configuration above.

@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: grafana
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://grafana.github.io/helm-charts

@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: opencost
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://opencost.github.io/opencost-helm-chart

resources/monitoring/helm-repositories/prometheuscommunity.yaml (8 additions, 0 deletions)
@@ -0,0 +1,8 @@
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: {{ configuration.namespace }}
spec:
  interval: 5m0s
  url: https://prometheus-community.github.io/helm-charts

@@ -0,0 +1,115 @@
fullnameOverride: monitoring

defaultRules:
  create: false
  rules:
    alertmanager: true
    etcd: true
    configReloaders: true
    general: true
    k8sContainerCpuUsageSecondsTotal: true
    k8sContainerMemoryCache: true
    k8sContainerMemoryRss: true
    k8sContainerMemorySwap: true
    k8sContainerResource: true
    k8sContainerMemoryWorkingSetBytes: true
    k8sPodOwner: true
    kubeApiserverAvailability: true
    kubeApiserverBurnrate: true
    kubeApiserverHistogram: true
    kubeApiserverSlos: true
    kubeControllerManager: true
    kubelet: true
    kubeProxy: true
    kubePrometheusGeneral: true
    kubePrometheusNodeRecording: true
    kubernetesApps: true
    kubernetesResources: true
    kubernetesStorage: true
    kubernetesSystem: true
    kubeSchedulerAlerting: true
    kubeSchedulerRecording: true
    kubeStateMetrics: true
    network: true
    node: true
    nodeExporterAlerting: true
    nodeExporterRecording: true
    prometheus: true
    prometheusOperator: true
    windows: true


alertmanager:
  enabled: false
  fullnameOverride: kps-alertmanager


prometheusOperator:
  tls:
    enabled: false
  admissionWebhooks:
    enabled: false
  prometheusConfigReloader:
    resources:
      requests:
        cpu: 200m
        memory: 50Mi
      limits:
        memory: 100Mi
grafana:
  enabled: false

# monitored k8s components
kubeApiServer:
  enabled: true

kubelet:
  enabled: true

coreDns:
  enabled: true

# already monitored with coreDns
kubeDns:
  enabled: false

kubeProxy:
  enabled: true

kubeStateMetrics:
  enabled: true

kube-state-metrics:
  fullnameOverride: kps-kube-state-metrics
  selfMonitor:
    enabled: true

nodeExporter:
  enabled: true

prometheus-node-exporter:
  fullnameOverride: kps-node-exporter
  prometheus:
    monitor:
      enabled: true
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2048Mi

# EKS hides metrics for controller manager, scheduler, and etcd
# https://github.com/aws/eks-anywhere/issues/4405
# disable kube controller manager scraping
kubeControllerManager:
  enabled: false

# disable kube scheduler scraping
kubeScheduler:
  enabled: false

kubeEtcd:
  enabled: false

@@ -0,0 +1,24 @@
prometheus:
  enabled: true
  agentMode: true
  extraSecret:
    name: basic-auth-remote
    data:
      user: {{ configuration.basicAuthUser }}
      password: {{ configuration.basicAuthPassword }}
  prometheusSpec:
    remoteWrite:
    - url: https://prometheus.boot-aws.onplural.sh/api/v1/write
      name: mgmt-cluster-prometheus
      basicAuth:
        username:
          name: basic-auth-remote
          key: user
        password:
          name: basic-auth-remote
          key: password
      writeRelabelConfigs:
      - sourceLabels: []
        targetLabel: 'cluster'
        replacement: {{ cluster.Handle }}

@@ -0,0 +1,130 @@
prometheus:
  prometheusSpec:
    prometheusExternalLabelName: {{ cluster.Handle }}
    # incoming metrics from workload clusters will have a cluster label set to the cluster handle, but that's only assigned at the remote write push
    # the mgmt cluster itself will not have a cluster label set, one can use an external label, but that's only added on push time
    # ideally we would have a scrape class that would add a cluster label to all targets scraped by prometheus, this is currently not supported, but will be with probably the next release of the prometheus operator
    # in the meantime we add relabel config to all servicemonitors individually (see below)
    # this is how it would look like:
    # use scrape classes in the future to add a cluster label to all targets scraped by prometheus
    # this will make sure we can always identify the cluster a target belongs to, even for the mgmt cluster prometheus
    # https://github.com/prometheus-operator/prometheus-operator/pull/5978
    # https://github.com/prometheus-operator/prometheus-operator/pull/6379
    #additionalConfig:
    #  scrapeClasses:
    #    - name: cluster_label
    #      default: true
    #      relabelings:
    #        - sourceLabels: []
    #          targetLabel: cluster
    #          replacement: {{ cluster.Handle }}
  serviceMonitor:
    enabled: true
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

alertmanager:
  serviceMonitor:
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

grafana:
  serviceMonitor:
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

kubeApiServer:
  serviceMonitor:
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}


kubelet:
  serviceMonitor:
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}


coreDns:
  serviceMonitor:
    enabled: true
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

# already monitored with coreDns
kubeDns:
  serviceMonitor:
    enabled: true
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

kubeProxy:
  enabled: true
  serviceMonitor:
    enabled: true
    relabelings:
    - sourceLabels: []
      targetLabel: cluster
      replacement: {{ cluster.Handle }}

kubeStateMetrics:
  enabled: true

kube-state-metrics:
  fullnameOverride: kps-kube-state-metrics
  selfMonitor:
    enabled: true
  prometheus:
    monitor:
      enabled: true
      relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

nodeExporter:
  enabled: true

prometheus-node-exporter:
  fullnameOverride: kps-node-exporter
  prometheus:
    monitor:
      enabled: true
      relabelings:
      - sourceLabels: []
        targetLabel: cluster
        replacement: {{ cluster.Handle }}

  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 2048Mi

# EKS hides metrics for controller manager, scheduler, and etcd
# https://github.com/aws/eks-anywhere/issues/4405
# disable kube controller manager scraping
kubeControllerManager:
  enabled: false

# disable kube scheduler scraping
kubeScheduler:
  enabled: false

kubeEtcd:
  enabled: false