Skip to content

Commit

Permalink
docs: Update troubleshooting and navigation (#228)
Browse files Browse the repository at this point in the history
* Update docs structure with regrouped patterns

* Rename multi cluster to avoid ambiguity with cross account

* Separate troubleshooting for eks-monitoring
  • Loading branch information
bonclay7 authored Sep 6, 2023
1 parent 9498a06 commit f65afff
Show file tree
Hide file tree
Showing 4 changed files with 272 additions and 82 deletions.
73 changes: 0 additions & 73 deletions docs/eks/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -191,76 +191,3 @@ sum(up{job="custom-metrics"}) by (container_name, cluster, nodename)
```

<img width="2560" alt="Screenshot 2023-01-31 at 11 16 21" src="https://user-images.githubusercontent.com/10175027/215869004-e05f557d-c81a-41fb-a452-ede9f986cb27.png">

## Troubleshooting

### 1. Grafana dashboards missing or Grafana API key expired

In case you don't see the grafana dashboards in your Amazon Managed Grafana console, check on the logs on your grafana operator pod using the below command :

```bash
kubectl get pods -n grafana-operator
```

Output:

```console
NAME READY STATUS RESTARTS AGE
grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m
```

```bash
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
```

Output:

```console
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
```

If you observe, the the above `grafana-api-key error` in the logs, your grafana API key is expired. Please use the operational procedure to update your `grafana-api-key` :

- First, lets create a new Grafana API key.

```bash
export GO_AMG_API_KEY=$(aws grafana create-workspace-api-key \
--key-name "grafana-operator-key-new" \
--key-role "ADMIN" \
--seconds-to-live 432000 \
--workspace-id <YOUR_WORKSPACE_ID> \
--query key \
--output text)
```

- Finally, update the Grafana API key secret in AWS SSM Parameter Store using the above new Grafana API key:

```bash
aws aws ssm put-parameter \
--name "/terraform-accelerator/grafana-api-key" \
--type "SecureString" \
--value "{\"GF_SECURITY_ADMIN_APIKEY\": \"${GO_AMG_API_KEY}\"}" \
--region <Your AWS Region>
```

- If the issue persists, you can force the synchronization by deleting the `externalsecret` Kubernetes object.

```bash
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
```

### 2. Upgrade from 2.1.0 or earlier

When you upgrade the eks-monitoring module from v2.1.0 or earlier, the following error may occur.

```bash
Error: cannot patch "prometheus-node-exporter" with kind DaemonSet: DaemonSet.apps "prometheus-node-exporter" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{"app.kubernetes.io/instance":"prometheus-node-exporter", "app.kubernetes.io/name":"prometheus-node-exporter"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
```

This is due to the upgrade of the node-exporter chart from v2 to v4. Manually delete the node-exporter's DaemonSet as described in [the link here](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter#3x-to-4x), and then apply.

```bash
kubectl -n prometheus-node-exporter delete daemonset -l app=prometheus-node-exporter
terraform apply
```
7 changes: 5 additions & 2 deletions docs/eks/multicluster.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
# AWS EKS Multicluster Observability
# AWS EKS Multicluster Observability (single AWS Account)

This example shows how to use the [AWS Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator), with more than one EKS cluster and verify the collected metrics from all the clusters in the dashboards of a common `Amazon Managed Grafana` workspace.
This example shows how to use the [AWS Observability Accelerator](https://github.com/aws-observability/terraform-aws-observability-accelerator),
with more than one EKS cluster in a single account and visualize the collected
metrics from all the clusters in the dashboards of a common
`Amazon Managed Grafana` workspace.

## Prerequisites

Expand Down
257 changes: 257 additions & 0 deletions docs/eks/troubleshooting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,257 @@
# Troubleshooting guide for Amazon EKS monitoring module

Depending on your setup, you might face a few errors. If you encounter an error
not listed here, please open an issue in the [issues section](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues)

These guide applies to the [eks-monitoring Terraform module](https://github.com/aws-observability/terraform-aws-observability-accelerator/tree/main/modules/eks-monitoring)


## Cluster authentication issue

### Error message

```console
│ Error: cluster-secretstore-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.cluster_secretstore,
│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 59, in resource "kubectl_manifest" "cluster_secretstore":
│ 59: resource "kubectl_manifest" "cluster_secretstore" {
│ Error: grafana-operator/external-secrets-sm failed to create kubernetes rest client for update of resource: Get "https://FINGERPRINT.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s": dial tcp: lookup F867DE6CE883F9595FC8A73D84FB9F83.gr7.us-east-1.eks.amazonaws.com on 192.168.4.1:53: no such host
│ with module.eks_monitoring.module.external_secrets[0].kubectl_manifest.secret,
│ on ../../modules/eks-monitoring/add-ons/external-secrets/main.tf line 89, in resource "kubectl_manifest" "secret":
│ 89: resource "kubectl_manifest" "secret" {
```

### Resolution


To provision the `eks-monitoring` module, the environment where you are running
Terraform apply needs to be authenticated against your cluster and be your
current context. To verify, you can run a single `kubectl get nodes` command
to ensure you are using the correct Amazon EKS cluster.

To login agains the correct cluster, run:

```console
aws eks update-kubeconfig --name <cluster name> --region <aws region>
```

## Missing Grafana dashboards

Terraform apply can run without apparent errors and your Grafana workspace
won't present any dashboards. Many situations could lead to this as described
below. The best place to start would be checking the logs of `grafana-operator`,
`external-secrets` and `flux-system` pods.


### Wrong Grafana workspace

It might happen that you provide the wrong Grafana workspace. One way to verify
this is to run the following command:

```bash
kubectl describe grafanas external-grafana -n grafana-operator
```

You should see an output similar to this (truncated for brevity). Validate that
you have the correct URL. If that's the case, re-running Terraform with the
correct workspace ID, API key should fix this issue.

```console
...
Spec:
External:
API Key:
Key: GF_SECURITY_ADMIN_APIKEY
Name: grafana-admin-credentials
URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
Status:
Admin URL: https://g-workspaceid.grafana-workspace.eu-central-1.amazonaws.com
Dashboards:
grafana-operator/apiserver-troubleshooting-grafanadashboard/V3y_Zcb7k
grafana-operator/apiserver-basic-grafanadashboard/R6abPf9Zz
grafana-operator/java-grafanadashboard/m9mHfAy7ks
grafana-operator/grafana-dashboards-adothealth/reshmanat
grafana-operator/apiserver-advanced-grafanadashboard/09ec8aa1e996d6ffcd6817bbaff4db1b
grafana-operator/nginx-grafanadashboard/nginx
grafana-operator/kubelet-grafanadashboard/3138fa155d5915769fbded898ac09fd9
grafana-operator/cluster-grafanadashboard/efa86fd1d0c121a26444b636a3f509a8
grafana-operator/workloads-grafanadashboard/a164a7f0339f99e89cea5cb47e9be617
grafana-operator/grafana-dashboards-kubeproxy/632e265de029684c40b21cb76bca4f94
grafana-operator/nodes-grafanadashboard/200ac8fdbfbb74b39aff88118e4d1c2c
grafana-operator/node-exporter-grafanadashboard/v8yDYJqnz
grafana-operator/namespace-workloads-grafanadashboard/a87fb0d919ec0ea5f6543124e16c42a5
```


### Grafana API key expired

Check on the logs on your grafana operator pod using the below command :

```bash
kubectl get pods -n grafana-operator
```

Output:

```console
NAME READY STATUS RESTARTS AGE
grafana-operator-866d4446bb-nqq5c 1/1 Running 0 3h17m
```

```bash
kubectl logs grafana-operator-866d4446bb-nqq5c -n grafana-operator
```

Output:

```console
1.6857285045556655e+09 ERROR error reconciling datasource {"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample-amp","namespace":"grafana-operator"}, "namespace": "grafana-operator", "name": "grafanadatasource-sample-amp", "reconcileID": "72cfd60c-a255-44a1-bfbd-88b0cbc4f90c", "datasource": "grafanadatasource-sample-amp", "grafana": "external-grafana", "error": "status: 401, body: {\"message\":\"Expired API key\"}\n"}
github.com/grafana-operator/grafana-operator/controllers.(*GrafanaDatasourceReconciler).Reconcile
```

If you observe, the the above `grafana-api-key error` in the logs,
your grafana API key is expired.

Please use the operational procedure to update your `grafana-api-key` :

- Create a new Grafana API key, you can use [this step](https://aws-observability.github.io/terraform-aws-observability-accelerator/eks/#6-grafana-api-key)
and make sure the API key duration is not too short.

- Run Terraform with the new API key. Terraform will modify the AWS SSM
Parameter used by `externalsecret`.

- If the issue persists, you can force the synchronization by deleting the
`externalsecret` Kubernetes object.

```bash
kubectl delete externalsecret/external-secrets-sm -n grafana-operator
```

### Git repository errors

[Flux](https://fluxcd.io/flux/components/source/gitrepositories/) is responsible
to regularly pull and synchronize [dashboards and artifacts](https://github.com/aws-observability/aws-observability-accelerator)
into your EKS cluster. It might happen that its state gets corrupted.

You can verify those errors by using this command. You should see an error if
Flux is not able to pull correctly:

```bash
kubectl get gitrepositories -n flux-system
NAME URL AGE READY STATUS
aws-observability-accelerator https://github.com/aws-observability/aws-observability-accelerator 6d12h True stored artifact for revision 'v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d'
```

Depending on the error, you can delete the repository and re-run Terraform and
force the synchronization.

```bash
k delete gitrepositories aws-observability-accelerator -n flux-system
```

If you believe this is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).


### Flux Kustomizations

After Flux pulls the repository in the cluster state, it will apply [Kustomizations](https://fluxcd.io/flux/components/kustomize/kustomizations/)
to create Grafana data sources, folders and dashboards.

- Check the kustomization objects. Here you should see the dashboards you have
enabled

```bash
k get kustomizations.kustomize.toolkit.fluxcd.io -A
NAMESPACE NAME AGE READY STATUS
flux-system grafana-dashboards-adothealth 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-apiserver 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-infrastructure 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-java 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-kubeproxy 10d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
flux-system grafana-dashboards-nginx 18d True Applied revision: v0.2.0@sha1:c4819a990312f7c2597f529577471320e5c4ef7d
```

- To have more infos on an error, you can view the Kustomization controller logs

```bash
kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-65cc46469f-nsqd5 1/1 Running 2 (13d ago) 27d
image-automation-controller-d8f7bfcb4-k2m9j 1/1 Running 2 (13d ago) 27d
image-reflector-controller-68979dfd49-wh25h 1/1 Running 2 (13d ago) 27d
kustomize-controller-767677f7f5-c5xsp 1/1 Running 5 (13d ago) 63d
notification-controller-55d8c759f5-7df5l 1/1 Running 5 (13d ago) 63d
source-controller-58c66d55cd-4j6bl 1/1 Running 5 (13d ago) 63d
```

```bash
kubectl logs -f -n flux-system kustomize-controller-767677f7f5-c5xsp
```

If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

- Depending on the error, delete the kustomization object and re-apply Terraform

```bash
kubectl delete kustomizations -n flux-system grafana-dashboards-apiserver
```

### Grafana dashboards errors

If all of the above seem normal, finally inspect deployed dashboards by
running this command:

```bash
kubectl get grafanadashboards -A
NAMESPACE NAME AGE
grafana-operator apiserver-advanced-grafanadashboard 18d
grafana-operator apiserver-basic-grafanadashboard 18d
grafana-operator apiserver-troubleshooting-grafanadashboard 18d
grafana-operator cluster-grafanadashboard 10d
grafana-operator grafana-dashboards-adothealth 18d
grafana-operator grafana-dashboards-kubeproxy 10d
grafana-operator java-grafanadashboard 18d
grafana-operator kubelet-grafanadashboard 10d
grafana-operator namespace-workloads-grafanadashboard 10d
grafana-operator nginx-grafanadashboard 18d
grafana-operator node-exporter-grafanadashboard 10d
grafana-operator nodes-grafanadashboard 10d
grafana-operator workloads-grafanadashboard 10d
```

- You can dive into the details of a dashboard by running:

```bash
kubectl describe grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
```

- Depending on the error, you can delete the dashboard object. In this case,
you don't need to re-run Terraform as the Flux Kustomization will force its
recreation through the Grafana operator

```bash
kubectl describe grafanadashboards grafana-dashboards-kubeproxy -n grafana-operator
```

If you believe there is a bug, please open an issue [here](https://github.com/aws-observability/terraform-aws-observability-accelerator/issues).

## Upgrade from to v2.5 or earlier

v2.5.0 removes the dependency to the Terraform Grafana provider in the EKS
monitoring module. As Grafana Operator manages and syncs the Grafana contents,
Terraform is not required anymore in this context.

However, if you migrate from earlier versions, you might leave some data
orphan as the Grafana provider is dropped.
Terraform will throw an error. We have released v2.5.0-rc.1 which removes all
the Grafana resources provisioned by Terraform in the EKS context,
without removing the provider configurations.

- Step 1: migrate to v2.5.0-rc.1 and run apply
- Step 2: migrate to v2.5.0 or above
17 changes: 10 additions & 7 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -26,15 +26,18 @@ nav:
- Home: index.md
- Concepts: concepts.md
- Amazon EKS:
- Infrastructure monitoring: eks/index.md
- EKS API server monitoring: eks/eks-apiserver.md
- Multicluster monitoring: eks/multicluster.md
- Cross Account monitoring: eks/multiaccount.md
- Java/JMX: eks/java.md
- Nginx: eks/nginx.md
- Istio: eks/istio.md
- Infrastructure: eks/index.md
- EKS API server: eks/eks-apiserver.md
- Multicluster:
- Single AWS account: eks/multicluster.md
- Cross AWS account: eks/multiaccount.md
- Viewing logs: eks/logs.md
- Tracing: eks/tracing.md
- Patterns:
- Java/JMX: eks/java.md
- Nginx: eks/nginx.md
- Istio: eks/istio.md
- Troubleshooting: eks/troubleshooting.md
- Teardown: eks/destroy.md
- AWS Distro for OpenTelemetry (ADOT):
- Monitoring ADOT collector health: adothealth/index.md
Expand Down

0 comments on commit f65afff

Please sign in to comment.