Adding troubleshooting section for Elastic Agent on Kubernetes and Kustomize #1409

Merged
merged 31 commits into from
Nov 4, 2024
65c53ba
troubleshooting section added for kustomize
eedugon Oct 2, 2024
0c61ca1
Merge remote-tracking branch 'origin/main' into eedugon/kustomize_tro…
eedugon Oct 24, 2024
9622403
elastic agent on k8s troubleshooting added
eedugon Oct 25, 2024
9ac0649
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
a62bfde
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
f6fdd3a
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
2bc1adb
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
c293922
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
8059eca
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
c675652
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
7e12148
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
fae790b
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
9397d6a
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
c18bcdf
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
81d75f9
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
38a625f
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
bf4c639
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
0b3c809
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
13fb317
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
2c5cde2
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
88a1c5a
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
58be980
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
e51ef1c
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
c055840
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
e3a5201
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
d46f56b
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 28, 2024
1d92c39
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 29, 2024
920207e
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 29, 2024
cfa37cb
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Oct 29, 2024
99ff11f
Update docs/en/ingest-management/troubleshooting/troubleshooting.asci…
eedugon Nov 4, 2024
7d7f88f
ksm and agent namespaces and leadership link
eedugon Nov 4, 2024
127 changes: 127 additions & 0 deletions docs/en/ingest-management/troubleshooting/troubleshooting.asciidoc
@@ -60,6 +60,7 @@ Find troubleshooting information for {fleet}, {fleet-server}, and {agent} in the
* <<fleet-server-integration-removed>>
* <<agent-oom-k8s>>
* <<agent-sudo-error>>
* <<agent-kubernetes-kustomize>>


[discrete]
@@ -830,3 +831,129 @@ Error: error loading agent config: error loading raw config: fail to read config

To resolve this, either install {agent} without the `--unprivileged` flag so that it has administrative access, or run the {agent} commands without the `sudo` prefix.

[discrete]
[[agent-kubernetes-kustomize]]
== Troubleshoot {agent} installation on Kubernetes, with Kustomize

Potential issues during {agent} installation on Kubernetes can be categorized into two main areas:

. <<agent-kustomize-manifest>>.
. <<agent-kustomize-after>>.

[discrete]
[[agent-kustomize-manifest]]
=== Problems related to the creation of objects within the manifest

When troubleshooting installations performed with https://github.com/kubernetes-sigs/kustomize[Kustomize], it's good practice to inspect the rendered manifest. To do this, take the installation command provided by the {kib} onboarding flow and replace the final part, `| kubectl apply -f-`, with a redirection to a local file. You can then analyze the rendered output without applying it to the cluster.

For example, the following command, originally provided by {kib} for an {agent} Standalone installation, has been modified to redirect the output for troubleshooting purposes:

[source,sh]
----
kubectl kustomize https://github.com/elastic/elastic-agent/deploy/kubernetes/elastic-agent-kustomize/default/elastic-agent-standalone\?ref\=v8.15.3 | sed -e 's/JUFQSV9LRVkl/ZDAyNnZaSUJ3eWIwSUlCT0duRGs6Q1JfYmJoVFRUQktoN2dXTkd0FNMtdw==/g' -e "s/%ES_HOST%/https:\/\/7a912e8674a34086eacd0e3d615e6048.us-west2.gcp.elastic-cloud.com:443/g" -e "s/%ONBOARDING_ID%/db687358-2c1f-4ec9-86e0-8f1baa4912ed/g" -e "s/\(docker.elastic.co\/beats\/elastic-agent:\).*$/\18.15.3/g" -e "/{CA_TRUSTED}/c\ " > elastic_agent_installation_complete_manifest.yaml
----

The previous command generates a local file named `elastic_agent_installation_complete_manifest.yaml`, which you can use for further analysis. It contains the complete set of resources required for the {agent} installation, including:

* RBAC objects (`ServiceAccounts`, `Roles`, etc.)

* `ConfigMaps` and `Secrets` for {agent} configuration

* {agent} Standalone deployed as a `DaemonSet`

* https://github.com/kubernetes/kube-state-metrics[Kube-state-metrics] deployed as a `Deployment`

The content of this file is equivalent to what you'd obtain by following the <<running-on-kubernetes-standalone>> steps, with the exception that `kube-state-metrics` is not included in the standalone method.
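
As a quick sanity check of the rendered file, you can count the resources it contains by kind. The following is a minimal sketch, not part of the official tooling: `count_manifest_kinds` is a hypothetical helper name, and it assumes that every top-level `kind:` key in the rendered manifest starts at the beginning of a line, with nested ones indented.

```shell
# Hypothetical helper: count top-level resource kinds in a rendered manifest.
# Assumption: top-level "kind:" keys start at column 0, while nested ones
# (for example under roleRef or subjects) are indented and are not counted.
count_manifest_kinds() {
  awk '/^kind:/ { print $2 }' "$1" | LC_ALL=C sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Example usage (file name taken from the command above):
# count_manifest_kinds elastic_agent_installation_complete_manifest.yaml
```

Each output line has the form `<Kind>=<count>`, which makes it easy to confirm that the `DaemonSet`, the `Deployment`, and the RBAC objects listed above are all present.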

Possible issues:

* If your user doesn't have *cluster-admin* privileges, the creation of the RBAC resources might fail.

* Some Kubernetes security mechanisms (like https://kubernetes.io/docs/concepts/security/pod-security-standards/[Pod Security Standards]) could cause part of the manifest to be rejected, as `hostNetwork` access and `hostPath` volumes are required.

* If you already have an installation of `kube-state-metrics`, it could cause part of the manifest installation to fail or to update your existing resources without notice.
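
To see which parts of the rendered manifest are likely to be affected by Pod Security Standards, you can search it for the host-level settings mentioned above. This is only a sketch: `find_host_access` is a hypothetical helper name, and the check is a plain text search rather than a real policy evaluation.

```shell
# Hypothetical helper: print the line numbers and lines of a rendered manifest
# that enable hostNetwork or declare hostPath volumes.
# Assumption: a text match is enough for a first look; this does not evaluate
# any admission policy.
find_host_access() {
  grep -nE 'hostNetwork:[[:space:]]*true|hostPath:' "$1"
}

# Example usage (file name taken from the command above):
# find_host_access elastic_agent_installation_complete_manifest.yaml
```

If the search returns matches and the target namespace enforces a restrictive policy, expect the corresponding workload to be rejected.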

[discrete]
[[agent-kustomize-after]]
=== Failures occurring within specific components after installation

If the installation is correct and all resources are deployed, but data is not flowing as expected (for example, you don't see any data on the *[Metrics Kubernetes] Cluster Overview* dashboard), check the following items:

. Check the status of the resources and ensure that all Pods are in a `Running` state:
+
[source,sh]
----
kubectl get pods -n kube-system | grep elastic
kubectl get pods -n kube-system | grep kube-state-metrics
----

. Describe the Pods if they are in a `Pending` state:
+
[source,sh]
----
kubectl describe pod -n kube-system <name_of_elastic_agent_pod>
----

. Check the logs of the {agent} and `kube-state-metrics` Pods, and look for errors:
+
[source,sh]
----
kubectl logs -n kube-system <name_of_elastic_agent_pod>
kubectl logs -n kube-system <name_of_elastic_agent_pod> | grep -i error
----
+
[source,sh]
----
kubectl logs -n kube-system <name_of_kube-state-metrics_pod>
----
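
The status check in the first step can be narrowed down to the Pods that need attention. The following is a minimal sketch: `not_running_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column layout, where the third column is the Pod status.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the name and status of every Pod that is not Running.
# Assumption: default column layout (NAME READY STATUS RESTARTS AGE).
not_running_pods() {
  awk '$3 != "Running" { print $1, $3 }'
}

# Example usage:
# kubectl get pods -n kube-system --no-headers \
#   | grep -E 'elastic|kube-state-metrics' \
#   | not_running_pods
```

An empty result means all matching Pods report `Running`. Any Pod that is printed is a candidate for the `kubectl describe` and `kubectl logs` steps.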

Possible issues:

* Connectivity, authorization, or authentication issues when connecting to {es}:
+
Ensure that the API key and the {es} destination endpoint used during the installation are correct, and that the endpoint is reachable from within the Pods.
+
In an already installed system, the API Key is stored in a `Secret` named `elastic-agent-creds-<hash>`, and the endpoint is configured in the `ConfigMap` `elastic-agent-configs-<hash>`.

* Only missing cluster-level metrics (provided by `kube-state-metrics`):
+
These metrics (`state_*`) are retrieved by one of the Pods acting as `leader` (as described in <<running-on-kubernetes-standalone>>). To troubleshoot this situation:
+
. Determine which Pod owns the leadership `lease` in the cluster, with:
+
[source,sh]
----
kubectl get lease -n kube-system elastic-agent-cluster-leader
----
+
. Check the logs of that Pod to see if there are errors when connecting to `kube-state-metrics` and if the `state_*` metrics are being sent.
+
One way to check if `state_*` metrics are being delivered to {es} is to inspect log lines that contain the `"Non-zero metrics in the last 30s"` message and check the values of the `state_*` metricsets within those lines, with something like:
+
[source,sh]
----
kubectl logs -n kube-system elastic-agent-xxxx | grep "Non-zero metrics" | grep "state_"
----
+
If the previous command returns `"state_pod":{"events":213,"success":213}` or similar for all `state_*` metricsets, it means the metrics are being delivered.
+
. As a last resort, if you believe none of the Pods is acting as a leader, you can try deleting the `lease` to generate a new one:
+
[source,sh]
----
kubectl delete lease -n kube-system elastic-agent-cluster-leader
# wait a few seconds and check for the lease again
kubectl get lease -n kube-system elastic-agent-cluster-leader
----

* Performance problems:
+
Monitor the CPU and memory usage of the {agent} Pods and adjust the requests and limits in the manifest as needed. Refer to <<scaling-on-kubernetes>> for more details about the required resources.
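
The `"Non-zero metrics"` log lines used in the `state_*` delivery check are long, so it can help to print only the `state_*` entries. This is a sketch under assumptions: `state_metricsets` is a hypothetical helper name, and it assumes that each metricset appears as a flat JSON object, as in the `"state_pod":{"events":213,"success":213}` fragment shown earlier.

```shell
# Hypothetical helper: read log lines on stdin and print each state_*
# metricset entry on its own line.
# Assumption: each entry is a flat JSON object such as
# "state_pod":{"events":213,"success":213}, with no nested braces.
state_metricsets() {
  grep -oE '"state_[a-z0-9_]+":\{[^}]*\}'
}

# Example usage (replace elastic-agent-xxxx with the leader Pod name):
# kubectl logs -n kube-system elastic-agent-xxxx \
#   | grep "Non-zero metrics" \
#   | state_metricsets
```

If a `state_*` metricset never appears in the output, or its entry lacks a `success` count, the leader Pod is a good place to look for connection errors to `kube-state-metrics`.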

Additional resources for troubleshooting {agent} on Kubernetes:

* <<agent-oom-k8s>>.

* https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes/elastic-agent-kustomize/default[{agent} Kustomize Templates] documentation and resources.

* Other examples and manifests to deploy https://github.com/elastic/elastic-agent/tree/main/deploy/kubernetes[{agent} on Kubernetes].