
ElasticKonductor

ElasticKonductor was built to quickly and easily deploy a K8s cluster (AWS EKS, Azure AKS, or GCP GKE), install ECK, and deploy the Elastic Stack.

Total time from configuration to a fully launched ECK cluster should generally be less than 10 minutes. The ElasticKonductor automation is idempotent.

Note - The automation deploys OpenEBS, which exposes and uses locally attached storage. For more info, see the OpenEBS section.

ElasticKonductor currently deploys

  • EKS, AKS, GKE
  • ECK (Optional)
    • Elasticsearch
    • Kibana
    • License loading (Bring your own ES license)
  • OpenTelemetry Demo
  • Enterprise Search
  • Elasticsearch Service (ESS)
  • Fleet Server
  • Elastic Agent (APM, K8s Integration)
  • Istio


Prerequisites

Quick Start: details regarding the clients required to run the automation are outlined in the following sections. However, if an Ubuntu 20+ instance is available, a quick-start client install script that installs all the required libraries and CLIs for the automation is available here:

https://github.com/sunileman/ElasticKonductor/blob/main/scripts/ElasticKonductor-client-install.sh
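For example, one way to fetch and run it on a fresh Ubuntu host (assuming curl is available; the raw.githubusercontent.com URL mirrors the blob link above):

  # Download and run the client install script
  curl -fsSL -o ElasticKonductor-client-install.sh \
    https://raw.githubusercontent.com/sunileman/ElasticKonductor/main/scripts/ElasticKonductor-client-install.sh
  chmod +x ElasticKonductor-client-install.sh
  ./ElasticKonductor-client-install.sh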

General

AWS

GIT

  • Not required. If you are interested in pulling the repo via SSH to keep up with updates applied to this automation, see https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent

Azure

Deployment


Note: The automation requires the tags.project variable to be set in ./[aws|azure]/terraform.tfvars. Set the variable to your username, with no special characters. Keep it short.

Required Arguments

-c [aws|azure|gcp|ess]

-b [all|k8s|eck|otel]

-d Destroy all assets created by the automation

-de Destroy ECK

-do Destroy Open Telemetry

-i Get cluster info

-int Get cluster infra info

--openebs Enable OpenEBS

Note: the -b eck option assumes K8s has already been deployed

Examples


To run the automation in the background (output will be written to nohup.out):

  nohup ./elastickonductor.sh -b all -c [aws|gcp|azure] &

Once the automation completes, Kibana endpoints along with username and password should be displayed.

The automation will set your local kube config for kubectl. Verify by running:

  kubectl get nodes

To tear down all assets built by the automation:

  ./elastickonductor.sh -d -c [aws|gcp|azure]

To retrieve the Kibana/ES URL and username & password:

  ./elastickonductor.sh -c [aws|gcp|azure] -i

Enterprise Search

By default, Enterprise Search is deployed with Konductor. To deploy additional Enterprise Search pods or K8s nodes, set the following variables in terraform.tfvars:

entsearch_pod_count= 1
entsearch_instance_count= 1

To remove Enterprise Search, simply set both variables to 0.

Elasticsearch Service (ESS)

To deploy ESS, set variables in terraform.tfvars under the ess folder. Additionally, set your ESS API key via an environment variable:

  export EC_API_KEY="your ess api key"

To deploy ESS:

  ./elastickonductor.sh -c ess -b all

The ESS username/password will be displayed via stdout once deployment completes. To retrieve the username/password later:

  ./elastickonductor.sh -c ess -i

To destroy ESS:

  ./elastickonductor.sh -c ess -d
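Putting the ESS workflow together, a minimal session might look like this (the API key value is a placeholder):

  export EC_API_KEY="your ess api key"   # placeholder
  ./elastickonductor.sh -c ess -b all    # deploy; credentials print to stdout
  ./elastickonductor.sh -c ess -i        # re-display username/password later
  ./elastickonductor.sh -c ess -d        # tear down when finished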

Open Telemetry

The automation can also launch the Open Telemetry Demo found here: https://opentelemetry.io/docs/demo/kubernetes-deployment/

Use -b otel during the launch process

The demo can send data to Elasticsearch by setting the following environment variables:

  export TF_VAR_es_apm_url="https://xxxxxx.elastic-cloud.com:443"
  export TF_VAR_es_apm_token="xxxxxx"
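With those variables exported, a launch might look like this (aws chosen arbitrarily; any supported cloud works):

  ./elastickonductor.sh -c aws -b otel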

To destroy/tear down the Open Telemetry Demo:

  ./elastickonductor.sh -c azure -do

Open Telemetry UI

Once the Otel demo has been deployed, the UI will be available on port 8080 via a cloud-native load balancer.

To retrieve the load balancer IP, run:

  kubectl get service open-telemetry-frontendproxy

That will return something similar to this:

  NAME                           TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)          AGE
  open-telemetry-frontendproxy   LoadBalancer   10.52.5.6    30.300.30.300   8080:30658/TCP   32m

Using the external IP (the example above returned 30.300.30.300), the Otel demo UI will be available at 30.300.30.300:8080. Replace 30.300.30.300 with your external IP.
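As a convenience, the external IP can also be pulled directly with a jsonpath query (standard kubectl; on some clouds the address is exposed as a hostname rather than an ip):

  kubectl get service open-telemetry-frontendproxy \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'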

GCP Open Telemetry Demo

Example terraform.tfvars to run Open Telemetry Demo on GCP

tags = {
    "division" = "field"
    "org" = "sa"
    "team" = "amer"
    "project" = "superman"
}


region= "us-central1"
zones= ["us-central1-a", "us-central1-b", "us-central1-c"]

gcp_project="your-gcp-project"

otel_instance_count= 1

master_initial_node_count_per_zone=0
hot_initial_node_count_per_zone=0
warm_initial_node_count_per_zone=0
cold_initial_node_count_per_zone=0
frozen_instance_count_per_zone=0
ml_initial_node_count_per_zone=0
entsearch_instance_count = 0
util_instance_count=0
kibana_instance_count=0
fleet_instance_count=0


es_apm_url= "your-es-apm.elastic-cloud.com:443" #without https:// prefix
es_apm_token="your es apm token" #your Elastic APM secret token


Azure Open Telemetry Demo

tags = {
    "division" = "field"
    "org" = "sa"
    "team" = "amer"
    "project" = "superman" # Project name (shared) or username (individual)
}

resource_group_location="eastus"


otel_instance_count= 1

master_instance_count=0
hot_instance_count=0
kibana_instance_count=0
warm_instance_count=0
cold_instance_count=0
frozen_instance_count=0
ml_instance_count=0
entsearch_instance_count=0



es_apm_url= "your-es-apm.elastic-cloud.com:443" #without https:// prefix
es_apm_token="your es apm token" #your Elastic APM secret token

Autoscaling


Terraform Variables

Any variable found in ./[aws|azure]/variables.tf can be set as an environment variable:

  export TF_VAR_<variable_name>="your value"
  export TF_VAR_<list_variable_name>='["value", "value"]'

Environment variables and variables set in terraform.tfvars can be used together. Consider Terraform's variable definition precedence: https://developer.hashicorp.com/terraform/language/values/variables#variable-definition-precedence
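For example, per the precedence rules linked above, a terraform.tfvars entry overrides a TF_VAR_ environment variable:

  export TF_VAR_hot_instance_count=5   # environment variable (lower precedence)

  # terraform.tfvars:
  # hot_instance_count = 10            # this value wins if both are set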

Note: refrain from altering the region and/or zone in conjunction with the -d option. Doing so prompts the automation to attempt to destroy assets in the modified region that may not be present there. First use the -d option to dismantle assets, and only then modify the region/zone.

Note - Instance and pod count
Instance count = number of K8s nodes per type (hot/warm/etc.)
Pod count = number of pods per type (hot/warm/etc.) deployed on that instance type.

For example, you can deploy 4 hot pods on 1 hot K8s instance.
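In terraform.tfvars that example would look roughly like this (hot_pod_count is assumed to follow the <type>_pod_count pattern shown in the Pod Configuration section):

hot_instance_count = 1   # one hot K8s node
hot_pod_count = 4        # four hot ES pods scheduled onto that node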

Variables - Default Values

The ./[aws|azure]/variables.tf file hosts all variables supported by this deployment.
If a variable is not present in ./[aws|azure]/terraform.tfvars, its default value is taken from ./[aws|azure]/variables.tf. If a default value is not acceptable, set the variable in ./[aws|azure|gcp]/terraform.tfvars.

For example, the default number of ES hot instances is 3; that default is set in ./[aws|azure]/variables.tf. To change it to 10, set hot_instance_count = 10 in the ./[aws|azure]/terraform.tfvars file.

Another example: the default instance type for warm is im4gn.4xlarge. To change it to im4gn.8xlarge, set warm_instance_type = "im4gn.8xlarge" in the ./[aws|azure]/terraform.tfvars file.

A word of caution on instance types for node groups (hot, warm, etc.): when selecting an instance type, verify the AMI type (AL2_x86_64, AL2_x86_64_GPU, AL2_ARM_64). The instance type must fall within a specific AMI type.

Instance types: this deployment only supports DAS, not EBS.

Whether each node type accepts ingest is set by the following variables: [master|hot|warm|cold|frozen|ml]_accept_ingest=false

GCP - to skip node pool creation: [master|hot|warm|cold|frozen|ml]_create_node_pool=false
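For example, a terraform.tfvars sketch using the ml variants of the patterns above:

ml_accept_ingest = false      # ml pods will not accept ingest
ml_create_node_pool = false   # GCP only: skip creating the ml node pool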

Apply ES License

Deployment without a license will use a Basic license configuration.
If an ES license is available to you, place it under ./[aws|azure|gcp]/create-eck/license and name the license file es-license.json. The deployment will pick up the license file.

If a license file needs to be applied or changed after deployment, simply run ./[aws|azure|gcp]/create-eck/eck-add-license.sh

Pod Configuration

Once the K8s cluster is deployed, ES pods will be deployed using the resources defined in the K8s manifest. Each node type's spec (master, hot, warm, ml, etc.) is defined within ./[aws|azure]/variables.tf. Change the pod spec to your desired configuration.

For example

ml_pod_count=value
ml_pod_cpu=value
ml_pod_memory=value
ml_pod_storage=value
ml_pod_ES_JAVA_OPTS=value

In recent ES releases, if JVM arguments aren't supplied for heap size, half the memory available to the pod will be used for heap. Take this into consideration if the defaults aren't acceptable.
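A sketch of an explicit ml pod spec in terraform.tfvars (the values are illustrative only; the accepted formats and defaults live in variables.tf):

ml_pod_count = 2
ml_pod_cpu = 4
ml_pod_memory = "8Gi"
ml_pod_storage = "100Gi"
ml_pod_ES_JAVA_OPTS = "-Xms4g -Xmx4g"   # pin heap to half the pod memory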

Storage Class

By default, ECK will mount the local storage class to ES pods. Using a different storage class is simple. It is important to note that if network/remote storage will be used instead of local storage, use the -r flag. For example: ./elastickonductor.sh -c gcp -b all -r

Run kubectl get sc to retrieve the available storage classes.

GKE: [master|hot|warm|cold|frozen|ml]_pod_storage_class = "premium-rwo"

AWS: to use gp3 or io2-be, set the following:

hot_pod_storage_class = "hot-gp3"          # valid values: local-storage|hot-gp3|hot-io2-be
hot_pod_storage_class_iops = "3000"
hot_pod_storage_class_throughput = "125"   # in MB

Azure: [master|hot|warm|cold|frozen|ml]_pod_storage_class = "managed-csi-premium"

Kube State Metrics (KSM)

ElasticKonductor installs KSM by default with the service name ksm-kube-state-metrics. Under the kube-state-metrics integration section for Kubernetes, update the service name to ksm-kube-state-metrics.kube-system.svc.cluster.local with port 8080.
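To sanity-check that the service exists at that address (the kube-system namespace follows from the DNS name above):

  kubectl get service ksm-kube-state-metrics -n kube-system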

ECK/ES Updates

The ElasticKonductor automation is idempotent. Therefore, if updates to ECK or ES need to be applied, simply rerun ElasticKonductor with the same -b and -c arguments.

kubectl manifest

The automation will set the local kube config (kubectl) after each run. If the local kube config needs to be reset, simply rerun the automation (even if there is no change).

To reset your local kubectl, run:

  /ElasticKonductor/[gcp|aws|azure]/create-[eks|aks|gke]/setkubectl.sh

OpenEBS / Local Storage

OpenEBS is an automatic local disk provisioner for K8s. There may be scenarios (e.g., using network storage) where the OpenEBS defaults are not useful. OpenEBS is enabled with the --openebs flag; for network/remote storage scenarios, run the automation with the -r flag instead, for example:

./elastickonductor.sh -c gcp -b k8s -r

Troubleshooting

Azure AKS - shell access to nodes: https://learn.microsoft.com/en-us/azure/aks/node-access

GCP GKE - executing shell commands: https://cloud.google.com/migrate/containers/docs/troubleshooting/executing-shell-commands

OOMKilled

The pod's JVM is requesting more memory than the pod's limits allow.

exec plugin: invalid apiversion "client.authentication.k8s.io/v1alpha1"

The K8s server version does not match the kubectl client. Please refer to: https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html

All automation logs are stored in ./logs

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

Try: export KUBE_CONFIG_PATH=~/.kube/config

If one of the pods is stuck in the init stage, run:

kubectl describe pod <podname>

That should give you an idea of what went wrong. Often with OpenEBS the issue is that a block device has not been released. To resolve, run [azure|gcp|aws]/[aks|gke|eks]/addons/openebs/delete_openebs_bds.sh to delete the block devices. However, it is recommended to destroy the cluster and start over, as removing block devices manually has proven to produce unpredictable results.
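To inspect the block devices OpenEBS is tracking before deleting anything (BlockDevice is an OpenEBS CRD; openebs is the usual namespace, though this deployment may differ):

  kubectl get blockdevices -n openebs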

How to get a shell on GKE nodes

gcloud compute ssh <NODE_NAME> --zone <ZONE>

Logs showing markup: use lnav to view logs: https://lnav.org/

Azure Error

Error: building AzureRM Client: Authenticating using the Azure CLI is only supported as a User (not a Service Principal).

Set your Azure credentials prior to launching the automation.

Connection refused when trying to reach the ES API on port 9200: verify that at least 1 pod has a role set that does not include master. The deployed load balancer selects pods based on elasticsearch.k8s.elastic.co/node-master: "false"
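A quick way to confirm, using the label named above:

  kubectl get pods -l 'elasticsearch.k8s.elastic.co/node-master=false'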

Error

│ Error: Failed to query available provider packages
│ 
│ Could not retrieve the list of available versions for provider
│ anschoewe/curl: could not connect to registry.terraform.io: Failed to
│ request discovery document: Get
│ "https://registry.terraform.io/.well-known/terraform.json": read tcp
│ xxxx:54284->xxxx:443: read: connection reset by peer

Rerun the automation. The Terraform registry connection was reset on the remote side.

Error

An error occurred (UnrecognizedClientException) when calling the DescribeCluster operation: The security token included in the request is invalid.

The AWS creds are likely invalid. Verify by running aws sts get-caller-identity

Error

Error: Get "https://oneclickeck-eyawmryu.hcp.eastus.azmk8s.io:443/api/v1/namespaces/kube-system/services/ksm-kube-state-metrics": dial tcp 40.88.241.57:443: connect: connection refused - error from a previous attempt: read tcp 10.0.9.248:46760->40.88.241.57:443: read: connection reset by peer
│ 
│   with helm_release.ksm,
│   on ksm.tf line 1, in resource "helm_release" "ksm":
│    1: resource "helm_release" "ksm" {

Rerun the automation

Otel helm fails to install

│ Warning: Helm release "open-telemetry" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
│ 
│   with helm_release.otel,
│   on otel.tf line 1, in resource "helm_release" "otel":
│    1: resource "helm_release" "otel" {
│ 
╵
╷
│ Error: Get "https://xxxx/api/v1/namespaces/default/services/open-telemetry-loadgenerator": dial tcp xxx:443: connect: connection refused
│ 
│   with helm_release.otel,
│   on otel.tf line 1, in resource "helm_release" "otel":
│    1: resource "helm_release" "otel" {

Set `otel_instance_count` to a value > 0 in terraform.tfvars, as otel pods are created in the otel K8s node group.


Error `Failed to install provider`

╷
│ Error: Failed to install provider
│
│ Error while installing hashicorp/google v4.50.0: releases.hashicorp.com:
│ Get
│ "https://releases.hashicorp.com/terraform-provider-google/4.50.0/terraform-provider-google_4.50.0_linux_amd64.zip":
│ net/http: TLS handshake timeout
╵




Error with kubectl

  kubectl get nodes
  couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused

Run ./elastickonductor.sh -c <gcp|azure|aws> -k to set your local kubectl config

Error with aws destroy or create

An error occurred (AccessDeniedException) when calling the DescribeCluster operation: User: arn:aws:sts::xxx:assumed-role/AmazonSSMRoleForInstancesQuickSetup/xxxxx is not authorized to perform: eks:DescribeCluster on resource: arn:aws:eks:us-east-1:xxxx:cluster/1ClickECK-sunman-tolerant-seagull

Run aws configure; valid AWS credentials are required prior to running Konductor.

About

1Click Automation of Elastic Cloud on Kubernetes
