
Namespace issue in K8 node deployment #4

Open
blackramit opened this issue Sep 2, 2021 · 10 comments

Comments

blackramit commented Sep 2, 2021

Hey Pega88 (Niels), thanks so much for all the work you did on this deployment manifest. Awesome work! I ran into what I believe is a race condition with the chainlink namespace getting created on the K8 cluster. Did you run into this issue, and if so, did you ever find a workaround?

google_container_cluster.gke-cluster: Creation complete after 5m27s [id=projects/chainlink-test-324713/locations/us-central1-c/clusters/chainlink-cluster]
kubernetes_namespace.chainlink: Creating...
kubernetes_secret.password-credentials: Creating...
kubernetes_service.chainlink_service: Creating...
kubernetes_config_map.chainlink-env: Creating...
kubernetes_secret.api-credentials: Creating...
kubernetes_config_map.postgres: Creating...
kubernetes_service.postgres: Creating...
kubernetes_deployment.chainlink-node: Creating...
kubernetes_stateful_set.postgres: Creating...
kubernetes_namespace.chainlink: Creation complete after 0s [id=chainlink]

│ Error: namespaces "chainlink" not found

│ with kubernetes_config_map.chainlink-env,
│ on chainlink-node.tf line 28, in resource "kubernetes_config_map" "chainlink-env":
│ 28: resource "kubernetes_config_map" "chainlink-env" {
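For context, the log above shows the namespaced resources being created in the same apply as the namespace itself, which looks like an ordering/consistency race. A minimal sketch of one common way to remove it, assuming the config map currently sets namespace = "chainlink" as a literal string: reference the kubernetes_namespace resource instead, which gives Terraform an implicit dependency (an explicit depends_on works as well).

resource "kubernetes_config_map" "chainlink-env" {
  metadata {
    name = "chainlink-env"
    # Referencing the namespace resource rather than the literal "chainlink"
    # makes Terraform create the namespace before this config map.
    namespace = kubernetes_namespace.chainlink.metadata[0].name
  }

  # ... existing node environment variables ...
}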

blackramit (Author) commented Sep 3, 2021

For Pega88 (Niels) and any others who venture here: I was able to get things working with a couple of tweaks:

  • I added timeouts to gke.tf similar to below, since I was running over the 10-minute limit in GCP. I did this for both the initial "gke-cluster" build and the "main-node" node pool build (a sketch of the node-pool variant follows this list).
resource "google_container_cluster" "gke-cluster" {
  name     = var.cluster_name
  location = var.gcp_zone

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true 
  initial_node_count       = 3

  enable_legacy_abac = false

  depends_on = [
    google_project_service.container_api
  ]
  #BRIT: Added timeouts for initial cluster/node builds.
  timeouts {
    create = "30m"
    update = "20m"
  }
}
  • I also used v0.9.10 of the chainlink image in chainlink-node.tf. I believe v0.10 does something funky where it deploys v0.9.10 and then upgrades, but I may test that at some point.
spec {
  container {
    image = "smartcontract/chainlink:0.9.10" # BRIT: Updated to the highest v9 chainlink release; v10 seems to want to upgrade from here.
  • I used postgres v13.3 in postgres.tf rather than v9.6.17; being that old, it may have been causing an issue between postgres and the namespace. Not sure:
    image = "postgres:13.3" #BRIT: Updated to highest current release from 9.6.17
  • Not sure what knocked what loose, but Terraform applied & completed.
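For reference, here is roughly what the same timeouts block looks like on the separately managed node pool from the first bullet. This is a sketch only; the "main-node" resource name and the variables are assumptions about the repo's gke.tf and may not match it exactly.

resource "google_container_node_pool" "main-node" {
  name       = "main-node"
  location   = var.gcp_zone
  cluster    = google_container_cluster.gke-cluster.name
  node_count = 3

  # Same idea as on the cluster resource: give GCP more than the default
  # window to finish creating or updating the node pool.
  timeouts {
    create = "30m"
    update = "20m"
  }
}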

blackramit (Author) commented

I noticed something about GCP/GKE that could be what was going on above. Even after deleting a project, the platform seems to hold onto the namespace, but what it has won't be pointing to the new project you are working with. I believe this may be the way to fix that. Notice it stays in a Terminating state for quite a while:

devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl get namespace
NAME              STATUS   AGE
chainlink         Active   3h12m
default           Active   3h16m
kube-node-lease   Active   3h16m
kube-public       Active   3h16m
kube-system       Active   3h16m
devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl delete namespace chainlink
namespace "chainlink" deleted 
devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl get namespace
NAME              STATUS        AGE
chainlink         Terminating   3h31m
default           Active        3h35m
kube-node-lease   Active        3h35m
kube-public       Active        3h35m
kube-system       Active        3h35m

Pega88 (Owner) commented Sep 5, 2021

Thanks for flagging, I'll take some time in the week to update the entire setup

blackramit (Author) commented

> Thanks for flagging, I'll take some time in the week to update the entire setup

Hey Niels, I had to add a bunch of dependencies (depends_on=) to get the various build segments to run in the right order. I'll submit the code after I run it a few times to verify it. I now have three solid nodes up on GKE running v0.9.10 of the chainlink code.
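For illustration, a sketch of the kind of explicit ordering this implies; the exact resources that needed a depends_on aren't listed in this thread, so the names below are assumptions:

resource "kubernetes_deployment" "chainlink-node" {
  # Hypothetical example: wait for the namespace and the node pool to exist
  # before creating anything that schedules pods onto them.
  depends_on = [
    kubernetes_namespace.chainlink,
    google_container_node_pool.main-node,
  ]

  # ... existing metadata and spec ...
}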

Pega88 (Owner) commented Sep 10, 2021

Can you have a look at #6 to see if this helps? I still need to update the CL image and add the timeouts; haven't tried that yet. Feel free to PR, though!

Pega88 (Owner) commented Sep 10, 2021

> I noticed something about GCP/GKE that could be what was going on above. Even after deleting a project, the platform seems to hold onto the namespace, but what it has won't be pointing to the new project you are working with. I believe this may be the way to fix that. Notice it stays in a Terminating state for quite a while:
>
> devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl get namespace
> NAME              STATUS   AGE
> chainlink         Active   3h12m
> default           Active   3h16m
> kube-node-lease   Active   3h16m
> kube-public       Active   3h16m
> kube-system       Active   3h16m
> devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl delete namespace chainlink
> namespace "chainlink" deleted
> devadmin@ThunderCloud:/mnt/e/Development/chainlink-gcp$ kubectl get namespace
> NAME              STATUS        AGE
> chainlink         Terminating   3h31m
> default           Active        3h35m
> kube-node-lease   Active        3h35m
> kube-public       Active        3h35m
> kube-system       Active        3h35m

This is your local ~/.kube/config, which is not deleted when you delete the Google Cloud environment, so your local tooling still thinks it's there (link here). That said, it's weird that it successfully deletes a namespace of a cluster that should not be reachable anymore.
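One way to avoid relying on a possibly stale local ~/.kube/config at all is to point the Terraform kubernetes provider directly at the cluster it just created. A sketch of that common pattern, assuming the repo isn't already wired this way:

data "google_client_config" "default" {}

provider "kubernetes" {
  # Derive credentials from the GKE cluster resource itself rather than the
  # local kubeconfig, so Terraform always talks to the cluster it created.
  host                   = "https://${google_container_cluster.gke-cluster.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(google_container_cluster.gke-cluster.master_auth[0].cluster_ca_certificate)
}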

Pega88 (Owner) commented Sep 10, 2021

> Can you have a look at #6 to see if this helps? I still need to update the CL image and add the timeouts; haven't tried that yet. Feel free to PR, though!

updated CL version as well with your snippet - haven't had time to fully run it yet. LMK if it works for you?

blackramit (Author) commented Sep 13, 2021 via email

blackramit (Author) commented Sep 13, 2021 via email

blackramit (Author) commented Sep 13, 2021 via email
