native service delete errors for old allocs after client restart #24461

mr-karan · 2024-11-14T03:02:28Z

Nomad version

1.7.7

Operating system and Environment details

Running in an AWS environment
Ubuntu 24.04

Issue

Service registration errors and task failures occurring during node registration.

Reproduction steps

Node starts registration process
Multiple service registration deletion attempts fail
Template rendering issues occur for HAProxy peer service
Sibling task failures cascade to other services

Expected Result

Clean node registration
Successful service registration management
Proper template rendering for HAProxy peer service
Successful task execution without cascading failures

Actual Result

Multiple cascading failures observed:

Service registration errors:

[ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: service registration not found"

Template failures:

Missing: nomad.service(haproxy-peer)

Task failures:

Setup Failure: failed to setup alloc: pre-run hook "group_services" failed: no servers

Forced termination:

Exit Code: 0, Exit Message: "executor: error waiting on process: rpc error: code = Canceled desc = grpc: the client connection is closing"

Nomad Client logs

Nov 14 08:11:51 [INFO]  agent: (runner) starting
Nov 14 08:11:51 [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=172.31.2.217:4647
Nov 14 08:11:51 [INFO]  client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-d90c47ce-f4be-0fa3-e019-5d1b522e64a1-group-haproxy-default-haproxy-peer-haproxy-peer-net namespace=kite
Nov 14 08:12:01 [INFO]  client: node registration complete

Nomad Alloc Events Timeline

Nov 14, '24 08:10:36 - Terminated (Exit Code: 0)
Nov 14, '24 08:10:35 - Killing (Sent interrupt, 5s grace period)
Nov 14, '24 08:10:33 - Template Missing: nomad.service(haproxy-peer)
Nov 14, '24 08:10:30 - Sibling Task Failed (prepare-logging-setup)
Nov 14, '24 08:10:30 - Setup Failure (group_services hook failed)
Nov 14, '24 07:31:19 - Started

The primary issue appears to be related to service registration and template rendering failures, particularly affecting HAProxy peer services. This is causing cascading failures across dependent services and tasks.

tgross · 2024-11-14T13:41:55Z

@mr-karan if the node hasn't registered yet, how is it running services? Is this a node that was running services and then restarted?

tgross · 2024-12-11T21:14:56Z

@mr-karan we haven't heard back on this one in a while. It looks like you've got some running jobs and then you're rebooting the client agent, and then the allocations fail to restore? Are you rebooting the node? I tried reproducing but there really isn't enough to go on. I'm going to close this out for now as unreproducible, but if you have more info I'd be happy to reopen.

tgross · 2024-12-12T21:47:09Z

Reproduced!

jobspec

job "httpd" {

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    service {
      name     = "httpd-web"
      provider = "nomad"
      port     = "www"
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      identity {
        env  = true
        file = true
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

Running on a cluster with a single client and single server, I was running that job for a while and updating it frequently while debugging other work. Then I saw the following logs on the client after running systemctl restart nomad:

2024-12-12T16:38:44.311-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.311-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.311-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-c1c2f114-5f48-bc51-3575-95764f088e89-group-web-httpd-web-www namespace=default
2024-12-12T16:38:44.339-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.340-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.340-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1bec7c14-86a2-70e6-c018-654a05d02cc4-group-web-example-web-www namespace=default
2024-12-12T16:38:44.423-0500 [ERROR] client.rpc: error performing RPC to server: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.423-0500 [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: service registration not found" rpc=ServiceRegistration.DeleteByID server=10.37.105.3:4647
2024-12-12T16:38:44.424-0500 [INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1e098779-8d6e-8840-372f-178d44ca90c6-group-web-httpd-web-www namespace=default

The allocation IDs here are all for the service registration of the old allocations, not the ones that exist currently. So the errors we get from the server make sense -- these should all be gone already. But why the client still thinks it has to delete them I don't know yet. Seems like a chunk of data is getting left behind in the client state store.
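
To make the "old allocations" observation easier to check on another cluster, here is a minimal sketch that pulls the allocation ID out of each service_id in the client log and reports the ones that are not in a set of currently live allocation IDs. It assumes the service ID layout seen in the logs above (_nomad-task-<alloc ID>-group-...). The function name stale_alloc_ids and the live set are hypothetical; how the live IDs are collected is left to the reader. It is not how the Nomad client itself tracks registrations.

```python
import re

# Service IDs in the logs above look like:
#   _nomad-task-<alloc UUID>-group-<group>-<service>-<port label>
# Only the UUID is extracted; the rest is not split, because group and
# service names can themselves contain hyphens.
SERVICE_ID_RE = re.compile(
    r"service_id=_nomad-task-"
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})-"
)


def stale_alloc_ids(log_lines, live_alloc_ids):
    """Return alloc IDs named in delete attempts that are not live, in first-seen order."""
    stale = []
    for line in log_lines:
        match = SERVICE_ID_RE.search(line)
        if match is None:
            continue  # not a service registration log line
        alloc_id = match.group(1)
        if alloc_id not in live_alloc_ids and alloc_id not in stale:
            stale.append(alloc_id)
    return stale


# The three INFO lines from the client log above, shortened to the relevant part.
logs = [
    "[INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-c1c2f114-5f48-bc51-3575-95764f088e89-group-web-httpd-web-www namespace=default",
    "[INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1bec7c14-86a2-70e6-c018-654a05d02cc4-group-web-example-web-www namespace=default",
    "[INFO] client.service_registration.nomad: attempted to delete non-existent service registration: service_id=_nomad-task-1e098779-8d6e-8840-372f-178d44ca90c6-group-web-httpd-web-www namespace=default",
]

# Hypothetical placeholder for the IDs of allocations that are actually running.
live = {"00000000-0000-0000-0000-000000000000"}

for alloc_id in stale_alloc_ids(logs, live):
    print(alloc_id)
```

If every ID it prints belongs to an allocation that was already replaced, that matches the behavior described here: the server is right to report the registrations as missing, and the open question is why the client still tries to delete them.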

Reopening and marking for roadmapping.

mr-karan added the type/bug label Nov 14, 2024

tgross added this to Nomad - Community Issues Triage Nov 14, 2024

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Nov 14, 2024

tgross added stage/waiting-reply theme/service-discovery labels Nov 14, 2024

tgross self-assigned this Nov 14, 2024

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Nov 14, 2024

tgross closed this as completed Dec 11, 2024

github-project-automation bot moved this from Triaging to Done in Nomad - Community Issues Triage Dec 11, 2024

tgross closed this as not planned Dec 11, 2024

tgross reopened this Dec 12, 2024

github-project-automation bot moved this from Done to Needs Triage in Nomad - Community Issues Triage Dec 12, 2024

tgross added stage/accepted and removed stage/waiting-reply labels Dec 12, 2024

tgross changed the title ~~Service Registration Failures During Node Registration~~ native service delete errors for old allocs after client restart Dec 12, 2024

tgross added the hcc/jira label Dec 12, 2024

tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Dec 12, 2024

tgross removed their assignment Dec 12, 2024

mismithhisler self-assigned this Dec 16, 2024
