
Fleet check-in should send policy_id and revision #6446

Open
blakerouse opened this issue Dec 27, 2024 · 2 comments
Labels
bug, Team:Elastic-Agent-Control-Plane

Comments

@blakerouse
Contributor

Overview

Currently, when the Elastic Agent checks in with Fleet Server, it doesn't send the policy_id or revision of the policy that it is running. Fleet Server infers this information from the fact that the Elastic Agent ACK'd the policy change notification, but there are many cases where the two can become out of sync.

VM Snapshot

  1. VM is snapshotted
  2. new policy revision occurs
  3. ACK'd by Elastic Agent (stored new revision in Fleet)
  4. VM is rolled back

The Elastic Agent is now running the old revision of the policy, but Fleet Server believes it is running the new revision.

Bad Error Case

This is an unusual case, but a coding issue could cause it.

  1. New revision is sent to Elastic Agent
  2. Policy fails to be saved to disk (could be a coding issue or a filesystem problem)
  3. policy revision is ACK'd anyway (shouldn't happen, but if it does...)

The Elastic Agent is now running the old revision of the policy, but Fleet Server believes it is running the new revision.

Backup/restore of fleet.enc

The same can happen when fleet.enc is backed up and later restored:

  1. fleet.enc is backed up
  2. new policy revision occurs
  3. ACK'd by Elastic Agent (stored new revision in Fleet)
  4. fleet.enc is replaced with backup from 1
  5. Elastic Agent restarted

The Elastic Agent is now running the old revision of the policy, but Fleet Server believes it is running the new revision.

How to solve it?

Upon check-in, the Elastic Agent should send its current policy ID and revision. Fleet Server then compares them to what it expects and, if they do not match, sends the correct policy.
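To make the proposed comparison concrete, here is a minimal sketch in Go. It is only an illustration: the `PolicyState` type, its field names, and `NeedsPolicyResend` are hypothetical and are not taken from the actual Fleet Server or Elastic Agent code or check-in schema.

```go
package main

import "fmt"

// PolicyState identifies the policy an agent is running.
// Hypothetical type: the field names are assumptions, not the real check-in schema.
type PolicyState struct {
	PolicyID string
	Revision int64
}

// NeedsPolicyResend reports whether the server should send the policy again,
// given what it expects and what the agent reported at check-in.
// Any mismatch (different policy, older or newer revision) triggers a resend.
func NeedsPolicyResend(expected, reported PolicyState) bool {
	return expected != reported
}

func main() {
	expected := PolicyState{PolicyID: "policy-1", Revision: 5}

	// Agent rolled back by a VM snapshot or a restored fleet.enc.
	rolledBack := PolicyState{PolicyID: "policy-1", Revision: 4}
	fmt.Println(NeedsPolicyResend(expected, rolledBack)) // true

	// Agent already in sync.
	fmt.Println(NeedsPolicyResend(expected, expected)) // false
}
```

Because the decision depends only on the reported state, all three scenarios above (VM rollback, failed save with an erroneous ACK, restored fleet.enc) would be corrected on the next check-in, whatever was ACK'd earlier.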

blakerouse added the bug label Dec 27, 2024
jlind23 added the Team:Elastic-Agent-Control-Plane label Dec 30, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz
Member

cmacknz commented Dec 30, 2024

The Fleet Server stores this information by the fact that the Elastic Agent ACK'd the policy change notification, but there are many cases where this could be come out of sync.

This is another place, like upgrades, where we don't actually need explicit ACKs, because the actual state of the agent can be detected from the check-in payload.

More and more I think the way ACKs are used in action processing needs to be completely revisited or just designed out of the system. We frequently have an ACK and an actual state change or action result as two separate updates to the system state, which allows one to happen without the other. We should eliminate this problem.
