
Automated AMI upgrades #320

Closed
preflightsiren opened this issue Jul 9, 2021 · 10 comments · Fixed by #327

Comments

@preflightsiren
Contributor

Is this a BUG REPORT or FEATURE REQUEST?: Feature Request

Amazon publishes a supported, EKS-optimized AMI and an associated SSM parameter that is updated to point to the latest AMI for a given Kubernetes version.

aws ssm get-parameter --name "/aws/service/eks/optimized-ami/1.18/amazon-linux-2/recommended/image_id" --region us-east-1 --query "Parameter.Value" --output text

We could have instance-manager query the EKS API to determine the Kubernetes version of the cluster, query SSM for the latest AMI, and replace the image field.

This could then be coupled with an automated upgrade process like upgrade-manager, so that nodes are automatically replaced whenever a new AMI is released.
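To sketch the first half of that idea: the SSM parameter path in the command above is derived from the cluster's Kubernetes version, which the controller already has from DescribeCluster. A minimal Go sketch (function name and trimming behavior are my assumptions, not instance-manager code):

```go
package main

import (
	"fmt"
	"strings"
)

// eksOptimizedAMIParameter builds the public SSM parameter path for the
// EKS-optimized Amazon Linux 2 AMI from a cluster's Kubernetes version.
// The controller could pass this path to ssm:GetParameter and substitute
// the returned AMI ID for an image field set to "latest".
func eksOptimizedAMIParameter(clusterVersion string) string {
	// The parameter path uses "major.minor"; trim any patch component.
	parts := strings.SplitN(clusterVersion, ".", 3)
	if len(parts) > 2 {
		parts = parts[:2]
	}
	return fmt.Sprintf(
		"/aws/service/eks/optimized-ami/%s/amazon-linux-2/recommended/image_id",
		strings.Join(parts, "."))
}

func main() {
	fmt.Println(eksOptimizedAMIParameter("1.18"))
}
```

The actual lookup would then be a single `ssm:GetParameter` call against this path, as in the CLI example above.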

@eytan-avisror
Collaborator

Interesting idea @preflightsiren
Maybe we can support a 'latest' value for the image field.
We already have the DescribeCluster payload available; we just need the SSM call to get the latest image.

@backjo
Collaborator

backjo commented Jul 10, 2021

A few thoughts on use cases / edge cases for this feature:

  • Probably want to support a configurable worker k8s version - a 'surprise' upgrade during a master node version upgrade is probably not what folks will want.
  • It'd be nice to have some kind of approval or delay mechanism, so operators could choose when to roll out a new change instead of it happening right after ssm is updated.
  • On a similar note, it'd be useful to track version information about what AMI was used for operators to rollback if they need to.

@preflightsiren
Contributor Author

yeah, thinking about what I'd like to know, operating a cluster with this feature:

  • What is the current AMI?
  • What was the previous AMI?
  • When did they change?
  • How many nodes are running the current AMI?
  • How many nodes are running an AMI other than the current?

@eytan-avisror
Collaborator

eytan-avisror commented Jul 12, 2021

> A few thoughts on use cases / edge cases for this feature:
>
> * Probably want to support a configurable worker k8s version - a 'surprise' upgrade during a master node version upgrade is probably not what folks will want.
>
> * It'd be nice to have some kind of approval or delay mechanism, so operators could choose when to roll out a new change instead of it happening right after ssm is updated.
>
> * On a similar note, it'd be useful to track version information about what AMI was used for operators to rollback if they need to.

That might be a good feature even separately @backjo.
This can be useful in cases where you don't want an upgrade to be triggered immediately when you make a change, e.g. when a new controller version includes a user-data change.

Perhaps we can introduce a generic mechanism to allow users to stagger upgrades.
For example, the existence of an annotation instancemgr.keikoproj.io/lock-upgrades: true would avoid entering an upgrade state; when the annotation is removed, that will trigger another reconcile which can then process the upgrade.

This would allow users to choose the behavior they want - automatic upgrades without notification (as in the latest-AMI case), or use of a 'lock' that lets them trigger the upgrade by removing/adding the annotation.
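The lock check itself would be trivial in the reconciler. A hypothetical sketch (the annotation key is the one proposed above; treating any value other than "true" as unlocked is my assumption):

```go
package main

import "fmt"

// lockAnnotation is the annotation key proposed in this thread.
const lockAnnotation = "instancemgr.keikoproj.io/lock-upgrades"

// upgradesLocked reports whether an InstanceGroup's annotations request
// that the controller avoid entering an upgrade state.
func upgradesLocked(annotations map[string]string) bool {
	return annotations[lockAnnotation] == "true"
}

func main() {
	ig := map[string]string{lockAnnotation: "true"}
	fmt.Println(upgradesLocked(ig)) // locked: skip the upgrade, requeue

	delete(ig, lockAnnotation) // removing the annotation triggers a reconcile
	fmt.Println(upgradesLocked(ig))
}
```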

Regarding tracking which AMI is used - we now add a label to nodes with the AMI in use, so even if the IG says "latest", the nodes will show the actual AMI via labels. We could also add a status field with the resolved AMI.

WDYT

@backjo
Collaborator

backjo commented Jul 15, 2021

Just getting back to this - I really like what cert-manager did with the concept of CertificateRequests and approval (https://cert-manager.io/docs/concepts/certificaterequest/#approval)

We could do something similar - introduce an UpgradeRequest, where an IG that wishes to upgrade is gated by the approval of such a request. We could similarly have an internal controller (enabled by default) that auto-approves UpgradeRequest objects, but advanced users could choose to disable this controller and implement custom logic - it could be a slack webhook, a custom controller, etc. - it'd really be up to the user.

This could work with lock-upgrades as well, but in general I'm a fan of this approach where upgrades are more of an 'approval' than a manual update to the AMI.
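Roughly, the gating could look like the following sketch. The type and field names are hypothetical (modeled loosely on cert-manager's CertificateRequest approval), not instance-manager's actual API:

```go
package main

import "fmt"

// UpgradeRequest is a hypothetical CR gating an InstanceGroup upgrade.
type UpgradeRequest struct {
	InstanceGroup string
	TargetAMI     string
	Approved      bool
	Denied        bool
}

// autoApprove mimics the proposed default in-cluster approver: it
// approves any request that has not been explicitly denied. Advanced
// users would disable this and plug in custom logic (a Slack webhook,
// a separate controller, etc.).
func autoApprove(r *UpgradeRequest) {
	if !r.Denied {
		r.Approved = true
	}
}

func main() {
	req := &UpgradeRequest{InstanceGroup: "workers", TargetAMI: "ami-0abc"}
	autoApprove(req)
	fmt.Println(req.Approved)
}
```

The IG reconciler would then only proceed with the rotation once `Approved` is set, by whichever approver is in place.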

@preflightsiren
Contributor Author

do you mind if we split this into two parts:

  1. AMI ID updating from aws ssm params
  2. UpgradeRequest approval gate

@eytan-avisror
Collaborator

eytan-avisror commented Jul 17, 2021

@backjo
Can you clarify the suggestion a bit?
Are we talking about having the controller create a CR InstanceGroupUpgradeRequest whenever we detect a drift, and then end the requeue?
Then a secondary reconciler in the controller that watches InstanceGroupUpgradeRequest objects can reconcile this object when someone modifies the CR to Approved: true or Denied: true?

Should the point in time be before creation of the new Launch Config, or after?

The approval mechanism can then be configurable in the spec of InstanceGroup to be either manual (setting the CR to approved), or slack / email / webhook / etc.

I think this makes a nicer interface, but there are some considerations/questions and tradeoffs here:

  • The primary tradeoff is that with the annotation approach you can actually use the configmap to lock/unlock all IGs if you choose to (which is a more relevant use case). If you have 100 IGs, you would otherwise have to change 100 InstanceGroupUpgradeRequest objects - we should consider what a global option would be here. Annotation w/ configmap gives you both the single and the global option.

  • Instead of automatic approval, should we consider keeping the existing behavior when this mechanism is not enabled? Creating objects that get auto-approved seems redundant. This might be problematic since adding a reconciler is a controller-level option - meaning we would be forced to always submit/reconcile these objects with auto-approval.

  • What happens when an upgrade request is denied? Would the user need to make another IG change, or delete the denied InstanceGroupUpgradeRequest so that a new one can be created?

  • What happens to old InstanceGroupUpgradeRequest objects? Does the controller delete these objects at some point? If not, and we use unique names, what is the mechanism for cleaning those up?

@backjo
Collaborator

backjo commented Jul 20, 2021

@eytan-avisror how you describe it is largely how I imagine it working.

I wonder if we can combine these two approaches - maybe a hybrid approach works best here:

  • An update to an IG happens
  • If the IG is not annotated as 'locked', update proceeds as implemented today.
  • If the IG is annotated as locked, a CR is created and is in a pending state.
  • If the annotation for lock is removed from the IG, all CRs for the IG are removed, and the IG will reconcile as normal today.
  • If the request is denied, a change will be needed to the IG to trigger a new upgrade request.
  • Once the IG transitions back to ready state after an update, the CR gets removed.

@preflightsiren
Contributor Author

preflightsiren commented Aug 19, 2021

how do you guys feel about: #327 (comment)

> Just looking through https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-public-parameters-eks.html
>
> The ssm GetParameter response contains LastModifiedDate. This might be useful to create policies that allow teams to set different delays for updating. E.g. dev clusters immediately get new AMIs, staging clusters after 48h, prod after 1 week.
>
> What do you think?

we could include a delay after which an upgrade can occur

@eytan-avisror
Collaborator

That’s a pretty good idea.. I think it's especially interesting in the case of dev clusters.
Feel free to open a separate issue to explore that idea and propose a design.. I think it may be appropriate to add this field as a global on the configmap, somewhere such as a “schedules” object.
