Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement per node Affinity & Tolerations #102

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

glebiller
Copy link

@glebiller glebiller commented Mar 10, 2022

Currently, the Affinity & Tolerations are set for all the StatefulSets uniformly.
This prevents forcing StatefulSets to a particular region or particular node.

This PR adds a new field spec.nodeSets that is mutually exclusive with spec.replicas.
This new field allows setting an Affinity or Tolerations to a particular node while keeping the a default in the spec.scheduler.

Added the required tests to cover the use cases.

NB: I am considering simplifying the logic so that the field replicas will generate a number of empty node in the nodeSets to match the value. This will allow simplifying the reconciliation.

@alex-arica
Copy link
Member

@glebiller Thank you for the PR.

I will review it this weekend and I will let you know on Monday morning about it. Thanks for your contribution.

@alex-arica
Copy link
Member

@glebiller apologies that I could not reply earlier.

I understand that you would like to have the option to apply specific Affinity & Tolerations rules to specific StatefulSets.

Kubegres can replace a Primary Postgres StatefulSet by a Replica StatefulSet and it can also remove a Replica Postgres StatefulSet if it is identified as unavailable which will be replaced by a new Replica StatefulSet. This failover behaviour has for result the incrementation of the last instance index when assigning new names to the newly created StatefulSets.

For example, let's say we have a cluster of Postgres with 3 StatefulSets, with the names: mypostgres-1, mypostgres-2 and mypostgres-3.

The name "[postgres clustername]-[integer] has for instance index [integer]. In mypostgres-1 the instance index is 1.

If the Primary StatefulSet mypostgres-1 is unavailable it is replaced by a Replica StatefulSet mypostgres-2 which will be promoted as a new Primary. And a new Replica StatefulSet will be created. After the failover, the cluster will contains: mypostgres-2, mypostgres-3 and mypostgres-4.

The instance index 1 does not exist once the failover is completed.

If a Replica StatefulSet is unavailable, let's say mypostgres-3, then it will be replaced by a new Replica StatefulSet. After the failover the cluster will contains: mypostgres-2, mypostgres-4 and mypostgres-5.

The instance index 3 does not exist once the failover is completed.

Correct me if I am wrong, it seems like the array spec.nodeSets is using its array index number to identify the instance index on which to apply the Affinity & Tolerations rules. If the failover use cases above happen, how do you keep track of the configuration since the instance indexes would change?

@glebiller
Copy link
Author

@alex-arica No worries for the delay :)

I understood the behavior you described before, and the Failover unit test was working because it was re-creating the missing Instance instead of incrementing the instance index to create a new Statefulset.

I just commit the second part of the changes that does the following:

  • replace the Instance type int32 by string
  • replace the "index" label by "app.kubernetes.io/instance" which will track the instance name instead of the index
  • updated the reconcile loop to re-create StatefulSets that are missing or extra. This will not create any StatefulSets with an instance name different than the one defined in nodeSets (like the 4 and 5 you mentioned).
  • use the instance name to match the deployed Pod / StatefulSet to the Spec

That change should also help with #88 since it will keep the name of the StatefulSets organized, in addition to being able to change the configuration of each nodes Tolerations & Affinities separately.

@alex-arica
Copy link
Member

Thank you for the update @glebiller

I will review it as soon as I can and let you know.

Use the Kubernetes recommanded label "app.kubernetes.io/instance" to store the StatefulSet instance.
Replace the int32 index variable by the name of the StatefulSet instance defined in `nodeSets`.
@kmiszta
Copy link

kmiszta commented Jul 31, 2023

Hi, will this be merged somewhere in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants