
[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) #281

Open

imanenami wants to merge 17 commits into main from wip-kafka-3-9

Conversation

@imanenami (Contributor) commented Dec 2, 2024

Changes

  • Upgrade to Kafka 3.9.0
  • Switch from a static KRaft quorum to a dynamic quorum (see the sketch below)
  • Add scaling operations to the single- and multi-app charms
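
For context, a minimal sketch of the configuration difference behind the dynamic-quorum bullet (the option names are Kafka's KRaft/KIP-853 settings; the helper and host names are illustrative, not taken from the charm):

def quorum_properties(dynamic: bool) -> list[str]:
    """Illustrative server.properties lines for static vs. dynamic quorum."""
    if not dynamic:
        # Static quorum (pre-3.9): the full voter set is pinned in the config,
        # so changing the controller set means reformatting and restarts.
        return ["controller.quorum.voters=0@kafka-0:9097,1@kafka-1:9097,2@kafka-2:9097"]
    # Dynamic quorum (Kafka 3.9, KIP-853): only a bootstrap endpoint is
    # configured; controllers are added/removed from the quorum at runtime.
    return ["controller.quorum.bootstrap.servers=kafka-0:9097"]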

@imanenami force-pushed the wip-kafka-3-9 branch 2 times, most recently from 9ccaf8a to 40bf7bf on December 9, 2024
@imanenami marked this pull request as ready for review on December 12, 2024
@imanenami force-pushed the wip-kafka-3-9 branch 3 times, most recently from 5ee521b to 5465d74 on December 13, 2024
@imanenami changed the title from "WIP: upgrade to kafka 3.9" to "[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app)" on Dec 17, 2024
Comment on lines 135 to 145
def _fetch_from_secrets(self, group: str, field: str) -> str:
    if not self.relation:
        return ""

    # The provider publishes the secret URI in the relation databag under a
    # prefixed key; resolve the secret and read the requested field from it.
    if secrets_uri := self.relation.data[self.relation.app].get(
        f"{PROV_SECRET_PREFIX}{group}"
    ):
        if secret := self.data_interface._model.get_secret(id=secrets_uri):
            return secret.get_content().get(field, "")

    return ""
Contributor:
praise: This is much better 👍🏾 - thank you for the improvement!

Contributor:
Can we also use this for the other properties that require the painful `_fetch_relation_data_with_secrets`?

Contributor Author:
Honestly, I'm a little afraid to do so, given the nature of `_fetch_relation_data_with_secrets`, which carries a bit of legacy :(

src/core/models.py (resolved)
src/core/models.py (outdated; resolved)
src/core/models.py (resolved)
src/events/broker.py (outdated; resolved)
src/events/broker.py (outdated; resolved)
src/events/broker.py (outdated; resolved)
Comment on lines 517 to 521
@retry(
wait=wait_fixed(15),
stop=stop_after_attempt(4),
reraise=True,
)
Contributor:
question: What issues were you seeing with the retries? As in, what errors?

Contributor Author:
Sometimes the broker is still initializing (the listener is not ready yet) and we hit an error there; the retry is there to ride out that condition.
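
For illustration, the retry budget this decorator gives (standard tenacity semantics; describe_quorum is a hypothetical stand-in for the wrapped call):

from tenacity import retry, stop_after_attempt, wait_fixed

@retry(wait=wait_fixed(15), stop=stop_after_attempt(4), reraise=True)
def describe_quorum() -> str:
    # Raises (e.g. a connection error) while the controller listener is still
    # initializing; tenacity retries up to 4 attempts, 15s apart (~45s of
    # total waiting), then re-raises the last error unchanged.
    ...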

src/events/broker.py (outdated; resolved)
src/managers/config.py (outdated; resolved)
@imanenami requested a review from deusebio on December 18, 2024
@deusebio (Contributor) left a comment:
From a high level, the code looks good to me. I really appreciated the introduction of managers and handlers for the Controller, which I believe improves the structure of the code.

The business logic seems reasonable to me, although I'm not a big expert on KRaft, so I'd defer that to others. If @zmraul has bandwidth, it could be good to get his thoughts/review on the more KRaft-specific business logic and concepts, unless both Marc and Alex feel comfortable with this.

Just a few small points here and there. Great work @imanenami!


src/managers/controller.py (outdated; resolved)
@@ -94,6 +94,7 @@ def __init__(self, charm) -> None:

self.provider = KafkaProvider(self)
self.oauth = OAuthHandler(self)
self.kraft = KRaftHandler(self)
Contributor:
question: it may be a naive question, but don't we want to enable this based on the if statement above? I would have logically thought that it is either ZooKeeper or KRaft...

Contributor Author:

We could enable it based on `ClusterState.kraft_mode`, which includes a condition for the peer_cluster role, but there's no harm in enabling it globally because of the checks that exist in the controller handlers.

Contributor:

globally because of the checks that exist in the controller handlers.

Hmm, my preference here would be to keep "general" checks in another "general" handler, and the controller-specific ones in the KRaftHandler, which would run only in KRaft mode. But double-check with the other engineers; this is non-blocking. My only point, from a bit of an outsider's perspective, is that the code above looks slightly counterintuitive to me.
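
Something like this minimal sketch of that preference (hypothetical; the exact attribute path to `ClusterState.kraft_mode` is assumed):

self.provider = KafkaProvider(self)
self.oauth = OAuthHandler(self)
if self.charm.state.kraft_mode:
    # only wire up KRaft-specific handlers when not running against ZooKeeper
    self.kraft = KRaftHandler(self)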

src/events/controller.py (resolved)
src/literals.py (outdated; resolved)
@imanenami requested a review from zmraul on December 19, 2024
@zmraul (Contributor) left a comment:
Amazing work! I'm surprised by how simple the implementation feels compared to the complexity of the task at hand. Great work :)

I left a couple of TODOs, and some points that should be considered in the future.

@@ -226,6 +232,13 @@ def unit_broker(self) -> KafkaBroker:
substrate=self.substrate,
)

@property
def kraft_unit_id(self) -> int:
Contributor:
todo: this needs to be called `bootstrap_unit_id` to be consistent with the models, no?

@@ -455,6 +468,13 @@ def controller_quorum_uris(self) -> str:
)
return ""

@property
def bootstrap_controller(self) -> str:
Contributor:
I see that `controller_quorum_uris` is not being used anymore for the workload; it has been replaced by a combination of the `bootstrap_controller` and `kraft_unit_id` keys.

If this is the expected use of them, there are a few things that would need to be checked and fixed:

  • Remove `controller_quorum_uris` from the codebase
  • In `ClusterState._broker_status()`, add the correct checks:
    @property
    def _broker_status(self) -> Status:  # noqa: C901
        """Checks for role=broker specific readiness."""
        . . .

        if self.kraft_mode:
            if not (self.peer_cluster.bootstrap_controller and self.peer_cluster.bootstrap_unit_id):
                return Status.NO_QUORUM_URIS
            if not self.cluster.cluster_uuid:
                return Status.NO_CLUSTER_UUID
        . . .

return self.relation_data.get("directory-id", "")

@property
def added_to_quorum(self) -> bool:
@zmraul (Contributor) commented Dec 20, 2024:
A couple of important points here; not really blocking, but I think they are good for future improvements:

  • Logic around added_to_quorum should probably be checked against the cluster itself, and tracked as little as possible on the charm.
  • Something like an upgrade or a failure on the unit will not be correctly resolved.
    At the moment there is no unsetting of this property; it will stay true once the unit is first added to the quorum, so a failing unit would look at the databag and see added_to_quorum=True when the reality is that the unit is in a recovery state.

Even if we unset the property, a failing unit is still problematic: chances are remove_from_quorum was never called before the failure, still leading to a wrong recovery state on the unit. This is why I'm suggesting that added_to_quorum generally be a live check against Kafka itself (see the sketch below).
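
A rough sketch of that direction (hypothetical: the run_bin_command helper, its arguments, the attribute locations, and the output parsing are all assumptions; kafka-metadata-quorum.sh describe --status does report the current voter set in Kafka 3.9):

@property
def added_to_quorum(self) -> bool:
    """Live check: is this unit currently a voter in the KRaft quorum?"""
    status = self.workload.run_bin_command(
        "metadata-quorum",
        ["--bootstrap-controller", self.bootstrap_controller, "describe", "--status"],
    )
    # Naive parse for illustration only: look for this unit's node id in the
    # CurrentVoters set instead of trusting the peer databag.
    return f"id: {self.kraft_unit_id}" in status.replace('"', "")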

# change bootstrap controller config on followers and brokers
self.charm.on.config_changed.emit()

def _add_controller(self) -> None:
Contributor:
nit: naming and scope on these two methods should be consistent:

  • Either both are private or both are public (leading underscore in the name)
  • Either add_controller | remove_controller or add_to_quorum | remove_from_quorum

self.charm.state.kraft_unit_id,
directory_id,
bootstrap_node=self.charm.state.bootstrap_controller,
)
Contributor:
todo: I think this is missing here:

self.charm.state.unit_broker.update({"added-to-quorum": ""})

@@ -167,20 +168,13 @@ def _on_start(self, event: StartEvent | PebbleReadyEvent) -> None: # noqa: C901
if not self.upgrade.idle:
return

if self.charm.state.kraft_mode:
self._init_kraft_mode()

# FIXME ready to start probably needs to account for credentials being created beforehand
Contributor:
I think this comment can be removed; credential issues in the start process were already solved.

initial_controllers=f"{self.charm.state.peer_cluster.bootstrap_unit_id}@{self.charm.state.peer_cluster.bootstrap_controller}:{self.charm.state.peer_cluster.bootstrap_replica_id}",
)

def _leader_elected(self, event: LeaderElectedEvent) -> None:
Contributor:
todo: I'm a bit itchy when seeing/using the leader_elected event. Following up on @marcoppenheimer's comment, I would rather have this logic happen in update_status or config_changed (since update_status calls config_changed).

  • afaik, there are no guarantees that leader_elected ever triggers again after the first deployment
  • leader_elected tends to trigger when the controller fails to see a unit, which can mean unit failure, so having this code close to failure states is not the best :)
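
A sketch of the suggested re-homing (hypothetical wiring; _ensure_controller_quorum is a made-up name for the logic currently living in _leader_elected):

# wiring (in the handler's __init__):
self.framework.observe(self.charm.on.update_status, self._on_update_status)

# handler:
def _on_update_status(self, event) -> None:
    # update_status fires on a regular cadence regardless of leadership
    # transitions, so the quorum bookkeeping is re-checked even after failures
    if self.charm.unit.is_leader():
        self._ensure_controller_quorum()  # hypothetical helper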

Development

Successfully merging this pull request may close these issues:

Client relation-changed and relation-broken fails with ACL error in Kraft mode