[DPE-5232;DPE-5233] feat: support for scaling operations in KRaft mode (single & multi-app) #281
base: main
Conversation
src/core/models.py (outdated)
```python
def _fetch_from_secrets(self, group, field):
    if not self.relation:
        return ""

    if secrets_uri := self.relation.data[self.relation.app].get(
        f"{PROV_SECRET_PREFIX}{group}"
    ):
        if secret := self.data_interface._model.get_secret(id=secrets_uri):
            return secret.get_content().get(field, "")

    return ""
```
praise: This is much better 👍🏾 - thank you for the improvement!
Can we also use this for the other properties that require the painful `_fetch_relation_data_with_secrets`?
Honestly, I'm a little afraid to do so given the nature of `_fetch_relation_data_with_secrets`, which carries a bit of legacy :(
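To make the suggestion concrete, here is a minimal sketch of how another secret-backed property could reuse `_fetch_from_secrets` instead of going through `_fetch_relation_data_with_secrets`. The property name, secret group, and field below are assumptions for illustration, not the charm's actual data model:

```python
# Illustrative only: property name, secret group and field are made up.
@property
def broker_password(self) -> str:
    """Password for the internal broker user, resolved via the relation secret."""
    return self._fetch_from_secrets("broker", "password")
```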
src/events/broker.py (outdated)
```python
@retry(
    wait=wait_fixed(15),
    stop=stop_after_attempt(4),
    reraise=True,
)
```
question: What issues were you seeing with the retries? As in, what errors?
Sometimes the broker is still initializing (the listener is not ready yet) and we hit an error there; the retry makes sure we don't fail on that transient condition.
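For context, this is roughly how the decorator above guards a call against that transient state; the wrapped helper and its error handling are illustrative, not the charm's actual code:

```python
from tenacity import retry, stop_after_attempt, wait_fixed


@retry(wait=wait_fixed(15), stop=stop_after_attempt(4), reraise=True)
def wait_for_listener(check_listener_ready) -> None:
    # `check_listener_ready` is a placeholder callable that returns False while
    # the broker listener is still initializing. tenacity re-runs this function
    # up to 4 times, 15 seconds apart, and re-raises the last error if it never
    # succeeds.
    if not check_listener_ready():
        raise TimeoutError("broker listener not ready yet")
```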
From a high level, the code looks good to me. I really appreciate the introduction of managers and handlers for the Controller, which I believe improves the structure of the code.
The business logic seems reasonable to me, although I'm not a big expert on KRaft, so I'd defer this to others. If @zmraul has bandwidth, it would also be good to get his thoughts/review on the more KRaft-specific business logic and concepts, unless both Marc and Alex feel comfortable with this.
Just a few small points here and there. Great work @imanenami!
```diff
@@ -94,6 +94,7 @@ def __init__(self, charm) -> None:

         self.provider = KafkaProvider(self)
         self.oauth = OAuthHandler(self)
         self.kraft = KRaftHandler(self)
```
question: it may be a naive question, but don't we want to enable this based on the if statement above? I would have logically thought that it is either ZooKeeper or KRaft...
We could enable it based on `ClusterState.kraft_mode`, which includes the condition for the peer_cluster role, but there's no harm in enabling it globally because of the checks already present in the controller handlers.
> globally because of the checks already present in the controller handlers.

Uhm, my preference here would be to have "general" checks in another "general" handler, and the custom ones related to controllers in the `KRaftHandler`, which runs only in KRaft mode. But double-check with the other engineers; this is non-blocking. My only point - from a bit of an outsider perspective - is that the code above looks slightly counterintuitive to me.
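For illustration, a minimal sketch of the conditional wiring being discussed; apart from `KRaftHandler`, the attribute and class names are assumptions rather than the charm's actual API:

```python
# Sketch only: pick the handler based on the cluster mode instead of always
# instantiating KRaftHandler. `ZooKeeperHandler` is a placeholder name.
if self.state.kraft_mode:
    self.kraft = KRaftHandler(self)
else:
    self.zookeeper = ZooKeeperHandler(self)
```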
Amazing work! I'm surprised by how simple the implementation feels compared to the complexity of the task at hand, great work :)
I left a couple of TODOs, and some points that should be considered in the future.
```diff
@@ -226,6 +232,13 @@ def unit_broker(self) -> KafkaBroker:
             substrate=self.substrate,
         )

     @property
     def kraft_unit_id(self) -> int:
```
todo: this needs to be called `bootstrap_unit_id` to be consistent with the models, no?
```diff
@@ -455,6 +468,13 @@ def controller_quorum_uris(self) -> str:
         )
         return ""

     @property
     def bootstrap_controller(self) -> str:
```
I see that `controller_quorum_uris` is not being used for the workload anymore; it has been replaced by a combination of the `bootstrap_controller` and `kraft_unit_id` keys.
If this is the expected use of them, there are a few things that would need to be checked and fixed:

- Remove `controller_quorum_uris` from the codebase
- On `ClusterState._broker_status()` add the correct checks:

```python
@property
def _broker_status(self) -> Status:  # noqa: C901
    """Checks for role=broker specific readiness."""
    ...
    if self.kraft_mode:
        if not (self.peer_cluster.bootstrap_controller and self.peer_cluster.bootstrap_unit_id):
            return Status.NO_QUORUM_URIS
        if not self.cluster.cluster_uuid:
            return Status.NO_CLUSTER_UUID
    ...
```

- Check this logic on the `controller` handler
```python
        return self.relation_data.get("directory-id", "")

    @property
    def added_to_quorum(self) -> bool:
```
A couple of important points here, not really blocking, but I think they are good for future improvements:

- Logic around `added_to_quorum` should probably be checked against the cluster itself, and be tracked as little as possible on the charm.
- Something like an upgrade or a failure on the unit will not be correctly resolved. At the moment there is no unsetting of this property; it will stay as `true` once the unit is first added to the quorum, so a failing unit would look at the databag and see `added_to_quorum=True` when in reality the unit is in a recovery state.

Even if we unset the property, a failing unit is still problematic. Chances are `remove_from_quorum` was never called before the failure, still leading to a wrong recovery state on the unit. This is why I'm suggesting to generally have `added_to_quorum` be a live check against Kafka itself.
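To illustrate what a live membership check might look like, here is a hedged sketch that shells out to Kafka's `kafka-metadata-quorum.sh` tool; the `run_bin_command` helper signature and the output parsing are assumptions about the workload layer, not the charm's actual implementation:

```python
def is_in_quorum(workload, bootstrap_controller: str, unit_id: int) -> bool:
    """Check quorum membership against Kafka itself instead of the databag.

    Assumes `workload.run_bin_command` wraps Kafka's kafka-metadata-quorum.sh
    CLI and returns its stdout as a string.
    """
    output = workload.run_bin_command(
        "metadata-quorum",
        ["--bootstrap-controller", bootstrap_controller, "describe", "--replication"],
    )
    # `describe --replication` prints one row per voter/observer; the first
    # column is the node id and the last column is its status. Parsing here is
    # illustrative and depends on the exact CLI output format.
    voters = set()
    for line in output.splitlines()[1:]:
        fields = line.split()
        if fields and fields[-1] in ("Leader", "Follower"):
            voters.add(fields[0])
    return str(unit_id) in voters
```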
```python
        # change bootstrap controller config on followers and brokers
        self.charm.on.config_changed.emit()

    def _add_controller(self) -> None:
```
nit: naming and scope on these two methods should be consistent:

- Either both are private or both are public (leading underscore on the name)
- Either `add_controller` / `remove_controller` or `add_to_quorum` / `remove_from_quorum`
```python
            self.charm.state.kraft_unit_id,
            directory_id,
            bootstrap_node=self.charm.state.bootstrap_controller,
        )
```
todo: I think this is missing here: `self.charm.state.unit_broker.update({"added-to-quorum": ""})`
```diff
@@ -167,20 +168,13 @@ def _on_start(self, event: StartEvent | PebbleReadyEvent) -> None:  # noqa: C901
         if not self.upgrade.idle:
             return

         if self.charm.state.kraft_mode:
             self._init_kraft_mode()

         # FIXME ready to start probably needs to account for credentials being created beforehand
```
I think this comment can be removed. Credential issues in the start process have already been solved.
```python
            initial_controllers=f"{self.charm.state.peer_cluster.bootstrap_unit_id}@{self.charm.state.peer_cluster.bootstrap_controller}:{self.charm.state.peer_cluster.bootstrap_replica_id}",
        )

    def _leader_elected(self, event: LeaderElectedEvent) -> None:
```
todo: I'm a bit itchy when seeing/using the `leader_elected` event. Following up on @marcoppenheimer's comment, I would rather have this logic happen in `update_status` or `config_changed` (since update_status calls config_changed).

- afaik, there are no guarantees that `leader_elected` ever triggers after the first deployment
- Triggers of `leader_elected` tend to happen in the context of the controller failing to see a unit, which can mean unit failure, so this code being close to failure states is not the best :)
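As a rough illustration of that suggestion, the reconciliation could hang off `update_status`/`config_changed` rather than `leader_elected`; the handler and state attribute names below are assumptions, not the charm's actual API:

```python
# Sketch only: re-run the bootstrap-controller reconciliation on every
# update-status tick instead of waiting for a leader_elected event.
def _on_update_status(self, event) -> None:
    if not self.charm.unit.is_leader() or not self.charm.state.kraft_mode:
        return
    # Re-emitting config_changed lets followers and brokers pick up a new
    # bootstrap controller even when leader_elected never fires again.
    self.charm.on.config_changed.emit()
```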