
[DPE-2362][DPE-2369][DPE-2374] HA: full cluster crash, full cluster restart, leader restart, leader freeze. #136

Merged

17 commits merged from test/dpe-2362-full-cluster-restart into test/high-availability on Sep 27, 2023

Conversation

zmraul
Contributor

@zmraul zmraul commented Sep 18, 2023

Add HA tests and fix ContinuousWrites implementation.

  • Leader restart
  • Leader freeze
  • Full cluster crash
  • Full cluster restart

Added more checks in between test stages.

Fixes:

  • ContinuousWrites no longer reinitializes the client when there is an error; it simply retries until an exception is raised.
  • Exceptions raised while producing messages are now handled inside the continuous-writes subprocess.
  • Count now correctly accounts for zero-based indexing: last index of messages produced == count of consumed messages + lost messages - 1
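The off-by-one relationship in the last fix can be illustrated with a toy example (all variable names here are stand-ins, not the PR's actual code):

```python
# Toy illustration of the indexing fix: messages are indexed from 0,
# so the last produced index equals consumed + lost - 1.
messages_produced = list(range(10))  # message indices 0..9
lost_messages = 2
consumed_messages = len(messages_produced) - lost_messages  # 8 delivered

last_index = messages_produced[-1]  # 9, not 10, because indexing starts at 0
assert last_index == consumed_messages + lost_messages - 1
```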

@zmraul zmraul force-pushed the test/dpe-2362-full-cluster-restart branch from 19a6eda to b3b9277 Compare September 18, 2023 10:59
@zmraul zmraul requested a review from deusebio September 18, 2023 10:59
@zmraul zmraul changed the title [DPE-2362] HA: full cluster restart, full cluster restart, leader restart, leader freeze. [DPE-2362][DPE-2369][DPE-2374] HA: full cluster restart, full cluster restart, leader restart, leader freeze. Sep 18, 2023
@zmraul zmraul changed the title [DPE-2362][DPE-2369][DPE-2374] HA: full cluster restart, full cluster restart, leader restart, leader freeze. [DPE-2362][DPE-2369][DPE-2374] HA: full cluster crash, full cluster restart, leader restart, leader freeze. Sep 18, 2023
Contributor

@deusebio deusebio left a comment


Great job! The code looks good! Let's just make sure the tests pass.

A few minor comments attached.

count: int = -1
last_message: Optional[object] = None
last_expected_message: int = -1
lost_messages: int = -1

nitpick, minor: I'd rather use Optional instead of -1, or not have any default at all, actually.
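A sketch of that suggestion, assuming the field names shown in the diff (illustrative only, not the repo's actual class):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContinuousWritesResult:
    # None signals "not yet computed" more clearly than a -1 sentinel
    count: Optional[int] = None
    last_message: Optional[object] = None
    last_expected_message: Optional[int] = None
    lost_messages: Optional[int] = None
```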

@@ -26,6 +27,14 @@
logger = logging.getLogger(__name__)


@dataclass
class ContinuousWritesResult:

praise: I see that you have been using dataclasses. I really like this; it makes function returns very meaningful objects, which improves readability.

tests/integration/ha/continuous_writes.py (outdated, resolved)
stderr=PIPE,
shell=True,
universal_newlines=True,
)
result = TopicDescription()
result.leader = int(re.search(r"Leader: (\d+)", output)[1])
result.in_sync_replicas = {int(i) for i in re.search(r"Isr: ([\d,]+)", output)[1].split(",")}

nitpick, minor: I believe it is a tiny bit more pythonic to use

return TopicDescription(leader=..., in_sync_replica=...)
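A runnable sketch of this suggestion, with a minimal stand-in for `TopicDescription` and a fabricated `output` string (both are assumptions for illustration, not the repo's actual code):

```python
import re
from dataclasses import dataclass


@dataclass
class TopicDescription:
    # minimal stand-in for the test suite's class
    leader: int
    in_sync_replicas: set


def parse_description(output: str) -> TopicDescription:
    # build the result in a single expression instead of
    # mutating an empty instance field by field
    return TopicDescription(
        leader=int(re.search(r"Leader: (\d+)", output)[1]),
        in_sync_replicas={
            int(i) for i in re.search(r"Isr: ([\d,]+)", output)[1].split(",")
        },
    )


# fabricated fragment mimicking `kafka-topics --describe` output
output = "Partition: 0  Leader: 2  Replicas: 0,1,2  Isr: 0,1,2"
```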


# example of topic offset output: 'test-topic:0:10'
result = check_output(
f"JUJU_MODEL={ops_test.model_full_name} juju ssh {unit_name} sudo -i 'charmed-kafka.get-offsets --bootstrap-server {bootstrap_server} --command-config {KafkaSnap.CONF_PATH}/client.properties --topic {topic}'",
f"JUJU_MODEL={ops_test.model_full_name} juju ssh kafka/0 sudo -i 'charmed-kafka.get-offsets --bootstrap-server {','.join(bootstrap_servers)} --command-config {KafkaSnap.CONF_PATH}/client.properties --topic {topic}'",

question: what if kafka/0 is down? Maybe better to use kafka/leader?

Contributor Author

@zmraul zmraul Sep 22, 2023

Same situation. The CLI commands work without the main Kafka process being up. So if the whole unit is down, the test is not even meant to pass.
In short, this command is not dependent on the unit.

Contributor

@marcoppenheimer marcoppenheimer Sep 22, 2023

Just a nudge here to not rely on specific unit numbers, but instead to grab one from ops_test.model.applications[APP_NAME].units[X].name


logger.info(
f"Killing broker of leader for topic '{ContinuousWrites.TOPIC_NAME}': {initial_leader_num}"
)
await send_control_signal(
ops_test=ops_test, unit_name=f"{APP_NAME}/{initial_leader_num}", kill_code="SIGKILL"
ops_test=ops_test, unit_name=f"{APP_NAME}/{initial_leader_num}", signal="SIGKILL"
)
# Give time for the service to restart
Contributor

question: should we assert that the in-sync replicas are no longer all units?

Contributor Author

Yes! I can add that here.

topic_description = await get_topic_description(
ops_test=ops_test, topic=ContinuousWrites.TOPIC_NAME
)
initial_offsets = await get_topic_offsets(ops_test=ops_test, topic=ContinuousWrites.TOPIC_NAME)
Contributor

question: shouldn't this be at the top, before killing the brokers? It is probably equivalent, but I believe it would make the test a bit clearer, in the sense that we:

  1. Set the stage and get pre-information
  2. Do the action (killing everything)
  3. do asserts

Contributor Author

@zmraul zmraul Sep 22, 2023

Initial offsets being here is intentional, as I want to check that offsets are still increasing after all brokers are back up. If I initialize before killing brokers, the offsets will still report an increase because of the time it takes to actually kill the brokers.

# your suggestion:
# producer already writing
initial_offsets = get_offsets() <---
. . .                              |--- # Between these two moments, offsets are still increasing,
                                   |    # so a check afterwards would always succeed
kill_all_brokers()              <---
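A toy, runnable version of this ordering argument (the helper names are stand-ins for the real test helpers, not the suite's actual functions):

```python
import itertools

# a live producer: every observation sees a strictly larger offset
_offsets = itertools.count()

def get_offsets() -> int:
    return next(_offsets)

def kill_and_restart_all_brokers() -> None:
    pass  # placeholder for the SIGKILL + service restart in the real test

# snapshot offsets only AFTER the brokers are back up, so the final
# assertion proves writes resumed after the restart rather than merely
# continuing during the time it takes to kill the brokers
kill_and_restart_all_brokers()
initial_offsets = get_offsets()
final_offsets = get_offsets()
assert final_offsets > initial_offsets
```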

# Give time for the service to restart
await asyncio.sleep(15)

initial_offsets = await get_topic_offsets(ops_test=ops_test, topic=ContinuousWrites.TOPIC_NAME)
Contributor

as above

Contributor

.gitignore as in the other PR

Contributor Author

It is included in .gitignore. For some reason it got through in some previous commit.

@zmraul zmraul marked this pull request as ready for review September 25, 2023 11:18
@zmraul zmraul merged commit 27e79ca into test/high-availability Sep 27, 2023
@zmraul zmraul deleted the test/dpe-2362-full-cluster-restart branch September 27, 2023 12:30
zmraul added a commit that referenced this pull request Sep 27, 2023
…estart, leader restart, leader freeze. (#136)

* add extra HA tests

* change to full acks
zmraul added a commit that referenced this pull request Oct 17, 2023
…estart, leader restart, leader freeze. (#136)

* add extra HA tests

* change to full acks