Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added SmartSwitch support in chassisd and enabling chassisd #467

Open
wants to merge 183 commits into
base: master
Choose a base branch
from

Conversation

rameshraghupathy
Copy link

@rameshraghupathy rameshraghupathy commented Apr 15, 2024

Added SmartSwitch support in chassisd and enabling chassisd for fixed SmartSwitches

Description

chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. chassisd will be enabled only on the smartswitch NPU and hence the scope of these changes are limited only to NPU and not applicable to DPU.

Motivation and Context

chassisd is enabled only for modular chassis. Smartswitch is a fixed chassis. However it has been treated like a modular chassis to manage the DPU cards just like the line-cards of a modular chassis. Hence, chassid is needed for SmartSwitch and enabled here. Also, some of the table updates and clean up are not required for SmartSwitch platform and hence using the is_smartswitch API to selectively enable it. the text "fixes #xxxx", "closes #xxxx" or "resolves #xxxx" here

How Has This Been Tested?

  1. Enabled and built and image and checked if the chassisd is running all the time without crashing
  2. Verify if the config change handler is working as expected by issuing config CLIs to startup and shutdown DPUs

Additional Information (Optional)

@oleksandrivantsiv
Copy link
Collaborator

What are the states supported by the DPUs in the Smart Switch?

@rameshraghupathy
Copy link
Author

What are the states supported by the DPUs in the Smart Switch?

”dpu_midplane_link_state”
”dpu_control_plane_state"
"dpu_data_plane_state"

sonic-chassisd/scripts/chassisd Outdated Show resolved Hide resolved
sonic-chassisd/scripts/chassisd Outdated Show resolved Hide resolved
accidental removal of a few lines from the original chassisd file
@@ -184,7 +188,7 @@ class ModuleUpdater(logger.Logger):
self.chassis = chassis
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that instead of modifying the existing ModuleUpdater class, we should implement SmartSwitchModuleUpdater(ModuleUpdater). SmartSwitchModuleUpdater should be derived from ModuleUpdater and overwrite the methods that should behave differently for the Smart Switch. This approach allows us to keep the original implementation untouched and guarantees full backward compatibility with the chassisd.

Copy link
Author

@rameshraghupathy rameshraghupathy Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will consider this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@oleksandrivantsiv oleksandrivantsiv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as commented

self.update_dpu_state(key, "UP")

elif midplane_access is False and current_midplane_state == 'False':
if self.is_module_reboot_system_up_expired(module_key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this function call?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell No we don't, cleaned it.

midplane_state = 'False'
if isinstance(fvs, list) and fvs[0] is True:
fvs = dict(fvs[-1])
midplane_state = fvs[CHASSIS_MIDPLANE_INFO_ACCESS_FIELD]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are checking for the midplane_state here in the DB (and if it is not present in the DB we are assuming that the module is not reachable since midplane_state=False by default) but in case of Nvidia platform the midplane_state is being updated to the DB by Chassisd itself, so chassisd here is checking for midplane state in DB before it is supposed to update the midplane state itself. Either we need to re order code so that we update midplane_state to db before we check the DB for values, or instead of checking the DB we can directly invoke the midplane reachability API

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this may cause issues, assuming the dpu is powered on initially, by default if there is no data in DB we assume the midplane_state is false, if no config is present the DPU is supposed to power off, but admin_state == 'down' and midplane_state is false, so the operation is skipped, causing DPU to stay on

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Using dpu operational_state and the API to get this as planned.


formatted_time = dt_obj.strftime("%a %b %d %I:%M:%S %p UTC %Y")

reboot_cause_dict = {
Copy link
Contributor

@gpunathilell gpunathilell Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_reboot_cause is supposed to return a tuple - a reboot cause and an accompanying description (from HLD) when we perform json.dump with a tuple, it is saved as a list can we split here as "cause" and "comment" separately? otherwise the # Publish the reboot cause information to CHASSIS_STATE_DB step fails in chassisd because we are performing hset to chassis_state_db with a list (Example return value)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved


# Prepare the updated data
updates = {
"dpu_midplane_link_state": state,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting suggestion, Can we have the midplane state similar to admin state and have in lower case? the show interfaces status also displays in lower case, Can be changed later if it takes time, also the time format can be changed to "%Y-%m-%d %H:%M:%S" (here or in sonic-utilities system_health command - since we are displaying this data to user)

if admin_state == 'up' and operational_state != ModuleBase.MODULE_STATUS_ONLINE:

Copy link
Author

@rameshraghupathy rameshraghupathy Nov 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpunathilell Made dpu_midplane_link_state lowercase. The time is already in the format "%Y-%m-%d %H:%M:%S" right?

@gpunathilell
Copy link
Contributor

gpunathilell commented Nov 23, 2024

The REBOOT_CAUSE_DIR is not accessible from the pmon docker where chassisd is running, (theres a diffierent /host/reboot created in the docker )REBOOT_CAUSE_DIR = "/host/reboot-cause/module/" The first-boot file which is created by sonic-host-services is not seen by chassisd, we need to use a different location which is available in the docker, and is persistent across reboots
Resolved after considering the buildimage PR

@rameshraghupathy
Copy link
Author

rameshraghupathy commented Nov 23, 2024

The REBOOT_CAUSE_DIR is not accessible from the pmon docker where chassisd is running, (theres a diffierent /host/reboot created in the docker )REBOOT_CAUSE_DIR = "/host/reboot-cause/module/" The first-boot file which is created by sonic-host-services is not seen by chassisd, we need to use a different location which is available in the docker, and is persistent across reboots

@gpunathilell Have you used the PR? This is needed for pmon docker to mount the volume.
image

@gpunathilell
Copy link
Contributor

The REBOOT_CAUSE_DIR is not accessible from the pmon docker where chassisd is running, (theres a diffierent /host/reboot created in the docker )REBOOT_CAUSE_DIR = "/host/reboot-cause/module/" The first-boot file which is created by sonic-host-services is not seen by chassisd, we need to use a different location which is available in the docker, and is persistent across reboots

@gpunathilell Have you used the PR? This is needed for pmon docker to mount the volume.
image

No, I see that the location is mounted in pmon in that PR, will include, please ignore comment

updates = {
"dpu_midplane_link_state": state,
"dpu_midplane_link_reason": "",
"dpu_midplane_link_time": datetime.now().strftime("%Y%m%d %H:%M:%S"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy I mean this line

Suggested change
"dpu_midplane_link_time": datetime.now().strftime("%Y%m%d %H:%M:%S"),
"dpu_midplane_link_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),

Will be displayed as such when we do show system-health dpu:
DPU2 Online dpu_midplane_link_state up 2024-11-25 23:38:21

PLATFORM_ENV_CONF_FILE = "/usr/share/sonic/platform/platform_env.conf"
PLATFORM_JSON_FILE = "/usr/share/sonic/platform/platform.json"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see platform.json file exists in /usr/share/sonic/devices//platform.json, am I missing something?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvolam The file exists, we are accessing it here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants