Implement Clustering and App related functions #4407

Open · wants to merge 2 commits into master from naiming-cluster-hypervisor
Conversation

naiming-zededa
Contributor

  • change VMI to VMI ReplicaSet for kubernetes
  • change Pod to Pod ReplicaSet for containers
  • change functions handling ReplicaSet names in services
  • subscribe to EdgeNodeInfo in domainmgr and zedmanager to get the node-name for the cluster
  • add Designated Node ID to several structs for App
  • do not delete the domain from kubernetes if this node is not the Designated App node
  • parse config for EdgeNodeClusterConfig in zedagent
  • handle ENClusterAppStatus publication in zedmanager in the multi-node clustering case
  • zedmanager handling of effective-activation includes ENClusterAppStatus
  • kubevirt hypervisor changes to handle VMI/Pod ReplicaSets

}
if err := hyper.Task(status).Cleanup(status.DomainName); err != nil {
log.Errorf("failed to cleanup domain: %s (%v)", status.DomainName, err)
// in cluster mode, we can not delete the pod due to failing to get app info
Contributor

Was the issue this is fixing something appearing when there is a failover/takeover and another node in the cluster starts running the app instance?
Or is it something which could happen when an app instance is first provisioned on the cluster?

Contributor Author

This can happen even on the first node of the app deployment: sometimes we cannot get the status from the k3s cluster, or it takes time for the app to come up to the running state. But we should not remove the kubernetes configuration; the cluster has the config stored in its database and has its own scheduling and control process to eventually bring the app to the intended state. If we delete the config from the cluster, we would need to wait another 10 minutes to retry, etc., which would be confusing.

Contributor Author

@naiming-zededa naiming-zededa Nov 7, 2024

So a new boolean, DomainConfigDeleted, is introduced in the DomainStatus. It allows the Designated node, once it knows for sure the app instance has been removed from the device, to go ahead and delete the app/domain from the cluster.
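For illustration, a minimal self-contained sketch of that rule (the struct is a trimmed stand-in; the field names IsDNidNode and DomainConfigDeleted follow the discussion in this thread, everything else is hypothetical):

package main

import "fmt"

// Trimmed stand-in for the pillar DomainStatus fields discussed here.
type DomainStatus struct {
	DomainName          string
	IsDNidNode          bool // this node is the Designated App node for the app
	DomainConfigDeleted bool // the app instance config is known to be removed from the device
}

// shouldDeleteFromCluster mirrors the rule described above: only the
// Designated node deletes the app/domain from the cluster, and only once
// it knows for sure the app instance has been removed from the device.
func shouldDeleteFromCluster(s DomainStatus) bool {
	return s.IsDNidNode && s.DomainConfigDeleted
}

func main() {
	s := DomainStatus{DomainName: "app1", IsDNidNode: true, DomainConfigDeleted: true}
	fmt.Println(shouldDeleteFromCluster(s)) // true: safe to delete from the cluster
}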

Contributor

It would be good to capture the above explanation either in a comment here or in pkg/pillar/docs/zedkube.md

Contributor Author

Sure. I documented this in zedkube.md and referenced it from domainmgr.go.

Contributor

Hmm - re-reading it and it still looks odd.
Does the kubevirt Info() return an error or does it return HALTED when the issue is merely that it can't (yet) fetch the status from k3s?
Can we not fix that Info() to return something more accurate?

If it returns an error or HALTED then this function will set an error, and that error might be propagated to the controller, which would be confusing if the task is slow at starting, or is starting on some other node.

So I think this check is in the wrong place.

Contributor Author

The current logic in kubevirt.go is:

  1. if the app is not found, it returns status "" and an error: logError("getVMIStatus: No VMI %s found with the given nodeName %s", repVmiName, nodeName)
  2. if the app is found on another node in the cluster, it returns status "NonLocal" with no error
  3. if the app is found on this node, it returns whatever running status kubernetes reports

With the above conditions, if an error is returned the status is set to types.Broken.
We further map the status string with this mapping:

// Use few states  for now
var stateMap = map[string]types.SwState{
    "Paused":     types.PAUSED,
    "Running":    types.RUNNING,
    "NonLocal":   types.RUNNING,
    "shutdown":   types.HALTING,
    "suspended":  types.PAUSED,
    "Pending":    types.PENDING,
    "Scheduling": types.SCHEDULING,
    "Failed":     types.FAILED,
}

This is a good point: the error condition when the app is not running would be confusing. I can change condition 1) above to return 'Scheduling', which is a currently defined state and roughly reflects the kubernetes app status.
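For illustration, a self-contained sketch of that mapping with condition 1) changed to 'Scheduling' as proposed (simplified stand-ins, not the actual kubevirt.go code; the real state values live in pillar's types package):

package main

import "fmt"

type SwState int

const (
	PAUSED SwState = iota
	RUNNING
	HALTING
	PENDING
	SCHEDULING
	FAILED
	BROKEN
)

// stateMap mirrors the mapping quoted above.
var stateMap = map[string]SwState{
	"Paused":     PAUSED,
	"Running":    RUNNING,
	"NonLocal":   RUNNING,
	"shutdown":   HALTING,
	"suspended":  PAUSED,
	"Pending":    PENDING,
	"Scheduling": SCHEDULING,
	"Failed":     FAILED,
}

// effectiveState sketches the proposed behavior: a VMI/Pod not (yet) found
// in the cluster maps to SCHEDULING rather than an error, one running on
// another cluster node maps to RUNNING via "NonLocal", and anything else
// uses the kubernetes-reported phase (falling back to BROKEN, a
// simplification of the error handling discussed above).
func effectiveState(found bool, onThisNode bool, k8sPhase string) SwState {
	switch {
	case !found:
		return stateMap["Scheduling"]
	case !onThisNode:
		return stateMap["NonLocal"]
	default:
		if st, ok := stateMap[k8sPhase]; ok {
			return st
		}
		return BROKEN
	}
}

func main() {
	fmt.Println(effectiveState(false, false, "") == SCHEDULING)    // true: not found -> Scheduling
	fmt.Println(effectiveState(true, false, "Running") == RUNNING) // true: on another node -> NonLocal -> Running
	fmt.Println(effectiveState(true, true, "Failed") == FAILED)    // true: local phase maps directly
}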

@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from 1aed95e to 2e3e102 on November 7, 2024 18:50
@github-actions github-actions bot requested a review from eriknordmark November 7, 2024 18:51
@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from 2e3e102 to 8a13e80 on November 7, 2024 21:24
@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from 8a13e80 to b7140d2 on December 3, 2024 19:11
@github-actions github-actions bot requested a review from milan-zededa December 3, 2024 19:11
@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from b7140d2 to 2453125 on December 10, 2024 05:17
@naiming-zededa
Contributor Author

I have rebased and resolved the conflicts. Please review and see if the PR is ok.


codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 20.90%. Comparing base (a73c78d) to head (2453125).
Report is 45 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4407      +/-   ##
==========================================
- Coverage   20.93%   20.90%   -0.03%     
==========================================
  Files          13       13              
  Lines        2895     2894       -1     
==========================================
- Hits          606      605       -1     
  Misses       2163     2163              
  Partials      126      126              


@@ -3603,3 +3703,16 @@ func lookupCapabilities(ctx *domainContext) (*types.Capabilities, error) {
}
return &capabilities, nil
}

func getnodeNameAndUUID(ctx *domainContext) error {
Contributor

Suggested change
func getnodeNameAndUUID(ctx *domainContext) error {
func (ctx *domainContext) retrieveNodeNameAndUUID() error {

as this is not a getter method

Contributor Author

Ok.


// Add pod non-image volume disks
if len(diskStatusList) > 1 {
leng := len(diskStatusList) - 1
Contributor

g stands for?

Contributor Author

changed to 'length'.

@christoph-zededa
Contributor

I have rebased and resolved the conflicts. Please review and see if the PR is ok.

I don't see any tests.

@naiming-zededa
Contributor Author

I have rebased and resolved the conflicts. Please review and see if the PR is ok.

I don't see any tests.

Added hypervisor/kubevirt_test.go now.
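For reference, a hypothetical table-driven test in the style such a file might use, exercising the simplified effectiveState sketch from earlier in this thread (the actual kubevirt_test.go added in this PR may look quite different):

package main

import "testing"

// Hypothetical test against the simplified sketch above, not the real
// hypervisor package internals.
func TestEffectiveState(t *testing.T) {
	cases := []struct {
		name       string
		found      bool
		onThisNode bool
		phase      string
		want       SwState
	}{
		{"not yet scheduled", false, false, "", SCHEDULING},
		{"running on another node", true, false, "Running", RUNNING},
		{"running locally", true, true, "Running", RUNNING},
		{"failed locally", true, true, "Failed", FAILED},
	}
	for _, c := range cases {
		if got := effectiveState(c.found, c.onThisNode, c.phase); got != c.want {
			t.Errorf("%s: got %v, want %v", c.name, got, c.want)
		}
	}
}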

@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from acd9962 to 1129c28 on December 11, 2024 06:24
- change VMI to VMI ReplicaSet for kubernetes
- change Pod to Pod ReplicaSet for containers
- change functions handling replicaset names in services
- subscribe EdgeNodeInfo in domainmgr, zedmanager to get
  node-name for cluster
- add Designated Node ID to several structs for App
- not to delete domain from kubernetes if not a Designated
  App node
- parse config for EdgeNodeClusterConfig in zedagent
- handle ENClusterAppStatus publication in zedmanager in
  multi-node clustering case
- zedmanager handling effective-activation include ENClusterAppStatus
- kubevirt hypervisor changes to handle VMI/Pod ReplicaSets

Signed-off-by: Naiming Shen <[email protected]>
@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from 1129c28 to 7db2529 on December 13, 2024 02:54
@@ -3603,3 +3709,16 @@ func lookupCapabilities(ctx *domainContext) (*types.Capabilities, error) {
}
return &capabilities, nil
}

func (ctx *domainContext) retrieveNodeNameAndUUID() error {
Contributor

This is our own node's nodename, right?
Can't it retrieve that at domainmgr startup?

And why is "UUID" part of the name of the function?

Contributor Author

I used to do this part at the startup of domainmgr, but 'waitEdgeNodeInfo' only waits for a maximum of 5 minutes, so I could not rely on it. You pointed this out in a previous review of this PR, and I removed it. So, good point, I should also remove this function here.
The UUID was part of the function name in the code prior to this PR; I will remove that.

if err := hyper.Task(status).Cleanup(status.DomainName); err != nil {
log.Errorf("failed to cleanup domain: %s (%v)", status.DomainName, err)
// in cluster mode, we can not delete the pod due to failing to get app info
if !ctx.hvTypeKube {
Contributor

And I get the feeling that most or all of these hvTypeKube checks should not be in domainmgr but be inside the kubevirt hypervisor package.

If, for instance, zedmanager is telling domainmgr to shut down or delete a task, and for kubevirt this isn't needed except on some designated node, can't the designated-or-not check be done inside the hypervisor/kubevirt code?

Contributor Author

I can move this into the hypervisor functions. Mainly we need to pass in whether this domain config is being deleted or not (DomainStatus.DomainConfigDeleted), so we need to change the API from '.Delete(domainName string)' to '.Delete(status *types.DomainStatus)' (which also changes the signature for the other hypervisors).
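As a rough illustration of that API change (a trimmed, hypothetical method set; the real hypervisor interfaces in pillar have many more methods):

package main

// Trimmed stand-in for the fields the kubevirt implementation would need.
type DomainStatus struct {
	DomainName          string
	IsDNidNode          bool
	DomainConfigDeleted bool
}

// Before: the delete path only receives the task name.
type taskBefore interface {
	Delete(domainName string) error
}

// After: passing the full status lets the kubevirt hypervisor decide
// internally whether this node may remove the config from the cluster,
// while the other hypervisors simply keep using status.DomainName.
type taskAfter interface {
	Delete(status *DomainStatus) error
}

func main() {}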

if zcfgCluster == nil {
log.Functionf("parseEdgeNodeClusterConfig: No EdgeNodeClusterConfig, Unpublishing")
pub := ctx.pubEdgeNodeClusterConfig
items := pub.GetAll()
Contributor

Is there only a global key for this? If so there is no need to call GetAll - just call Unpublish("global")

Otherwise call GetAll and walk the set of returned objects and unpublish each of their keys.
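A sketch of the two options described above, against a minimal stand-in for the publication handle (the real pillar pubsub API may differ in detail):

package main

// Minimal stand-in for the bits of the publication API used here.
type Publication interface {
	GetAll() map[string]interface{}
	Unpublish(key string) error
}

// Option 1: the config is only ever published under the singleton "global"
// key, so unpublishing that one key is enough.
func unpublishSingleton(pub Publication) {
	_ = pub.Unpublish("global")
}

// Option 2: there may be multiple keys, so walk everything currently
// published and unpublish each key.
func unpublishAll(pub Publication) {
	for key := range pub.GetAll() {
		_ = pub.Unpublish(key)
	}
}

func main() {}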

Contributor Author

Ok.

handleENClusterAppStatusImpl(ctx, key, nil)
}

func handleENClusterAppStatusImpl(ctx *zedmanagerContext, key string, status *types.ENClusterAppStatus) {


This function should be updated from our POC branch. Or I can do it as part of my PR for failover handling.


Actually I suggest amending this commit with the latest handleclusterapp.go and applogs.go from the POC branch.
That way they can be reviewed as a unit and my follow-up PRs need not include those files.

@@ -154,7 +154,7 @@ func (z *zedkube) checkAppsStatus() {
}

pub := z.pubENClusterAppStatus
- stItmes := pub.GetAll()
+ stItems := pub.GetAll()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this function from the POC branch, it changed a lot.

Contributor Author

@zedi-pramodh as discussed, since some of the boolean types changed, updating the POC code in applog.go would require updating many other files. I'll leave it to your later PR to add those changes.

- changed the function to retrieveDeviceNodeName() and call it only at
  the start of domainmgr Run()
- removed the ctx.hvTypeKube and status.IsDNidNode checks in the
  domainmgr.go code; also removed status.DomainConfigDeleted. We now
  rely on the normal domain delete/cleanup handling work flow
- fixed a bug where nodeName contained an underscore, which is not
  allowed in kubernetes names
- changed the zedmanager/handleclusterstatus.go code to the PoC code
  base, and commented out one line for a later PR to handle
- implemented the scheme where, if kubevirt can not contact the
  kubernetes API-server or the cluster does not have the POD/VMI
  scheduled yet, we now return the 'Unknown' status. It keeps a
  starting 'unknown' timestamp per application
- also, if the 'unknown' status lasts longer than 5 minutes, it changes
  into 'Halt' status back to domainmgr (see the sketch after this
  commit message)
- updated 'zedkube.md' section 'Handle Domain Apps Status in domainmgr'
  for the above behavior

Signed-off-by: Naiming Shen <[email protected]>
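A self-contained sketch of the 'Unknown' grace-period scheme described in this commit message (names and layout are illustrative assumptions, not the PR's actual code):

package main

import (
	"fmt"
	"sync"
	"time"
)

// unknownSince tracks, per application, when kubevirt first started
// reporting the 'Unknown' state (API-server unreachable or POD/VMI not
// scheduled yet).
var (
	unknownMu    sync.Mutex
	unknownSince = make(map[string]time.Time)
)

// unknownGrace mirrors the 5-minute window described above.
const unknownGrace = 5 * time.Minute

// resolveUnknown returns "Unknown" while the grace period is running and
// "Halt" once an app has been in the unknown state for longer than
// unknownGrace, which domainmgr then treats as a halted task.
func resolveUnknown(appName string, now time.Time) string {
	unknownMu.Lock()
	defer unknownMu.Unlock()
	start, ok := unknownSince[appName]
	if !ok {
		unknownSince[appName] = now // remember when 'Unknown' started
		return "Unknown"
	}
	if now.Sub(start) > unknownGrace {
		return "Halt"
	}
	return "Unknown"
}

// clearUnknown forgets the timestamp once a real status is reported again.
func clearUnknown(appName string) {
	unknownMu.Lock()
	defer unknownMu.Unlock()
	delete(unknownSince, appName)
}

func main() {
	now := time.Now()
	fmt.Println(resolveUnknown("app1", now))                    // Unknown (timer starts)
	fmt.Println(resolveUnknown("app1", now.Add(6*time.Minute))) // Halt (grace period exceeded)
	clearUnknown("app1")
}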
@naiming-zededa naiming-zededa force-pushed the naiming-cluster-hypervisor branch from 3e80fde to ee49db4 on December 28, 2024 01:58