Bug: Inventory updates should tolerate drift (and overwrite it) #559

karlkfi · 2022-03-02T04:33:22Z

Right now, inventory updates may return a conflict error from Kubernetes when the object has been modified remotely since it was last read. The inventory client should detect this (apierrors.IsConflict(err)) and retry: Get the object again (to pick up the latest ResourceVersion), then Update.

Example retry code:

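import (
	"context"
	"fmt"
	"time"
)

// retriable performs one attempt and reports whether it should be retried.
type retriable func(ctx context.Context) (retry bool, err error)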

func retryWithBackoff(ctx context.Context, timeout time.Duration, fn retriable) error {
	var err error
	var retry bool
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	delay := 1 * time.Second
	for {
		// attempt the operation
		retry, err = fn(ctx)
		if !retry {
			return err
		}

		// wait until delay or timeout
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return fmt.Errorf("timed out after retrying for %v: %w", timeout, err)
		case <-timer.C:
			// continue
		}
		// retry backoff
		delay = delay * 2
	}
}

Example usage:

	// attempt to update status until timeout
	ctx := context.TODO()
	timeout := 1 * time.Minute
	return retryWithBackoff(ctx, timeout, func(ctx context.Context) (retry bool, err error) {
		// Get the object to get the latest ResourceVersion.
		latestObj, err := resource.Get(ctx, obj.GetName(), metav1.GetOptions{TypeMeta: meta})
		if err != nil {
			return false, fmt.Errorf("failed to get inventory status from cluster: %w", err)
		}
		// Ignore any status changes made remotely.
		// This update will replace them.
		obj.SetResourceVersion(latestObj.GetResourceVersion())

		_, err = resource.UpdateStatus(ctx, obj, metav1.UpdateOptions{TypeMeta: meta})
		if err != nil {
			// retry if conflict
			return apierrors.IsConflict(err), fmt.Errorf("failed to write updated inventory status to cluster: %w", err)
		}
		return false, nil
	})

Another option is to use https://github.com/flowchartsman/retry which is nice and generic. gcloud and client-go also have retry libs.

karlkfi · 2022-03-02T04:34:48Z

The main client causing drift right now is the Config Sync resource-group-controller, which updates the ResourceGroup (inventory) status.

k8s-triage-robot · 2022-05-31T05:20:18Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2022-06-30T06:14:59Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

karlkfi · 2022-07-25T21:16:58Z

/remove-lifecycle rotten
/lifecycle frozen

k8s-ci-robot added the lifecycle/stale label May 31, 2022

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 30, 2022

k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Jul 25, 2022