
design: tinkerbell v1alpha2 API #27

Merged
14 commits merged into tinkerbell:main from feat/crd-refactor on Mar 15, 2023

Conversation

@chrisdoherty4 (Member) commented Feb 21, 2023

Implements #14.

@chrisdoherty4 changed the title from "Add tinkerbell CRD refactor design" to "Tinkerbell CRD Refactor Design" on Feb 21, 2023
@chrisdoherty4 marked this pull request as draft on February 21, 2023 17:18
@chrisdoherty4 changed the title from "Tinkerbell CRD Refactor Design" to "design: Tinkerbell CRD Refactor" on Feb 21, 2023
@chrisdoherty4 force-pushed the feat/crd-refactor branch 14 times, most recently from 9e067f7 to 4307489, on February 21, 2023 23:51
@chrisdoherty4 marked this pull request as ready for review on February 21, 2023 23:51
@chrisdoherty4 force-pushed the feat/crd-refactor branch 5 times, most recently from 663d6b2 to 08d6e52, on February 22, 2023 00:07
**Non-goals**

- To change the existing relationship between Tink and Rufio.
- To introduce additional object status data typically found on the `.Status` field.
@moadqassem (Member) commented Feb 22, 2023

Is this, however, coming in a different proposal on how to handle the status of the CRs?

@chrisdoherty4 (Member, author) commented Feb 22, 2023

Would you mind elaborating on 'how to handle status'? This proposal outlines the status aspects that are relevant to the existing Tinkerbell capability but not any additional status tracking.

Perhaps this non-goal is unclear or unnecessary.

@moadqassem (Member) commented Feb 22, 2023

I guess this was my bad as I didn't elaborate on what I meant 😅. What I meant is: naturally, each Kubernetes object has a status tied to it that specifies the state of that object and how it was reconciled during its lifecycle. For the Hardware CRD there is no Status field, whereas the Workflow has one. So I wasn't sure whether a Status field for the Hardware CRD will be added in further work, or not added at all.

Even though one might think that we don't need a Status field/object for Hardware (due to its functionality), I think having a status is still valid, so we can track things like last updated, error messages while reconciling, conditions, phases, etc.

@chrisdoherty4 (Member, author) commented Feb 22, 2023

If I'm not mistaken, we can add it with relative ease in the future for the Hardware type?

I think we've struggled to get reliable status information for Hardware, hence why it doesn't exist here. This stems from the fact that the physical hardware isn't really inspectable until we get the OSIE onto it, at which point we have a client we can play with. It might be useful to have status information about reaching that point in the future. Again, I suspect we can add the status field when that time comes.

Is there a good reason why it couldn't be added later?

A Member commented

TL;DR: Yeah, it shouldn't be any problem.

Actually, one important thing about Hardware status is integration with other tooling like Netbox/Nautobot ;) (but that's a tale for another day 😂). Regarding this:

> I think we've struggled to get reliable status information for Hardware, hence why it doesn't exist here. This stems from the fact that the physical hardware isn't really inspectable until we get the OSIE onto it, at which point we have a client we can play with.

In general that's true; however, the status isn't only about the underlying resource (in this case the server), it also adds some details about how the object itself has been, and is being, reconciled. In the Status object you might have something called conditions, and those would reflect what is happening to the underlying object. And if the object is being reconciled by different controllers, the status object can give some insight into what happened. Example: someone added a wrong MAC address, or something went wrong during initialization of that hardware; the Status object should give insight into why this Hardware wasn't reconciled correctly.

@chrisdoherty4 (Member, author) commented

Yes, absolutely. Principally, I like the thought of having some sort of hardware status information, and it could prove useful for everything up to launching the OSIE, although that's difficult to do in control loops. There isn't a Hardware reconciler today, though, and this proposal won't introduce one (perhaps something to call out explicitly).


```go
// Hardware is a logical representation of a machine that can execute Workflows.
type Hardware struct {
```
A Member commented

Are these representations only for the sake of the proposal? Meaning, would the API look exactly the same as this struct? The reason I'm asking is that I believe it's really important to comply with Kubernetes semantics and standards, for example naming, composition, pointers, etc.

@chrisdoherty4 (Member, author) commented

Structurally, I expect these data structures to map exactly to code. Could you elaborate/link to the standards you're referencing?

A Member commented

Yeah, sure. For example, metav1.TypeMeta and metav1.ObjectMeta for all CRDs. The Spec declaration should be like:

Spec HardwareSpec `json:"spec,omitempty"`

More of these conventions can be found here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md
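
For illustration, a minimal sketch of the conventional object shape being referenced, assuming the Hardware kind from this proposal; everything beyond the embedded metadata and the Spec wrapper is a placeholder rather than the proposal's final API.

```go
package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hardware is a logical representation of a machine that can execute Workflows,
// shaped per the Kubernetes API conventions linked above.
type Hardware struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec HardwareSpec `json:"spec,omitempty"`
}

// HardwareSpec holds the user-declared description of the machine.
// The single field here is a placeholder for the sketch.
type HardwareSpec struct {
	NetworkInterfaces map[string]NetworkInterface `json:"networkInterfaces,omitempty"`
}

// NetworkInterface is a placeholder type for the sketch.
type NetworkInterface struct{}
```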

@chrisdoherty4 (Member, author) commented Feb 22, 2023

I see. I was wrong when I said "map exactly to code", then, because I intended to add that. I'll add it to the proposal for clarity.

@chrisdoherty4 (Member, author) commented

Resolved.

design/14_tinkerbell_crd_refactor/proposal.md (outdated, resolved)
type NetworkInterfaces map[MAC]NetworkInterface

// MAC is a Media Access Control address.
type MAC string
A Member commented

Why should this field and the other fields below (StorageDevice, KernelParam) be distinct types?

@chrisdoherty4 (Member, author) commented Feb 22, 2023

A few reasons.

  1. So we can use CEL type validation (as sketched below).
  2. To be explicit about expectations. Data structures such as map[string]string are difficult to establish expectations with, which has been a contributor to poor user experience.
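
For illustration, a minimal sketch of the kind of per-type validation a distinct type enables; the markers and patterns below are assumptions for the example, not the proposal's final rules.

```go
// MAC is a Media Access Control address. Declaring a named type lets a
// validation rule be attached once and reused everywhere the type appears.
// The pattern is illustrative only.
// +kubebuilder:validation:Pattern=`^([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$`
type MAC string

// StorageDevice is a storage device path, e.g. /dev/sda (illustrative rule).
// +kubebuilder:validation:Pattern=`^/dev/.+$`
type StorageDevice string
```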

// shared_volume:/data
//
// See https://docs.docker.com/storage/volumes/ for additional details
type Volume string
A Member commented

For Volume, I would assume that file-system-based volumes are not enough, especially in a Kubernetes context. What about making Volume a struct type where we can have different volume targets? E.g. a volume from a file path, ConfigMap, Secret, or source URL.

@chrisdoherty4 (Member, author) commented

The volume doesn't relate to Kubernetes volumes; it pertains to Actions run on the OSIE.

A Member commented

I see, so this is just passed to the Docker container and only applies when running on the node (OSIE) directly.

type Volume string

// ActionName is unique name within the context of a workflow.
type ActionName string
A Member commented

Does this really need to be a type? Wouldn't a plain string be enough where it is used?

@chrisdoherty4 (Member, author) commented

It exists to facilitate the key in the RenderedActions field of the WorkflowStatus. It helps to establish a clear relationship between the two, although that's arguably an implementation detail of WorkflowStatus.

I'm leaning toward removing it; what do you think?

A Member commented

Understood. I mean, documenting the RenderedActions field and mentioning that the key represents the action name should be enough IMO, so we can remove it.

type RenderedActions map[ActionName]ActionStatus

// ActionStatus describes status information about an action.
type ActionStatus struct {
@moadqassem (Member) commented Feb 22, 2023

I would strongly recommend adding the conditions framework here instead of using explicit fields like State. Something similar to this https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_types.go

@chrisdoherty4 (Member, author) commented Feb 22, 2023

I flip-flopped on whether to use conditions. Here's my train of thought:

  1. Conditions are meant to be observations, not state machines in and of themselves. I think this is a common misinterpretation and misuse. Cluster API has a good practical example with its ControlPlaneReady condition. ControlPlaneReady transitions to True when the first control plane node is up, but it can also transition back to False (at least that's my understanding); it is an observation of the control plane, not a transition from one predefined state to another. Conditions can be monotonic but, again, that doesn't mean they should represent state machines. The guidance is that they shouldn't represent state machines.

  2. Adding conditions to actions creates a set of nested conditions, because the Workflow would have conditions which are a summation of ActionStatus conditions. I don't think I'm totally opposed to this approach, but I fear it's confusing given how established conditions are in Kubernetes and the fact that they generally refer to the Kind, not to nested data structures of a Kind. We could add conditions to the WorkflowStatus and derive them from ActionStatus information; that feels more appropriate given all the other guidance, but they would be complementary to state machine fields as opposed to the actual representation of the state machine.

  3. Conditions should generally be derivable from information already available on the resource. That is by no means a hard and fast rule, but encoding state into conditions that need to be populated at reconcile time means state expectations are hard to rationalize when looking at the API only (documentation may clear this up; I haven't given it much thought). For example, we can clearly see the predefined states via the State type; conditions mask that, and users are dependent on documentation to understand workflow state semantics.

  4. Conditions have a predefined recommended format which doesn't necessarily complement what we're doing. We could break from this in the same way Cluster API has (they dropped the ObservedGeneration field), and that's probably fine, but it contributes to the turmoil already shrouding conditions. We would need to very clearly, as Cluster API has, define our condition expectations.


What does it look like if we used conditions? I would probably forgo most of the 'states' we have in favor of the following condition types.

  1. Started: A monotonic condition indicating the workflow has begun.
  2. Succeeded: A monotonic condition indicating the workflow succeeded. Unknown indicates it hasn't finished and False indicates it failed. On False I would expect a Reason and Message elaborating on why it didn't succeed.

I'm not sure this sufficiently represents what we want.
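
For illustration, a rough sketch of what those two condition types could look like on a workflow whose action failed, assuming the standard metav1.Condition type; the reason and message values are invented for the example.

```go
package v1alpha2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// exampleConditions shows a hypothetical Workflow that started but did not
// succeed. Condition types, reasons, and messages are illustrative only.
var exampleConditions = []metav1.Condition{
	{
		Type:               "Started",
		Status:             metav1.ConditionTrue,
		Reason:             "WorkflowScheduled",
		Message:            "the workflow was dispatched to the agent",
		LastTransitionTime: metav1.Now(),
	},
	{
		Type:               "Succeeded",
		Status:             metav1.ConditionFalse,
		Reason:             "ActionFailed",
		Message:            "the stream-image action exited non-zero",
		LastTransitionTime: metav1.Now(),
	},
}
```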

@chrisdoherty4 (Member, author) commented

I think this might be useful to document in the proposal's Rationale section, whatever we decide.

A Member commented

These are quite interesting thoughts, I would say. OK, maybe we can move the usage of conditions into another PR/proposal? One thing to keep in mind, though, is that we should distinguish between what is relevant to the CR itself and how it is being reconciled, and how the underlying resource (machine) eventually turned out.

I have to say, though, that I have seen multiple cases where the status of an object is derived from the status of another object. I mean, for now I guess we can roll without it, and then we can iterate in the future on what needs to be done here :)

@chrisdoherty4 (Member, author) commented

I've added the detail to the proposal's rationale section and said we will tackle conditions as a separate proposal.

@chrisdoherty4 (Member, author) commented Feb 27, 2023

@moadqassem I thought some more on conditions, here's what I've come up with.

  1. We'll have the Reason and Message fields on ActionStatus as detailed in the proposal already (probably changing Reason to a string so users know it's something they can define).
  2. As part of this proposal, we'll define how action developers can propagate a Reason and Message from within actions such that it populates the ActionStatus fields.
  3. We'll remove Reason and Message from WorkflowStatus and replace it with a conditions data structure (the conditions will include a Reason and Message).
  4. We'll define a WorkflowSucceeded condition type. The values of Reason and Message will be propagated from the failed ActionStatus, providing the summarized observation of the workflow.

This does mean we'll need to define conditions as part of this proposal. Given we know we want to provide ways for users to propagate information from within actions on error, and that we want to trend toward conditions anyway, it feels appropriate to just go ahead and define our conditions setup as part of this proposal.

Using conditions on the WorkflowStatus to summarize action/workflow state feels more extensible and avoids duplicating the information field for field between ActionStatus and WorkflowStatus.

Conditions will likely take the same form as Cluster API's.

The TL;DR on Reason and Message propagation from within actions: developers can write to paths such as /tinkerbell/reason and /tinkerbell/message, which we'll read and propagate from client to server. This is at least the initial proposal, inspired by the Kubernetes mechanism for containers/pods to indicate why they exited non-zero.

There are probably other observations such as WorkflowStarted that might be useful to have as conditions.


Side note, I'm also proposing the Reason type be dropped in favor of a string. This better complements the idea that action developers can define its value.

I'm going to update the proposal with this change either way because it doesn't seem to sit right currently.
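
For illustration, a rough sketch of the shape this plan implies, assuming metav1.Condition is used for conditions; only the fields under discussion are shown, and the names are a guess at intent rather than the proposal's final API.

```go
package v1alpha2

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ActionStatus describes status information about an action. Reason and
// Message could be populated from files the action writes (e.g. the
// /tinkerbell/reason and /tinkerbell/message paths described above).
type ActionStatus struct {
	Reason  string `json:"reason,omitempty"`
	Message string `json:"message,omitempty"`
}

// WorkflowStatus summarizes the workflow. Per-action Reason/Message values
// surface through conditions such as WorkflowSucceeded rather than being
// duplicated field for field on the workflow itself.
type WorkflowStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```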

@chrisdoherty4 (Member, author) commented

After discussing with Jacob, I decided to shoot for this ahead of tomorrow's community call. The proposal has been updated.


### Webhooks

We will introduce a set of basic webhooks for validating each CRD. The webhooks will only validate the `v1alpha2` API version. They will be implemented as part of the `tink-controller` component.
A Member commented

I know this might be controversial 😅, but can we please forget about webhooks and use CEL instead? Webhooks have been a huge pain to work with and debug, and I would say this is a nice opportunity to integrate CEL instead since this is still the beginning of the project. For more details about CEL, PTAL at https://kubernetes.io/blog/2022/09/23/crd-validation-rules-beta/

@chrisdoherty4 (Member, author) commented Feb 22, 2023

We will use CEL (I know it's not clear in the proposal and I'd be happy to add specific annotations to clarify), but it doesn't capture all validation. For example, CEL won't be able to verify that all Hardware submitted have unique MAC fields.
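
For illustration, a minimal sketch of the per-object CEL validation being referenced, expressed as a kubebuilder XValidation marker; the specific rule is an assumption for the example. Cross-object checks, such as MAC uniqueness across all Hardware, are exactly what still requires a webhook.

```go
package v1alpha2

// NetworkInterface is a placeholder type for this sketch.
type NetworkInterface struct{}

// HardwareSpec sketch: CEL can enforce invariants within a single object,
// e.g. requiring at least one network interface. It cannot see other
// Hardware objects, so uniqueness of MACs across objects needs a webhook.
// +kubebuilder:validation:XValidation:rule="size(self.networkInterfaces) > 0",message="at least one network interface is required"
type HardwareSpec struct {
	NetworkInterfaces map[string]NetworkInterface `json:"networkInterfaces"`
}
```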

```go
// Hardware is a logical representation of a machine that can execute Workflows.
type Hardware struct {
	HardwareSpec
```
@chrisdoherty4 (Member, author) commented

There is nothing in the Kubernetes guidelines that says we must have a Spec or Status field. I'm wondering if we should just move all the HardwareSpec fields up to the Hardware object.

Another thread is discussing the possibility of adding a Status field which, if that happened, would necessitate the HardwareSpec, but I don't think we have any intention of adding hardware status information any time soon, so I suspect this would be fine.

A Member commented

I am not quite sure that there is anything written saying a CRD must have a field called Spec or Status; however, if you look at Kubernetes CRDs you will find that they do have such fields, which makes other Kubernetes users familiar with our CRDs and lets them know where to look whenever they need to adjust something.

With that being said, we can explore this path in the future and see how those fields can be used during the lifecycle of those CRs.

@chrisdoherty4 force-pushed the feat/crd-refactor branch 2 times, most recently from 5d64826 to 7ff2ad4, on February 24, 2023 23:17
@moadqassem (Member) commented

@jacobweinstock not sure if someone else needs to take a look at this to approve 😅 also not sure if I can approve/merge things


## Migrating from v1alpha1 to v1alpha2

Given the relative immaturity of the Tinkerbell API we will not provide any migration tooling from `v1alpha1` to `v1alpha2`. Users will be required to manually convert `Hardware` and `Template` resources. `Workflow` resources are considered ephemeral and can generally be deleted.
A Member commented

I think we should write some very rudimentary script or similar to help with this process. I would recommend this only for hardware and templates, not workflows.
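
For illustration, a minimal sketch of what such a rudimentary helper could start from, assuming the tinkerbell.org API group and the v1alpha1 hardware resource name; it only dumps existing objects as YAML for manual conversion and makes no attempt at field mapping.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/yaml"
)

func main() {
	// Load the local kubeconfig (assumed to live at the default location).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Hypothetical GVR for the old API; adjust if the group or resource differs.
	gvr := schema.GroupVersionResource{Group: "tinkerbell.org", Version: "v1alpha1", Resource: "hardware"}

	list, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Print each object as YAML so it can be hand-edited into the v1alpha2 shape.
	for i := range list.Items {
		b, err := yaml.Marshal(list.Items[i].Object)
		if err != nil {
			panic(err)
		}
		fmt.Printf("---\n%s", b)
	}
}
```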

@chrisdoherty4 (Member, author) commented

Sounds good. That should be simple enough.

@chrisdoherty4 (Member, author) commented

Should be resolved.

@chrisdoherty4 changed the title from "design: Tinkerbell CRD Refactor" to "design: tinkerbell crd refactor" on Mar 10, 2023
@chrisdoherty4 merged commit 1e6c92a into tinkerbell:main on Mar 15, 2023
@chrisdoherty4 changed the title from "design: tinkerbell crd refactor" to "design: tinkerbell v1alpha2 API" on Mar 17, 2023