-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
design: tinkerbell v1alpha2 API #27
design: tinkerbell v1alpha2 API #27
Conversation
9e067f7
to
4307489
Compare
663d6b2
to
08d6e52
Compare
Signed-off-by: Chris Doherty <[email protected]>
08d6e52
to
7ba306c
Compare
**Non-goals** | ||
|
||
- To change the existing relationship between Tink and Rufio. | ||
- To introduce additional object status data typically found on the `.Status` field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this coming however in a different proposal of how to handle status of the CRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind elaborating on 'how to handle status'? This proposal outlines the status aspects that are relevant to the existing Tinkerbell capability but not any additional status tracking.
Perhaps this non-goal is unclear or unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this was my bad as I didn't elaborate more on what I meant 😅. What I meant is, naturally, each k8s has a status tied to it to specify what is the status of that object and how it was reconciled during it is lifecycle. For the Hardware CRD, there was no Status field found however for the workflow there is one. So I wasn't sure adding a status field for the Hardware CRD will be added in a further work, or it is not added at all.
Even though one might thing that we don't need a Status filed/object for Hardware(due to it's functionality) I guess having a status is still valid, so we can track something like last updated, error messages reconciling, conditions, phases, etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken, we can add it with relative ease in the future for the Hardware
type?
I think we've struggled for reliable status information Hardware
hence why it doesn't exist here. This stems from the fact the physical hardware isn't really inspectable until we get the OSIE onto it at which point we have a client we can play with. It might be useful to have status information about reaching that point in the future. Again, I suspect we can add the status field when that time comes though.
Is there a good reason to why it couldn't be added later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: Yeah it shouldn't be any problem
Actually, one important thing about Hardware status is, integration with other tooling like Netbox/Nautobot ;) (but that's a tale for another day 😂 ). Regarding this:
I think we've struggled for reliable status information Hardware hence why it doesn't exist here. This stems from the fact the physical hardware isn't really inspectable until we get the OSIE onto it at which point we have a client we can play with
In general that's true, however, the status isn't only about the underlying resource(in this case the server) it also to add some details about how the object itself has been and being reconciled. So in the Status object you might have something called conditions and those would reflect what is happening to the underlying object, But if the object is being reconciled by different controllers, the status object can give some insights about what happened. Example: Someone added a wrong MAC address or something was wrong during initialization of that hardware, Status object should give insights why this Hardware wasn't reconciled correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, absolutely. Principally, I like the thought of having some sort of hardware status information and it could prove useful for everything up-to launching the OSIE although that's difficult to do in control loops. There isn't a Hardware
reconciler today though, and this proposal won't introduce one (perhaps something to call out explicitly).
|
||
```go | ||
// Hardware is a logical representation of a machine that can execute Workflows. | ||
type Hardware struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those representation only for the sake proposal? meaning, would the API looks exactly the same as this struct? The reason why I am asking is, I believe it should be really important to comply with k8s semantics and standards. An example for that is, naming, compositions, pointers, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Structurally, I expect these data structures to map exactly to code. Could you elaborate/link to the standards you're referencing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sure, for example the metav1.TypeMeta
and metav1.ObjectMeta
for all CRDs. Spec declaration should be like:
Spec HardwareSpec `json:"spec,omitempty"`
More of these conventions can be found here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I was wrong when I said "map exactly to code" then because I intended on adding that. I'll add it to the proposal for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
type NetworkInterfaces map[MAC]NetworkInterface | ||
|
||
// MAC is a Media Access Control address. | ||
type MAC string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this field and the other fields below(StorageDevice
, KernelParam
) should be a distinct types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few reasons.
- So we can use CEL type validation.
- To be explicit around expectations. Data structures such as
map[string]string
are difficult to establish expectations with, which has been a contributor to poor user experience.
// shared_volume:/data | ||
// | ||
// See https://docs.docker.com/storage/volumes/ for additional details | ||
type Volume string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Volume I would assume that file system based volumes is not enough especially in k8s context. What about have a new volume of a type struct that we have in there different volume targets? e.g: volume from a file path, configmap, secret or source url.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The volume doesn't relate to Kubernetes volumes; it pertains to Actions run on the OSIE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, so this should be just reflected just to the docker container and should only be running on the node(OISE) directly.
type Volume string | ||
|
||
// ActionName is unique name within the context of a workflow. | ||
type ActionName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really need to be a type? Wouldn't string representation is enough where it is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists to facilitate the key in the RenderedActions
field of the WorkflowStatus
. It helps to establish a clear relationship between the 2 although that's arguably an implementation detail of WorkflowStatus
.
I'm leaning toward removing it, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I mean, documenting the field RenderedActions
and mention that, the key represents the action name should be enough IMO, thus we can remove it.
type RenderedActions map[ActionName]ActionStatus | ||
|
||
// ActionStatus describes status information about an action. | ||
type ActionStatus struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would strongly recommend adding the conditions framework here instead of using explicit fields like State. Something similar to this https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/condition_types.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I flip-flopped on whether to use conditions. Here's my train of thought:
-
Conditions are meant to be observations, not state machines in and of themselves. I think this is a common misinterpretation and misuse. Cluster API has a good practical example with its
ControlPlaneReady
condition.ControlPlaneReady
transition toTrue
when the first control plane node is up, but it can also transition back toFalse
(at least that's my understanding); it is an observation of the control plane, not a transition of 1 predefined state to another. Conditions can be monotonic but, again, that doesn't mean they should represent state machines. The guidance is that they shouldn't represent state machines. -
Adding conditions to actions creates a set of nested conditions because the
Workflow
would have conditions which are a summation ofActionStatus
conditions. I don't think I'm totally opposed to this approach, but I fear its confusing given how established conditions are in Kubernetes and the fact they generally refer to the Kind, not nested data structures of a Kind. We could add conditions to theWorkflowStatus
and derive them fromActionStatus
information; that feels more appropriate given all other guidance but they would be complimentary to state machine fields as opposed to the actual representation of the state machine. -
Conditions should generally be derivable from information already available on the resource. That is by no means a hard and fast rule but encoding state into conditions that need to be populated at reconcile time means state expectations are hard to rationalize when looking at the API only (documentation may clear this up, I haven't given it much thought). For example, we can clearly see the predefined states via the
State
type; conditions mask that and users are dependent on documentation to understand workflow state semantics. -
Conditions have a predefined recommended format which doesn't necessarily compliment what we're doing. We could break from this in the same way Cluster API has (they dropped the
ObservedGeneration
field) and that's probably fine but it contributes to the turmoil already shrouding conditions. We would need to very clearly, as Cluster API has, define our condition expectations.
What does it look like if we used conditions? I would probably forgo most of the 'states' we have in favor of the following condition types.
Started
: A monotonic condition indicating the workflow has begun.Succeeded
: A monotonic condition indicating the workflow succeeded.Unknown
indicates it hasn't finished andFalse
indicates it failed. OnFalse
I would expect aReason
andMessage
elaborating on why it didn't succeed.
I'm not sure this sufficiently represents what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be useful to document in the proposal's Rationale
section, whatever we decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are quite interesting thoughts I would say. Ok maybe we can move the usage of conditions in another PR/Proposal? One thing to keep in mind though is, we should distinguish what things are relevant to the CR itself and it is being reconciled and how the underlying resource(machine) has eventually turned out to be.
I have to say though that, I have seen multiple cases, where a status of an object is being derived from another status assigned of another object. I mean for now I guess we can roll without it, and then we can iterate in the future about what's needs to be done here :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the detail to the proposals rationale section and said we will tackle conditions as a separate proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moadqassem I thought some more on conditions, here's what I've come up with.
- We'll have the
Reason
andMessage
fields onActionStatus
as detailed in the proposal already (probably changeReason
to a string so users know its something they can define. - As part of this proposal, we'll define how action developers can propagate a
Reason
andMessage
from within actions such that it populates theActionStatus
fields. - We'll remove
Reason
andMessage
fromWorkflowStatus
and replace it with a conditions data structure (the conditions will include aReason
andMessage
). - We'll define a
WorkflowSucceeded
condition type. The values ofReason
andMessage
will be propagated from the failedActonStatus
providing the summarized observation of the workflow.
This does mean we'll need to define conditions as part of this proposal. Given we know we want to provide ways for users to propagate information from within actions on error, and that we want to trend toward conditions anyway, it feels appropriate to just go ahead and define our conditions setup as part of this proposal.
Using conditions on the WorkflowStatus
to summarize action/workflow state feels more extensible and avoids duplicating the information field for field between ActionStatus
and WorkflowStatus
.
Conditions will likely take the same form as Cluster APIs.
The TL;DR on Reason
and Message
propagation from within actions: developers can write to paths such as /tinkernell/reason
and /tinkerbell/message
and which we'll read and propagate from client to server. This is at least the initial proposal and inspired by Kubernetes mechanism for containers/pods to indicate why they exited non-zero.
There are probably other observations such as WorkflowStarted
that might be useful to have as conditions.
Side note, I'm also proposing the Reason
type be dropped in favor of a string
. This compliments better the idea that action developers can define its value.
I'm going to update the proposal with this change either way because it doesn't seem to sit right currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing with Jacob, I decided to shoot for this ahead of tomorrow's community call. The proposal has been updated.
|
||
### Webhooks | ||
|
||
We will introduce a set of basic webhooks for validating each CRD. The webhooks will only validate the `v1alpha2` API version. They will be implemented as part of the `tink-controller` component. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this might be controversial 😅 but can we please forget about webhooks and use CEL instead? Webhooks have been a huge pain to work with and debug and I would say this is a nice opportunity to integrate cel instead since this is still the beginning of the project. For more details about CEL PTAL at this https://kubernetes.io/blog/2022/09/23/crd-validation-rules-beta/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will use CEL (I know its not clear in the proposal and I'd be happy to add specific annotations to clarify) but it doesn't capture all validation. For example, CEL won't be able to verify all Hardware
submitted have unique MAC
fields.
```go | ||
// Hardware is a logical representation of a machine that can execute Workflows. | ||
type Hardware struct { | ||
HardwareSpec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing in Kubernetes guidlines that says we must have a Spec
or Status
field. I'm wondering if we should just move all the HardwareSpec
fields up to the Hardware
object.
Another thread is talking about the possibility of adding a Status
field which, if that happened, would necessitate the HardwareSpec
but I don't think we have any intention of adding hardware status information any time soon so I suspect this would be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure that there is something written about any CRD should have a field called Spec, or Status, however, if you take a look at k8s crds you will find that, they do have such fields which makes other k8s users familiar with our CRD and know where to look whenever they need to adjust something.
With that being said, we can explore this path in the future, and see how those fields can be used during the lifecycle of those CRs.
Signed-off-by: Chris Doherty <[email protected]>
5d64826
to
7ff2ad4
Compare
Signed-off-by: Chris Doherty <[email protected]>
7ff2ad4
to
84837a2
Compare
@jacobweinstock not sure if someone else need to take a look at this to approve 😅 also not sure if I can approve/merge things |
Signed-off-by: Chris Doherty <[email protected]>
742331b
to
9c9729e
Compare
Signed-off-by: Chris Doherty <[email protected]>
2857933
to
6de3e8b
Compare
Signed-off-by: Chris Doherty <[email protected]>
|
||
## Migrating from v1alpha1 to v1alpha2 | ||
|
||
Given the relative immaturity of the Tinkerbell API we will not provide any migration tooling from `v1alpha1` to `v1alpha2`. Users will be required to manually convert `Hardware` and `Template` resources. `Workflow` resources are considered ephemeral and can generally be deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should write some very rudimentary script or such to help with this process. I would recommend this for only for hardware and templates, not workflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. That should be simple enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be resolved.
883c71c
to
ab2fbe4
Compare
Signed-off-by: Chris Doherty <[email protected]>
ab2fbe4
to
4bd09ed
Compare
Signed-off-by: Chris Doherty <[email protected]>
Signed-off-by: Chris Doherty <[email protected]>
Signed-off-by: Chris Doherty <[email protected]>
Signed-off-by: Chris Doherty <[email protected]>
dbe6942
to
02421bb
Compare
Signed-off-by: Chris Doherty <[email protected]>
Signed-off-by: Chris Doherty <[email protected]>
Signed-off-by: Chris Doherty <[email protected]>
02421bb
to
cb99427
Compare
Implements #14.