Supply "depth" information when including relationships #3010

Open
kzantow opened this issue Jul 3, 2024 · 7 comments · May be fixed by #3402

@kzantow
Contributor

kzantow commented Jul 3, 2024

What would you like to be added:
Relationship depth information, when Syft is unable to provide a full transitive dependency graph.

Why is this needed:
One of the data elements mentioned in the NTIA minimum requirements is the depth of relationships. If Syft is able to build an accurate SBOM with a full transitive-dependency graph, that would be ideal, but in several scenarios this information is unavailable or does not accurately depict the transitive graph. Examples include Python requirements.txt and Go binary module information, which only provide a flat list of dependencies, and binaries that are identified directly, without any information about their dependent components.

One solution is to provide an "unknown" indicator stating that Syft was unable to determine a full transitive dependency graph, or that Syft stopped resolving online parent references after five levels. Catalogers can return these as "unknowns" where appropriate, associated with the file(s) where the package graph information originated.
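
To make the "stopped after five levels" case concrete, here is a minimal sketch of a depth-limited parent walk that reports an unknown when it gives up early. The `Unknown` struct, `resolveParents`, and the `maxParentDepth` constant are hypothetical names for illustration only; they are not Syft's actual types or API.

```go
package main

import "fmt"

// maxParentDepth mirrors the five-level cutoff mentioned above (assumed value).
const maxParentDepth = 5

// Unknown is a hypothetical record tying a reason to the file where the
// package graph information originated.
type Unknown struct {
	Path   string
	Reason string
}

// resolveParents walks a child -> parent map starting at start and stops
// after maxDepth hops. It returns the chain it resolved, plus an Unknown
// when the chain was cut short by the depth limit.
func resolveParents(parents map[string]string, start, path string, maxDepth int) ([]string, *Unknown) {
	var chain []string
	current := start
	for depth := 0; depth < maxDepth; depth++ {
		parent, ok := parents[current]
		if !ok {
			return chain, nil // reached the root: the chain is complete
		}
		chain = append(chain, parent)
		current = parent
	}
	if _, more := parents[current]; more {
		reason := fmt.Sprintf("parent resolution stopped after %d levels", maxDepth)
		return chain, &Unknown{Path: path, Reason: reason}
	}
	return chain, nil
}

func main() {
	parents := map[string]string{"a": "b", "b": "c", "c": "d", "d": "e", "e": "f", "f": "g"}
	chain, unknown := resolveParents(parents, "a", "pom.xml", maxParentDepth)
	fmt.Println(chain, unknown != nil)
}
```

The depth limit also keeps the walk finite if the parent references happen to form a cycle.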

Additional context:
This is likely to depend on the PR for known unknowns getting merged.

This is a part of #632

@kzantow kzantow added the enhancement New feature or request label Jul 3, 2024
@wagoodman wagoodman self-assigned this Sep 24, 2024
@wagoodman wagoodman moved this to In Progress in OSS Sep 24, 2024
@wagoodman
Contributor

Related to #572

We want to be able to describe the topology and limitations of any dependency graph that an SBOM describes. This isn't a property of the SBOM as a whole, or of a language or packaging ecosystem; it is determined one package at a time, based on the evidence we found and what we know about the kinds of files that make up that evidence (e.g. package.json and package-lock.json provide different answers here, and the answer differs again when a populated node_modules directory exists from a previous npm install run).

I feel that on a per-package basis we're looking for the following description:

  • From a capability perspective: Could we capture dependency information or not?
  • From a node-quality perspective: If we could capture dependencies, to what extent, depth-wise, do we have the node information? That is, maybe we only have direct dependencies captured (partial), or we have all indirect dependencies listed as well (full).
  • From an edge-quality perspective: If we have full dependencies captured for a package, what is the quality of the relationships between these nodes? In some cases we only have a simple listing of dependencies with no real relationships (e.g. we know A and B are direct deps, and C and D are indirect deps, but we don't know by which means C and D were included [was it A? B? or both?]). Sometimes we can partition direct dependencies from transitive/indirect ones, other times we can't. Sometimes we have all direct dependency information for all nodes in the graph, and thus can clearly describe all the ways your application depends on any dependency (e.g. there are 13 paths in the dependency graph that reach dep node D).
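
As a small illustration of why edge quality matters, the sketch below counts the distinct paths that reach a dependency when every node's direct dependencies are known. The graph and the `countPaths` helper are made up for this example, and the code assumes the graph is acyclic.

```go
package main

import "fmt"

// countPaths returns the number of distinct paths from node `from` to node
// `to` in an acyclic dependency graph, given as adjacency lists of direct
// dependencies. memo caches results per starting node.
func countPaths(deps map[string][]string, from, to string, memo map[string]int) int {
	if from == to {
		return 1
	}
	if n, ok := memo[from]; ok {
		return n
	}
	total := 0
	for _, next := range deps[from] {
		total += countPaths(deps, next, to, memo)
	}
	memo[from] = total
	return total
}

func main() {
	// Hypothetical graph: the app depends on A and B, which both lead to D.
	deps := map[string][]string{
		"app": {"A", "B"},
		"A":   {"C", "D"},
		"B":   {"D"},
		"C":   {"D"},
	}
	fmt.Println(countPaths(deps, "app", "D", map[string]int{}))
}
```

With only a flat listing of A, B, C, and D there would be no adjacency lists to walk, so this question could not be answered.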

So how should we start expressing these topologies? I have an early/incomplete thought about a new field on pkg.Package called dependency with the following subfields:

  • nodes: with possible values...
    • unknown: no claim is made about whether we're able to find any package dependencies
    • direct-only: partial set of nodes, only describing direct dependencies
    • all: all direct and indirect nodes are described
  • edges: with possible values...
    • unknown: no claim is made about whether we're able to find any information about how dependencies are related to one another
    • flat: nodes have relationships between both direct and indirect dependencies; cannot distinguish between direct and indirect dependencies
    • all: nodes have relationships between themselves and only direct dependencies

One question that comes to mind: what about cases where we can partition nodes into direct/indirect dependencies but it is still a flat list (like go.mod)? We can only say all/flat but it's still valuable to know which of these nodes are indirect. Does this mean we should add additional dependency information onto the edge itself? (in which case this is a non-point)

While I'm not sold on the specifics of the field, I think I'm becoming more convinced that describing the node and edge qualities separately is more valuable than attempting to combine them into a single enum field.

Another consideration is that there are nodes in the graph that cross ecosystems, combining nodes making up dependency graphs in one ecosystem with another dependency graph for another ecosystem. One example of this is with binary packages: these may relate to any number of other ecosystems based on file ownership overlap and dynamic imports (and soon dlopen descriptions) from that binary. So it may not be as simple as having an ecosystem cataloger make a claim on a package about its node/edge/capability conclusion... this may additionally be a post-cataloging analysis that further annotates these qualities based on the final graph captured.

Thoughts to be continued in another post soon...

@wagoodman
Contributor

wagoodman commented Sep 26, 2024

From a discussion with the team on this one, we nudged this in a different direction. The conclusive point of discussion was: when asking a single package node for information about its dependencies, it shouldn't attempt to answer anything outside its immediate dependencies. That is, asking a node to describe the graph isn't really correct. We should instead limit the answer to only the immediate part of the graph that the node is privy to.

This somewhat eliminates the need to describe edges in such depth. The current suggestion from the team is to have a single dependencies field with the following possible enum values:

  • unknown: there is no clear dependency resolution mechanism for determining whether there are any dependencies
  • complete-direct-only: the full set of direct dependencies is enumerated as relationships to the current package
  • complete-transitive: the full set of direct and indirect dependencies (mixed) is enumerated as relationships to the current package
  • incomplete: any enumerated dependencies related to this package cannot be assumed to be the full set of direct dependencies

Furthermore, to reopen a conversation from #572, we should distinguish edges that are known direct dependencies from edges that are known transitive (indirect) dependencies. In the common case of direct dependencies, we should continue to use the dependency-of relationship type. However, we should not use this relationship type when describing dependencies that are NOT direct dependencies; another type should be created for this purpose.
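
A rough sketch of what typing the edges separately could look like follows. The `dependency-of` value is the existing relationship type named above; `indirect-dependency-of`, the `Relationship` struct, and `relationshipsFor` are hypothetical names chosen for illustration, and the From/To direction is an assumption.

```go
package main

import "fmt"

// RelationshipType names the kind of edge between two packages.
type RelationshipType string

const (
	// DependencyOf is the relationship type used for direct dependencies.
	DependencyOf RelationshipType = "dependency-of"
	// IndirectDependencyOf is a hypothetical name; the discussion only says
	// that another type should be created for non-direct dependencies.
	IndirectDependencyOf RelationshipType = "indirect-dependency-of"
)

// Relationship is a simplified edge (assumed direction: From is the
// dependency, To is the package that depends on it).
type Relationship struct {
	From string
	To   string
	Type RelationshipType
}

// relationshipsFor builds typed edges for pkg from lists of known direct
// and known indirect dependencies.
func relationshipsFor(pkg string, direct, indirect []string) []Relationship {
	rels := make([]Relationship, 0, len(direct)+len(indirect))
	for _, d := range direct {
		rels = append(rels, Relationship{From: d, To: pkg, Type: DependencyOf})
	}
	for _, d := range indirect {
		rels = append(rels, Relationship{From: d, To: pkg, Type: IndirectDependencyOf})
	}
	return rels
}

func main() {
	for _, r := range relationshipsFor("app", []string{"A", "B"}, []string{"C"}) {
		fmt.Printf("%s --%s--> %s\n", r.From, r.Type, r.To)
	}
}
```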

edit: see the final names used in the PR description #3402 (comment)

@kzantow
Contributor Author

kzantow commented Sep 27, 2024

I'm not sure why I hadn't looked this up before, but I should also note the related SPDX 3 field: https://spdx.github.io/spdx-spec/v3.0.1/model/Core/Vocabularies/RelationshipCompleteness/. This is defined on a one-to-many relationship element and isn't exactly the same thing as we were talking about but is very closely related, I think.

@wagoodman wagoodman linked a pull request Oct 31, 2024 that will close this issue
@wagoodman wagoodman moved this from In Progress to In Review in OSS Oct 31, 2024
@willmurphyscode
Contributor

I've been thinking about this a bit and discussing it with some folks offline, and I don't think we can get one-word enum names to carry all the info. The proposal from a weekend discussion is:

  • unknown/noAssertion - we don't know the completeness or directness of the dependencies
  • incompleteDirectOnly - we know that this package's set of dependencies is incomplete, but we also know that it doesn't contain indirect dependencies
  • incompleteMixed or incompleteFlattened - this package's set of dependencies is incomplete, and may contain indirect dependencies, for example a requirements.txt
  • completeDirectOnly - this package's dependencies contain the complete list of its direct dependencies and nothing else, e.g. Cargo.lock, because we can draw the whole dependency graph, giving each package exactly and only its dependencies.
  • completeMixed or completeFlattened - this package's dependencies are complete, but make no distinction between direct and indirect dependencies, e.g. the modules parsed from a Go binary.

What do you all think of these field names @wagoodman @kzantow ?

@wagoodman
Contributor

This is proposing a little more than a rename -- this is introducing new states. I'm trying to understand how a user would use the information of incompleteMixed/incompleteDirectOnly? I feel that if the definition of completeness stays focused on the direct dependencies of the current node, then knowing whether other indirect dependencies are mixed in doesn't help clarify the completeness beyond "this is incomplete". In other words, clarifying the opposite, completeness, makes sense because the distinction between direct-only and mixed directly indicates the accuracy of the graph for use cases like reachability analysis, but the same cannot be said for incompleteness (as far as I can tell). This is the original reason why I dropped the other incomplete* permutations from the enum.

On your other point, single-word enums, yeah, I get the same sense that single words won't quite cut it. Ideally there would be a way to indicate with a name that mixed is a superset of complete (in the same way that it is conveyed with the words saturated and supersaturated). I don't quite have a name that means that, but in lieu of that, qualifying mixed as a class of completeness would be better than not... making the total suggestions:

  • unknown
  • incomplete
  • complete-direct-only
  • complete-mixed

Maybe leave complete instead of complete-direct-only?

@willmurphyscode
Contributor

> Maybe leave complete instead of complete-direct-only?

I disagree. It's not much more typing, and it makes the distinction that other types might mix direct and indirect dependencies more obvious.

> This is proposing a little more than a rename -- this is introducing new states. I'm trying to understand how a user would use the information of incompleteMixed/incompleteDirectOnly?

Is there a compelling reason not to represent the complete state space of direct-only/mixed X complete/incomplete? We don't need to know what exactly downstream users want with each thing. We should just say true things like, "this dependency list is incomplete and mixes direct and indirect dependencies." It seems like the only reason for excluding it is that we don't have any examples of where we'd say it yet (though I think python requirements.txt is incompleteMixed).
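
The full state space argued for here is the cross product of two independent questions plus an "unknown" state. A minimal sketch, with a hypothetical `Assessment` type and the names taken from the proposal earlier in this thread:

```go
package main

import "fmt"

// Assessment captures the two independent questions, plus whether anything
// can be asserted at all. This type is hypothetical, for illustration only.
type Assessment struct {
	Known    bool // false means no assertion can be made
	Complete bool // every direct dependency is present in the list
	Mixed    bool // indirect dependencies may be present in the same list
}

// Name maps an assessment onto the proposed enum names.
func (a Assessment) Name() string {
	switch {
	case !a.Known:
		return "unknown"
	case a.Complete && a.Mixed:
		return "completeMixed"
	case a.Complete:
		return "completeDirectOnly"
	case a.Mixed:
		return "incompleteMixed"
	default:
		return "incompleteDirectOnly"
	}
}

func main() {
	// Enumerate the direct-only/mixed X complete/incomplete combinations.
	for _, complete := range []bool{false, true} {
		for _, mixed := range []bool{false, true} {
			a := Assessment{Known: true, Complete: complete, Mixed: mixed}
			fmt.Println(a.Name())
		}
	}
}
```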

@willmurphyscode
Contributor

willmurphyscode commented Nov 22, 2024

@wagoodman and I talked offline, and we think these values will work:

  • unknown
  • incomplete
  • incomplete-with-indirect (might be added later; we don't know of a cataloger that needs this today)
  • complete
  • complete-with-indirect
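
As a sketch only, the agreed values could be modeled as a string enum with a small validator. The Go identifiers (`Completeness`, `ParseCompleteness`, `HasAllDirect`) are hypothetical, and incomplete-with-indirect is left out because it is only a possible later addition.

```go
package main

import "fmt"

// Completeness describes how complete a package's dependency list is.
type Completeness string

const (
	Unknown              Completeness = "unknown"
	Incomplete           Completeness = "incomplete"
	Complete             Completeness = "complete"
	CompleteWithIndirect Completeness = "complete-with-indirect"
)

// ParseCompleteness validates a raw string against the values listed above.
func ParseCompleteness(raw string) (Completeness, error) {
	switch c := Completeness(raw); c {
	case Unknown, Incomplete, Complete, CompleteWithIndirect:
		return c, nil
	default:
		return Unknown, fmt.Errorf("unrecognized completeness value %q", raw)
	}
}

// HasAllDirect reports whether every direct dependency is known to be listed,
// regardless of whether indirect dependencies are mixed in.
func (c Completeness) HasAllDirect() bool {
	return c == Complete || c == CompleteWithIndirect
}

func main() {
	c, err := ParseCompleteness("complete-with-indirect")
	if err != nil {
		panic(err)
	}
	fmt.Println(c, c.HasAllDirect())
}
```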
