Implement in-memory Service Dependency Graph using Apache Beam #5911

Open · 4 tasks
yurishkuro opened this issue Aug 31, 2024 · 4 comments
Labels
area/otel, area/storage, changelog:new-feature, help wanted, storage/badger

Comments

@yurishkuro (Member) commented Aug 31, 2024

For background, see #5910

Jaeger all-in-one typically runs with in-memory or Badger storage. Both have a special implementation of the Dependencies Storage API: instead of pre-computing and storing the dependencies, they brute-force recalculate them on demand on each request.
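Conceptually, that on-demand recalculation looks something like the sketch below. This is a simplified stand-in, not Jaeger's actual internals; the span type and the map-of-traces storage shape are hypothetical:

```go
package main

import "fmt"

// span is a hypothetical, minimal stand-in for a stored span.
type span struct {
	SpanID, ParentID, Service string
}

// dependencies scans every stored trace and counts cross-service
// parent->child calls from scratch on each request - this is the
// brute-force recalculation described above.
func dependencies(traces map[string][]span) map[[2]string]uint64 {
	counts := make(map[[2]string]uint64)
	for _, spans := range traces {
		byID := make(map[string]span, len(spans))
		for _, s := range spans {
			byID[s.SpanID] = s
		}
		for _, s := range spans {
			if p, ok := byID[s.ParentID]; ok && p.Service != s.Service {
				counts[[2]string{p.Service, s.Service}]++
			}
		}
	}
	return counts
}

func main() {
	traces := map[string][]span{
		"t1": {
			{SpanID: "1", Service: "frontend"},
			{SpanID: "2", ParentID: "1", Service: "backend"},
		},
	}
	for link, n := range dependencies(traces) {
		fmt.Printf("%s -> %s: %d\n", link[0], link[1], n)
	}
}
```

The cost is proportional to the total number of stored spans on every call, which is why it degrades on large instances.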

This brute-force approach is fine for small demos, but:

  • if all-in-one is run on a large machine and is allowed to store a lot of traces, this brute-force recalculation could be very slow
  • the two implementations are completely independent, and different from the Spark implementation, so we have 3 different copies of the code to maintain

Following the proposal in RFC #5910, we could re-implement this logic as an in-process streaming component using Apache Beam with the direct runner. This would let us consolidate the graph-building logic across the memory and Badger storages (in fact, extract it from them into an independent component), and in the future we could adapt it to run in a distributed manner on big data runners without changing the business logic.
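As a rough illustration of the direction (not a committed design), a minimal Beam pipeline on the direct runner could look like the following. The trace/spanRef types and extractLinks function are hypothetical stand-ins; only the Beam SDK calls are real:

```go
package main

import (
	"context"
	"fmt"
	"reflect"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

// Hypothetical lightweight trace representation; a real implementation
// would use Jaeger's model types.
type spanRef struct {
	SpanID, ParentID, Service string
}

type trace struct {
	Spans []spanRef
}

// extractLinks emits one "parent -> child" string per cross-service
// parent-child span pair found in a completed trace.
func extractLinks(t trace, emit func(string)) {
	byID := make(map[string]spanRef, len(t.Spans))
	for _, s := range t.Spans {
		byID[s.SpanID] = s
	}
	for _, s := range t.Spans {
		if p, ok := byID[s.ParentID]; ok && p.Service != s.Service {
			emit(p.Service + " -> " + s.Service)
		}
	}
}

func init() {
	beam.RegisterType(reflect.TypeOf((*spanRef)(nil)).Elem())
	beam.RegisterType(reflect.TypeOf((*trace)(nil)).Elem())
	beam.RegisterFunction(extractLinks)
}

func main() {
	beam.Init()
	p := beam.NewPipeline()
	s := p.Root()

	// In the real processor this input would come from the trace-completion
	// aggregator; a fixed list stands in for it here.
	traces := beam.CreateList(s, []trace{{Spans: []spanRef{
		{SpanID: "1", Service: "frontend"},
		{SpanID: "2", ParentID: "1", Service: "backend"},
	}}})

	links := beam.ParDo(s, extractLinks, traces)
	counts := stats.Count(s, links) // KV<link, call count>

	beam.ParDo0(s, func(link string, n int) {
		fmt.Printf("%s: %d call(s)\n", link, n)
	}, counts)

	// beamx.Run executes on the direct (in-process) runner by default.
	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}
```

The same pipeline topology could later be submitted to a distributed runner without touching extractLinks, which is the point of using Beam here.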

Some implementation details:

  • The logic will run as a dedicated processor in the OTEL pipeline (similar to the adaptive sampling processor).
  • The processor will accumulate a dependency graph from the stream of traces it receives and periodically write it to the dependencies storage (it can obtain the storage from the jaeger_storage extension).
  • The processor will need to perform trace aggregation similar to the tail sampling processor, grouping spans by trace ID and waiting for a period of inactivity, after which the trace can be declared "complete" (i.e. fully received, assembled, and ready for processing). It would be interesting to see whether this functionality can be abstracted out of the tail sampling processor. Unlike the tail sampler, the aggregator for dependencies does not need to keep full spans in memory, because (a) if it really needs them it can get them from the SpanReader, and (b) it can keep a more lightweight structure of only span IDs, span kinds, and their parent-child relationships. This will be useful once the logic starts running on a big data pipeline. A sketch of such an aggregator follows this list.
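A minimal sketch of the inactivity-timeout aggregation idea, assuming a simple in-memory map keyed by trace ID is acceptable; the spanMeta and aggregator names are hypothetical, for illustration only:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spanMeta is the lightweight per-span record suggested above: span ID,
// span kind, parent ID, and service - no full span payload.
type spanMeta struct {
	SpanID   string
	ParentID string
	Kind     string
	Service  string
}

// traceState accumulates the spans of one trace and remembers when the
// last span arrived, so inactivity can be detected.
type traceState struct {
	spans    []spanMeta
	lastSeen time.Time
}

// aggregator groups spans by trace ID and declares a trace complete
// after a configurable period of inactivity.
type aggregator struct {
	mu       sync.Mutex
	traces   map[string]*traceState
	idle     time.Duration
	complete func(traceID string, spans []spanMeta) // downstream callback
}

func newAggregator(idle time.Duration, complete func(string, []spanMeta)) *aggregator {
	a := &aggregator{
		traces:   make(map[string]*traceState),
		idle:     idle,
		complete: complete,
	}
	go a.flushLoop()
	return a
}

// Add records one span under its trace ID and resets the inactivity timer.
func (a *aggregator) Add(traceID string, s spanMeta) {
	a.mu.Lock()
	defer a.mu.Unlock()
	st, ok := a.traces[traceID]
	if !ok {
		st = &traceState{}
		a.traces[traceID] = st
	}
	st.spans = append(st.spans, s)
	st.lastSeen = time.Now()
}

// flushLoop periodically scans for traces that were inactive longer than
// the idle window and hands them to the completion callback.
func (a *aggregator) flushLoop() {
	for range time.Tick(a.idle / 2) { // a real implementation would manage the ticker's lifecycle
		now := time.Now()
		a.mu.Lock()
		for id, st := range a.traces {
			if now.Sub(st.lastSeen) >= a.idle {
				delete(a.traces, id)
				go a.complete(id, st.spans)
			}
		}
		a.mu.Unlock()
	}
}

func main() {
	agg := newAggregator(2*time.Second, func(id string, spans []spanMeta) {
		fmt.Printf("trace %s complete with %d spans\n", id, len(spans))
	})
	agg.Add("t1", spanMeta{SpanID: "1", Kind: "server", Service: "frontend"})
	agg.Add("t1", spanMeta{SpanID: "2", ParentID: "1", Kind: "client", Service: "frontend"})
	time.Sleep(5 * time.Second) // let the flush loop fire
}
```

Because only spanMeta records are kept, memory stays proportional to span counts, not span payload sizes, which is what makes the same structure viable on a big data pipeline later.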

Steps:

  • build a processor performing streaming aggregation for the basic service map (using the logic that already exists in the memory store)
  • implement dependencies storage in the memory store (GetDependencies could be controlled by a feature flag allowing switching between the current behavior and the new behavior that just reads the pre-computed data; see the sketch after this list)
  • hook up the streaming processor to write to storage
  • verify the behavior via the existing integration tests for dependencies
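A minimal sketch of the feature-flag idea from the second step, assuming the dependency reader signature `GetDependencies(ctx, endTs, lookback)`; the precomputedEnabled flag, deps field, and recomputeDependencies helper are hypothetical, not actual Jaeger code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/jaegertracing/jaeger/model"
)

// Store is a hypothetical stand-in for the memory store.
type Store struct {
	// precomputedEnabled would be wired to the feature flag.
	precomputedEnabled bool
	// deps holds links written by the streaming processor.
	deps []model.DependencyLink
}

func (s *Store) GetDependencies(ctx context.Context, endTs time.Time, lookback time.Duration) ([]model.DependencyLink, error) {
	if s.precomputedEnabled {
		// New behavior: just read what the streaming processor stored.
		return s.deps, nil
	}
	// Current behavior: brute-force recalculation over all stored traces.
	return s.recomputeDependencies(endTs, lookback)
}

func (s *Store) recomputeDependencies(endTs time.Time, lookback time.Duration) ([]model.DependencyLink, error) {
	// ... existing on-demand scan of all traces (elided) ...
	return nil, nil
}

func main() {
	s := &Store{precomputedEnabled: true, deps: []model.DependencyLink{
		{Parent: "frontend", Child: "backend", CallCount: 42},
	}}
	deps, _ := s.GetDependencies(context.Background(), time.Now(), time.Hour)
	fmt.Println(deps)
}
```

Keeping both paths behind a flag lets the existing integration tests validate the new behavior against the old one before the brute-force path is removed.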
@yurishkuro added the help wanted label on Aug 31, 2024
@dosubot added the area/otel, area/storage, changelog:new-feature, and storage/badger labels on Aug 31, 2024
@NavinShrinivas (Contributor) commented

Hey Yuri, this seems interesting. Are you thinking this will be a separate service that the collector forwards the details to? I'm just trying to make sense of this.

Are you in the process of breaking it down into smaller tasks?

@yurishkuro (Member, Author) commented

It's not a separate service.

@tronda commented Sep 4, 2024

The OpenTelemetry Collector Contrib includes the ServiceGraphConnector, which generates metrics from trace data that can be used to draw a dependency graph. Having struggled with the Jaeger Spark dependency job, the service graph connector sounded appealing to us because deployment would be much easier, since we already have Prometheus available. Are there any architectural issues with the service graph connector that make it a poor fit for Jaeger?

@yurishkuro (Member, Author) commented

@tronda Jaeger is not a metrics database, so in order to use the ServiceGraphConnector the user needs to run another backend. Also, the transitive dependency graph is simply not representable in the metrics format, yet it is much more useful than the point-to-point graph that ServiceGraphConnector can produce.
