
Create e2e integration tests for Adaptive Sampling #5717

Open
yurishkuro opened this issue Jul 8, 2024 · 44 comments · May be fixed by #5951
Labels
help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@yurishkuro
Member

yurishkuro commented Jul 8, 2024

We never had these tests properly set up. Now that Jaeger SDKs are deprecated, we don't even know if there is a working example of adaptive sampling using OTEL SDKs. We need to set up integration tests to validate that the functionality is still working.

There are two possible ways we could validate adaptive sampling:

  1. Start generating a load with a certain known rate and wait until adaptive sampling probability is adjusted from the default to a certain expected level. This is a bit tricky since the performance of the trace generator is unpredictable in GH Action runners, so we don't quite know how much traffic it would generate.
  2. Instead of waiting to achieve an expected level of sampling, we could run two different workloads, e.g. 1x volume and then 2x volume. Even if we don't know the exact expected level of sampling probability, we can expect that it should become about 2x smaller once the load increases.

An even simpler alternative might be to just detect that the pre-configured default probability (which the backend returns in the absence of any traffic for a given service) changes to a different calculated probability. For example, we could configure the default probability as 100%, configure the adaptive sampling target to be 1 trace per second, and run the load generator at 10 traces per second. We would expect the calculated rate to go down to about 10% (that's strategy 1 above), but even a change to 50% would already be an indication that adaptive sampling is working end to end. We don't need the e2e test to validate the exact calculation, since we do that in the unit tests.
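For intuition only, here is a back-of-the-envelope sketch in Go of the expected first-order adjustment; the real adaptive calculator is more involved, so the e2e test should only assert a directional drop, not this exact value.

// Rough illustration only; not the actual Jaeger adaptive calculator.
package main

import "fmt"

// expectedProbability is a naive first-order estimate: scale the current
// probability by target rate over observed rate.
func expectedProbability(current, targetTPS, observedTPS float64) float64 {
	if observedTPS == 0 {
		return current
	}
	return current * targetTPS / observedTPS
}

func main() {
	// default probability 100%, target 1 trace/s, generator emitting ~10 traces/s
	fmt.Println(expectedProbability(1.0, 1, 10)) // 0.1, i.e. ~10%
}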

@yurishkuro yurishkuro added the help wanted Features that maintainers are willing to accept but do not have cycles to implement label Jul 8, 2024
yurishkuro added a commit that referenced this issue Jul 8, 2024
## Which problem is this PR solving?
- Part of #5717

## Description of the changes
- Add a new flag to tracegen to allow it to use `jaegerremotesampling`
contrib module with OTEL SDK

## How was this change tested?
- It's not actually working at the moment; the backend does not recognize spans without sampling tags

Signed-off-by: Yuri Shkuro <[email protected]>
@mahadzaryab1
Collaborator

@yurishkuro question regarding your comment "We would expect the calculated rate to go down to about 10%" - how can we find out what the calculated rate is?

@yurishkuro
Member Author

@mahadzaryab1 we can query /sampling endpoint for the current sampling strategy (which is what the SDK does).

@mahadzaryab1
Collaborator

@yurishkuro Got it - thank you! I will give this issue a shot. Are there any other tasks for this issue? Do we want a docker-compose setup, README, etc.?

@yurishkuro
Member Author

You can see the discussion on the closed PR. I suggest writing up your plan first before jumping into implementation.

@mahadzaryab1
Collaborator

@yurishkuro Here's what I'm thinking (all of the following steps being in a shell script):

  1. Start the Jaeger backend with the following config:

  remote_sampling:
    adaptive:
      sampling_store: some_store
      initial_sampling_probability: 1.0
      target_samples_per_second: 1
    http:
    grpc:

  2. Generate load using tracegen to send 10 traces per second (-pause 100ms -duration 60m).
  3. Query the /sampling endpoint (with a maximum number of tries or a timeout of some sort) until the sampling rate drops below a certain threshold, say 50%. If it drops below the threshold, the test passes; otherwise, it fails (see the sketch after this list).
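A hedged sketch of what step 3 could look like as a Go check (it could equally be a curl/jq loop in the shell script); the port, endpoint path, service name, and JSON field names are assumptions based on responses shown elsewhere in this thread.

// Hypothetical polling helper for step 3; adjust endpoint and fields to the final config.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type strategyResponse struct {
	OperationSampling struct {
		DefaultSamplingProbability float64 `json:"defaultSamplingProbability"`
		PerOperationStrategies     []struct {
			Operation             string `json:"operation"`
			ProbabilisticSampling struct {
				SamplingRate float64 `json:"samplingRate"`
			} `json:"probabilisticSampling"`
		} `json:"perOperationStrategies"`
	} `json:"operationSampling"`
}

// waitForProbabilityBelow polls the remote-sampling endpoint until any
// per-operation sampling rate drops below threshold, or the timeout expires.
func waitForProbabilityBelow(url string, threshold float64, timeout time.Duration) (bool, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err != nil {
			return false, err
		}
		var s strategyResponse
		err = json.NewDecoder(resp.Body).Decode(&s)
		resp.Body.Close()
		if err != nil {
			return false, err
		}
		for _, op := range s.OperationSampling.PerOperationStrategies {
			if op.ProbabilisticSampling.SamplingRate < threshold {
				return true, nil
			}
		}
		time.Sleep(5 * time.Second)
	}
	return false, nil
}

func main() {
	ok, err := waitForProbabilityBelow(
		"http://localhost:5778/api/sampling?service=tracegen", 0.5, 3*time.Minute)
	fmt.Println("sampling rate dropped below threshold:", ok, "err:", err)
}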

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 4, 2024

I was also wondering, what was the purpose of #5718? If the Jaeger backend is responsible for doing the sampling, why does tracegen need to know about the sampling endpoint?

@yurishkuro
Member Author

The Jaeger backend is not doing the sampling; it only calculates the desired sampling rate based on the observed traffic. The SDK in the application polls for that sampling rate.

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 5, 2024

@yurishkuro Oh I see - so it's the client's responsibility to change the rate that it emits traces at? What's the reason for that instead of having Jaeger do the sampling?

@yurishkuro
Member Author

The SDK performs the sampling. If you instead delegate it to the backend it means you would have to collect traces for 100% of requests, which could be a huge overhead on the application's performance.

@mahadzaryab1
Collaborator

@yurishkuro Ah okay I see! Is this the Jaeger SDK that you're talking about? Are these still being used for v2? I was under the impression that client side libraries were retired in favour of OTEL.

@mahadzaryab1
Collaborator

Do you have any feedback on the plan I mentioned above?

@yurishkuro
Member Author

No, I was referring to OTEL SDK.

The plan sounds ok.

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 6, 2024

Oh okay I see - Thank you! I had a follow-up question: does the OTEL SDK have an integration with Jaeger for it to be able to adjust sampling based on the desired sampling rate that Jaeger serves?

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 6, 2024

Oh I see - we provide an extension in OTEL to do this https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/jaegerremotesampling/README.md. Very cool! Thanks for answering my questions - I'll pick this up and try to have a PR up when I make some progress.

@yurishkuro
Member Author

a) Yes, the Go and Java OTEL SDKs do implement support for the Jaeger remote sampler (see the wiring sketch below).
b) No, the extension you linked has nothing to do with adaptive sampling; it's only a remote sampling config provider. You can read about the distinction in our docs: https://www.jaegertracing.io/docs/1.60/sampling/
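For (a), a minimal wiring sketch with the OTEL Go SDK and the go.opentelemetry.io/contrib/samplers/jaegerremote package; the endpoint URL, service name, and refresh interval are illustrative, and exporter/processor setup is omitted.

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/contrib/samplers/jaegerremote"
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Poll Jaeger's remote sampling endpoint; until the first fetch succeeds,
	// fall back to always-sample (matching a 100% default probability).
	sampler := jaegerremote.New(
		"tracegen", // service name used when polling strategies
		jaegerremote.WithSamplingServerURL("http://localhost:5778/api/sampling"),
		jaegerremote.WithSamplingRefreshInterval(10*time.Second),
		jaegerremote.WithInitialSampler(sdktrace.AlwaysSample()),
	)
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sampler),
		// span exporter/processor setup omitted for brevity
	)
	otel.SetTracerProvider(tp)
	defer tp.Shutdown(context.Background())
}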

@yurishkuro
Member Author

Yes, the Go and Java OTEL SDKs do implement support for the Jaeger remote sampler.

Caveat: while they implement the support, it may not actually work with our adaptive sampling backend because OTEL SDKs do not send the necessary span tags indicating which sampler & probability were used, and our adaptive sampling processor checks for that. But we can disable that check in the code.

@mahadzaryab1
Collaborator

Thanks for the explanation @yurishkuro! So the remote sampling endpoint is just a means to serve the configuration. The OTEL SDK doesn't actually know that it's doing adaptive sampling; it just gets some configuration and samples accordingly. Is that accurate?

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

I was also running into an issue with my docker-compose setup. I have the following in docker-compose.yml:

services:
  jaeger:
    image: jaegertracing/jaeger:${JAEGER_IMAGE_TAG:-latest}
    volumes:
      - "./jaeger-v2-config.yml:/etc/jaeger/config.yml"
    command: ["--config", "/etc/jaeger/config.yml"]
    ports:
      - "16686:16686"
      - "14268:14268"

And I'm trying to access the sampling endpoint as follows:

curl 'http://localhost:14268/api/sampling?service=tracegen'

However, I get the following error: curl: (56) Recv failure: Connection reset by peer. Do you know if there's something I'm missing in my docker-compose setup that won't let me query that endpoint? I tried it with the all-in-one image and it seems to work okay with that. Does the jaeger-v2 image not expose port 14268 for querying?

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

@yurishkuro I tried running the steps manually for now and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L164-L167 returns an unrecognized sampler and a probability of 0, which just gives me the following response that does not change. I tried commenting out the sampler type check but that doesn't seem to make a difference. Do you have any ideas/thoughts on how to proceed?

{
  "strategyType": "PROBABILISTIC",
  "operationSampling": {
    "defaultSamplingProbability": 1,
    "defaultLowerBoundTracesPerSecond": 0.016666666666666666,
    "perOperationStrategies": [],
    "defaultUpperBoundTracesPerSecond": 0
  }
}

@yurishkuro
Member Author

yurishkuro commented Sep 7, 2024

curl 'http://localhost:14268/api/sampling?service=tracegen'

a) I don't know which config you are using; the only configs where remote sampling is defined are cmd/jaeger/config.yaml and cmd/jaeger/internal/all-in-one.yaml

b) the default port is different

$ curl http://localhost:5778/
'service' parameter must be provided once

@yurishkuro
Member Author

I tried running the steps manually for now and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L164-L167 returns an unrecognized sampler and a probability of 0

Yes, this is one place where you need to comment out the check. We should create an option in v2 that by default would disable that check; in the future, if OTEL SDKs start sending proper sampler identification, we can re-enable it.

When you say it doesn't work, does it record the throughput in storage or is there another check somewhere?
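Purely as an illustration of what such an option could look like; the names and types below are hypothetical and do not match the actual aggregator code.

// Hypothetical sketch only: field, type, and constant names are illustrative.
package adaptive

type Options struct {
	// RequireSamplerInfo preserves the old behavior of ignoring root spans
	// that carry no sampler.type/sampler.param tags. Defaulting it to false
	// would let spans from OTEL SDKs be counted toward throughput.
	RequireSamplerInfo bool
}

type samplerType int

const (
	samplerTypeUnrecognized samplerType = iota
	samplerTypeProbabilistic
)

// effectiveSamplerType decides how to treat a span's sampler type; the second
// return value reports whether the span should be counted at all.
func (o Options) effectiveSamplerType(st samplerType) (samplerType, bool) {
	if st == samplerTypeUnrecognized {
		if o.RequireSamplerInfo {
			return st, false // old behavior: skip spans without sampler identification
		}
		return samplerTypeProbabilistic, true // treat OTEL spans as probabilistic
	}
	return st, true
}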

@mahadzaryab1
Collaborator

target_samples_per_second: 1

Thank you so much! Using 5778 fixed the issue.

@mahadzaryab1
Collaborator

I tried running the steps manually for now and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L164-L167 returns an unrecognized sampler and a probability of 0

Yes, this is one place where you need to comment out the check. We should create an option in v2 that by default would disable that check; in the future, if OTEL SDKs start sending proper sampler identification, we can re-enable it.

When you say it doesn't work, does it record the throughput in storage or is there another check somewhere?

@yurishkuro I commented out https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L165-L167 and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L132-L134. The probability, however, is 0 and we end up entering the block at https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L132-L134. How is the sampler supposed to work when the sampler.param tag is not being sent?

@yurishkuro
Member Author

please stamp #5953, to reduce the confusion a bit.

Did you try this instead?

	if samplerType == span_model.SamplerTypeUnrecognized {
		samplerType = span_model.SamplerProbabilistic
	}

Because if you let the type remain Unrecognized there are other places that may get bypassed. It may still not work though, because the probability will be passed down as 0 and there's nothing we can do about that - we just need to find the places where that triggers different logic.

Q: have you looked at the traces being generated by the OTEL SDK with remote sampler? Do they say nothing at all about the type of sampler used?

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

I made the change you suggested but I'm still getting an empty response for perOperationStrategies.

Here's what I see in the logs, specifically: tracegen-1 | 2024-09-07T20:31:23.512Z LEVEL(-4) global/internal_logging.go:45 TracerProvider created {"config": {"SpanProcessors":[{}],"SamplerType":"*jaegerremote

jaeger-1    | 2024-09-07T20:31:23.442Z  info    extensions/extensions.go:41  Extension is starting...        {"kind": "extension", "name": "remote_sampling"}
jaeger-1    | 2024-09-07T20:31:23.442Z  info    remotesampling/extension.go:114      Starting adaptive sampling strategy provider    {"kind": "extension", "name": "remote_sampling", "sampling_store": "some_store"}
jaeger-1    | 2024-09-07T20:31:23.442Z  info    adaptive/provider.go:59      starting adaptive sampling service      {"kind": "extension", "name": "remote_sampling"}
jaeger-1    | 2024-09-07T20:31:23.442Z  info    remotesampling/extension.go:246      Starting remote sampling HTTP server    {"kind": "extension", "name": "remote_sampling", "endpoint": ":5778"}
jaeger-1    | 2024-09-07T20:31:23.442Z  info    remotesampling/extension.go:280      Starting remote sampling GRPC server    {"kind": "extension", "name": "remote_sampling", "endpoint": ":5779"}
jaeger-1    | 2024-09-07T20:31:23.442Z  info    extensions/extensions.go:58  Extension started.      {"kind": "extension", "name": "remote_sampling"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    adaptive/aggregator.go:45    Using unique participantName in adaptive sampling       {"kind": "processor", "name": "adaptive_sampling", "pipeline": "traces", "participantName": "47c8bfe49e1d-155ceca202abefe3"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    adaptive/processor.go:144    starting adaptive sampling postAggregator       {"kind": "processor", "name": "adaptive_sampling", "pipeline": "traces"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    [email protected]/otlp.go:103    Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "localhost:4317"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    [email protected]/otlp.go:153    Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": "0.0.0.0:4318"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    [email protected]/service.go:289      Everything is ready. Begin running and processing data.
jaeger-1    | 2024-09-07T20:31:23.445Z  info    localhostgate/featuregate.go:63      The default endpoints for all servers in components have changed to use localhost instead of 0.0.0.0. Disable the feature gate to temporarily revert to the previous default.     {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
tracegen-1  | 2024-09-07T20:31:23.512Z  INFO    tracegen/main.go:68 git-commit=0524ae6df8acac17e8ed52af9b86e6dee85a5e5e, git-version=v1.60.0, build-date=2024-08-06T15:39:43Z
tracegen-1  | 2024-09-07T20:31:23.512Z  INFO    tracegen/main.go:95 using otlp-http trace exporter for service tracegen
tracegen-1  | 2024-09-07T20:31:23.512Z  INFO    tracegen/main.go:121 using adaptive sampling URL: http://jaeger:5778/api/sampling
tracegen-1  | 2024-09-07T20:31:23.512Z  LEVEL(-4)       global/internal_logging.go:45        TracerProvider created  {"config": {"SpanProcessors":[{}],"SamplerType":"*jaegerremote
jaeger-1    | 2024-09-07T20:31:23.445Z  info    adaptive/aggregator.go:45    Using unique participantName in adaptive sampling       {"kind": "processor", "name": "adaptive_sampling", "pipeline": "traces", "participantName": "47c8bfe49e1d-155ceca202abefe3"}
jaeger-1    | 2024-09-07T20:31:23.445Z  info    adaptive/processor.go:144    starting adaptive sampling postAggregator       {"kind": "processor", "name": "adaptive_sampling", "pipeline": "traces"}

@mahadzaryab1
Collaborator

This method that generates the responses iterates over p.probabilities which is empty.

@yurishkuro
Member Author

iterates over p.probabilities which is empty.

if throughput is being recorded, that list should not be empty.

"SamplerType":"*jaegerremote

the output is cut off. I meant: look in the Jaeger UI to see what tags are recorded on the root span.

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

Sorry about that! Here's what I'm seeing in the Jaeger UI. I don't see any sampler information.

[Two Jaeger UI screenshots, 2024-09-07, showing the generated spans without any sampler information in the tags]

@yurishkuro
Member Author

Yeah, that's the problem. The adaptive sampling can still work; we just need to clean up the code paths that expect this information to be available on the spans.

NB: please stamp another PR where I moved the code around to make it more logically organized https://github.com/jaegertracing/jaeger/pull/5954/files

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

@yurishkuro I did some more debugging and found that this check always returns false so nothing gets appended to the slice. Specifically both t.time.Before(end) and t.time.Equal(end) are returning false. Any idea why this could be happening?

@yurishkuro
Member Author

func (p *PostAggregator) isUsingAdaptiveSampling(

I think this is the one you need to override to return true

@yurishkuro
Member Author

nothing gets appended to the slice

Hm, it's possible that there is a bug here, because adaptive sampling storage on top of the in-memory storage is a relatively new addition, and since we never had e2e tests there was no way to validate that it is correct. Although we do have e2e storage tests, so I would expect this part to still work correctly.

Another reason could be that you didn't wait long enough: adaptive sampling works slowly, with 1-minute granularity, so you need to wait 2-3 minutes for data to show up.

What would be great is to add expvar reporting of the post-aggregator's serviceCache, so that one can observe what data it accumulates.

@mahadzaryab1
Collaborator

func (p *PostAggregator) isUsingAdaptiveSampling(

I think this is the one you need to override to return true

Would this be in addition to the change you suggested in #5717 (comment)?

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 7, 2024

nothing gets appended to the slice

Hm, it's possible that there is a bug here, because adaptive sampling storage on top of the in-memory storage is a relatively new addition, and since we never had e2e tests there was no way to validate that it is correct. Although we do have e2e storage tests, so I would expect this part to still work correctly.

Another reason could be that you didn't wait long enough: adaptive sampling works slowly, with 1-minute granularity, so you need to wait 2-3 minutes for data to show up.

What would be great is to add expvar reporting of the post-aggregator's serviceCache, so that one can observe what data it accumulates.

Ah okay I see. I did wait for a few minutes to see if the change took effect but still nothing. I'll try playing around with it a bit more and lowering the calculation interval.

What would the expvar workflow look like? I've never used that package before. I'll also read more about it.

@yurishkuro
Member Author

Would this be in addition to the change you suggested in #5717 (comment)?

yes

What would the expvar workflow look like?

expvar is a standard Go library that defines "variables" to which we can assign values; those values can then be inspected at the /debug/vars endpoint (provided that the handler is mounted on some server). We used it in jaeger-v1 for some things.

Another alternative would be to use the zpages extension, but I don't think it is designed to accept external handlers today. Or we could just add an expvar page to the zpages extension (open-telemetry/opentelemetry-collector#11081).
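A hedged sketch of what that reporting could look like with the standard expvar package; the serviceCache type and the mutex are assumptions about the post-aggregator's internals, only the expvar calls themselves are standard library.

// Sketch only: serviceCache's type and locking are assumed, not taken from
// the actual post-aggregator code.
package adaptive

import (
	"encoding/json"
	"expvar"
	"sync"
)

// publishServiceCache registers a lazily evaluated expvar so the cache shows
// up at /debug/vars on whichever HTTP server mounts the expvar handler.
func publishServiceCache(mu *sync.Mutex, serviceCache map[string]map[string]float64) {
	expvar.Publish("adaptive_sampling_service_cache", expvar.Func(func() any {
		mu.Lock()
		defer mu.Unlock()
		// Marshal under the lock so the snapshot is internally consistent.
		raw, _ := json.Marshal(serviceCache)
		return json.RawMessage(raw)
	}))
}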

@mahadzaryab1
Collaborator

Would this be in addition to the change you suggested in #5717 (comment)?

yes

What would the expvar workflow look like?

expvar is a standard Go library that defines "variables" to which we can assign values; those values can then be inspected at the /debug/vars endpoint (provided that the handler is mounted on some server). We used it in jaeger-v1 for some things.

Another alternative would be to use the zpages extension, but I don't think it is designed to accept external handlers today. Or we could just add an expvar page to the zpages extension (open-telemetry/opentelemetry-collector#11081).

Sounds good! I can create a separate issue for this.

@mahadzaryab1
Collaborator

@yurishkuro I did some more debugging and here's what I'm seeing:

  • This check is always returning false at the moment. Specifically, t.time.Before(end) and t.time.Equal(end) are both always false.
  • We never enter this block here and so the strategy response is never generated. It looks like isLeader() is always evaluating to true.

When I remove the if t.time.After(start) && (t.time.Before(end) || t.time.Equal(end)) and !isLeader() checks, then I'm able to get this response. Let me know if you have any thoughts / ideas on how to proceed.

{
  "strategyType": 0,
  "operationSampling": {
    "defaultSamplingProbability": 1,
    "defaultLowerBoundTracesPerSecond": 0.016666666666666666,
    "perOperationStrategies": [
      {
        "operation": "lets-go",
        "probabilisticSampling": {
          "samplingRate": 0.00001
        }
      }
    ],
    "defaultUpperBoundTracesPerSecond": 0
  }
}

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 8, 2024

For the first point in my previous post, I think the default calculation delay was too high (2 minutes) so I lowered it to 10 seconds for the time being and I seem to be able to read some throughput values now. The second issue remains however.

@yurishkuro
Member Author

(2) I think you found a major bug, likely introduced in the recent refactoring. Before the refactoring, the whole adaptive sampling implementation was more like a single black box running in the v1 collector. When adapting it for v2 we needed to split the processing/calculation part from the strategy-serving part, since the former (aggregator + post-aggregator) needs to run as a trace processor in the OTEL pipeline and the latter (provider) needs to run as an extension in order to expose a new server for the /sampling endpoint.

I think in the previous version the "provider" code had access to the cache that the aggregator created when it was the leader, and therefore the leader's provider did not need to load the sampling probabilities from storage. Now this has changed; I don't think the provider should rely on leader-follower at all, since all instances of the provider can just read the probabilities from storage.

Ironically, at first I thought you were mistaken about point (2), so I started writing an implementation README (please stamp #5955), and while writing it I realized that we have this issue.

Feel free to remove the if !p.isLeader() check, it's a bug. Once we get the e2e test going we can do another refactoring to clean up the Provider from the leader/follower dependency altogether.

@mahadzaryab1
Collaborator

@yurishkuro I have a question about the calculation. I'm running tracegen as follows to generate 10 traces per second:

  tracegen:
    image: jaegertracing/jaeger-tracegen:latest
    environment:
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
    command: ["-adaptive-sampling", "http://jaeger:5778/api/sampling", "-pause", "100ms", "-duration", "60m"]
    depends_on:
      - jaeger

and here is my configuration:

    adaptive:
      sampling_store: some_store
      initial_sampling_probability: 1.0
      target_samples_per_second: 1
      calculation_interval: 10s
      calculation_delay: 20s

When I query the sampling endpoint after a minute or so, I get the following response:

{
  "strategyType": 0,
  "operationSampling": {
    "defaultSamplingProbability": 1,
    "defaultLowerBoundTracesPerSecond": 0.016666666666666666,
    "perOperationStrategies": [
      {
        "operation": "lets-go",
        "probabilisticSampling": {
          "samplingRate": 0.00001
        }
      }
    ],
    "defaultUpperBoundTracesPerSecond": 0
  }
}

Why is the samplingRate dropping to 0.00001 instead of 0.1?

@yurishkuro
Member Author

Do you see this rate changing over time at all, or is it always the same? If it's the same, maybe it's the default value being returned and the adaptive sampling part is actually not working, i.e. it's not recognizing the traffic.

I created a helper extension #5986 to expose expvar data via HTTP. As mentioned earlier, we should add some reporting from the adaptive sampling processors into expvar variables; this way you'd be able to inspect the state in better detail, specifically which throughput has been recorded and what probabilities the adaptive sampling is assigning.

@mahadzaryab1
Collaborator

@yurishkuro The rate does seem to be changing over time, but the calculation doesn't seem to make sense. I am hardcoding this function to return true for the time being - could that be why? I also wanted to ask what configuration we should use for now to bypass this check.

Thanks for adding the extension in! I'll play around with it.

@yurishkuro
Member Author

You need to hardcode bypasses for both checks.

I think it's best to implement reporting of the aggregator's internal state to expvar; that will make it a lot easier to investigate what happens.

@mahadzaryab1
Collaborator

mahadzaryab1 commented Sep 16, 2024

@yurishkuro I pushed up my current working state. It looks like hardcoding isUsingAdaptiveSampling causes the unit tests to fail, so I'm assuming that's the reason the calculations I'm seeing are not correct. The tests seem to be failing because of incorrect calculations. If we hardcode that check to true, are we performing extra calculations? If that check is not bypassed, then the probability only gets calculated once, because the length of the service cache is 1 and so this check never gets entered.

Let me know if you have any thoughts on how to proceed. I'll work on adding the expvar for the aggregator's service cache.
