-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create e2e integration tests for Adaptive Sampling #5717
Comments
## Which problem is this PR solving? - Part of #5717 ## Description of the changes - Add a new flag to tracegen to allow it to use `jaegerremotesampling` contrib module with OTEL SDK ## How was this change tested? - It's not actually working at the moment, the backend does not recognize the spans without sampling tags Signed-off-by: Yuri Shkuro <[email protected]>
@yurishkuro question regarding your comment |
@mahadzaryab1 we can query /sampling endpoint for the current sampling strategy (which is what the SDK does). |
@yurishkuro got it - thank you! i will give this issue a shot. are there any other tasks for this issue? do we want a docker-compose setup, readme, etc? |
You can see the discussion on the closed PR. I suggest writing up your plan first before jumping into implementation. |
@yurishkuro Here's what I'm thinking (all of the following steps being in a shell script):
remote_sampling:
adaptive:
sampling_store: some_store
initial_sampling_probability: 1.0
target_samples_per_second: 1
http:
grpc:
|
I was also wondering, what was the purpose of #5718? If the Jaeger backend is responsible for doing the sampling, why does tracegen need to know about the sampling endpoint? |
Jaeger backend is not doing sampling, it only calculates the desired sampling rate based on the observed traffic. The SDK in the application polls for that sampling rate. |
@yurishkuro Oh I see - so it's the client's responsibility to change the rate that it emits traces at? What's the reason for that instead of having Jaeger do the sampling? |
The SDK performs the sampling. If you instead delegate it to the backend it means you would have to collect traces for 100% of requests, which could be a huge overhead on the application's performance. |
@yurishkuro Ah okay I see! Is this the Jaeger SDK that you're talking about? Are these still being used for v2? I was under the impression that client side libraries were retired in favour of OTEL. |
Do you have any feedback on the plan I mentioned above? |
No, I was referring to OTEL SDK. The plan sounds ok. |
Oh okay I see - Thank you! I had a follow-up question: does the OTEL SDK have an integration with Jaeger for it to be able to adjust sampling based on the desired sampling rate that Jaeger serves? |
Oh I see - we provide an extension in OTEL to do this https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/jaegerremotesampling/README.md. Very cool! Thanks for answering my questions - I'll pick this up and try to have a PR up when I make some progress. |
a) Yes, Go and Java OTEL SDKs do implement support for jaeger remote sampler. |
Caveat: while they implement the support, it may not actually work with our adaptive sampling backend because OTEL SDKs do not send the necessary span tags indicating which sampler & probability were used, and our adaptive sampling processor checks for that. But we can disable that check in the code. |
Thanks for the explanation @yurishkuro! So the remote sampling endpoint is just a means to serve the configuration. The OTEL SDK doesn't actually know that its doing adaptive sampling, it just gets some configuration and samples accordingly. Is that accurate? |
I was also running into an issue with my docker-compose set up. I have the following in services:
jaeger:
image: jaegertracing/jaeger:${JAEGER_IMAGE_TAG:-latest}
volumes:
- "./jaeger-v2-config.yml:/etc/jaeger/config.yml"
command: ["--config", "/etc/jaeger/config.yml"]
ports:
- "16686:16686"
- "14268:14268" And I'm trying to access the sampling endpoint as follows: curl 'http://localhost:14268/api/sampling?service=tracegen' However, I get the following error: |
@yurishkuro I tried running the steps maually for now and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L164-L167 returns an unrecognized sampler and a probability of 0 which is just giving me the following response which does not change. I tried commenting out the sampler type check but that doesn't seem to make a difference. Do you have any ideas/thoughts on how to proceed? {
"strategyType": "PROBABILISTIC",
"operationSampling": {
"defaultSamplingProbability": 1,
"defaultLowerBoundTracesPerSecond": 0.016666666666666666,
"perOperationStrategies": [],
"defaultUpperBoundTracesPerSecond": 0
}
} |
a) I don't know which config you are using, the only configs where remote sampling is defined are cmd/jaeger/config.yaml and cmd/jaeger/internal/all-in-one.yaml b) the default port is different
|
yes this is one place where you need to comment out the check. We should create an option in v2 which by default would disable that check and in the future if OTEL SDKs start sending proper sampler identification we can re-enable it. When you say it doesn't work, does it record the throughput in storage or is there another check somewhere? |
Thank you so much! Using 5778 fixed the issue. |
@yurishkuro I commented out https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L165-L167 and https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L132-L134. The probability however is 0 and we end up entering in https://github.com/jaegertracing/jaeger/blob/main/plugin/sampling/strategyprovider/adaptive/aggregator.go#L132-L134. How is the sampler supposed to work when the |
please stamp #5953, to reduce the confusion a bit. Did you try this instead?
Because it you let the type remain Unrecognized there are other places that may get bypassed. it may still not work though, because the probability will be passed down as 0 and there's nothing we can do about that - we just need to find places where that triggers a different logic. Q: have you looked at the traces being generated by the OTEL SDK with remote sampler? Do they say nothing at all about the type of sampler used? |
I made the change you suggested but still getting an empty response for Here's what I see in the logs, specifically:
|
This method that generates the responses iterates over |
if throughout is being recorded, that list should not be empty.
the output is cut off. I meant look in Jaeger UI to see what tags are recorded on the root span. |
yeah, that's the problem. The adaptive sampling can still work, we just need to clean up the code paths that expect this information from the spans to be available. NB: please stamp another PR where I moved the code around to make it more logically organized https://github.com/jaegertracing/jaeger/pull/5954/files |
@yurishkuro I did some more debugging and found that this check always returns false so nothing gets appended to the slice. Specifically both |
I think this is the one you need to override to return true |
hm, it's possible that there is a bug here because adaptive storage on memory storage is a relatively new addition and since we never had e2e tests there's no way to validate that this is correct. Although we do have e2e storage tests so I would expect this to still work correctly. Another reason could be if you didn't wait long enough, because the adaptive sampling is working slowly, with 1min granularity, so you need to wait 2-3 min for data to show up. What would be great is to add an expvar reporting of the post-aggregator |
Would this be in addition to the change you suggested in #5717 (comment)? |
Ah okay I see. I did wait for a few minutes to see if the change took effect but still nothing. I'll try playing around with it a bit more and lowering the calculation interval. How would the expvar workflow look like? I've never used that package before. I'll also read more about it. |
yes
expvar is standard Go lib that defines "variables" to which we can assign values and those values can be inspected at a /expvar/debug endpoint (provided that it's mounted at some server). We used it in jaeger-v1 for some thing. Another alternative would be to use the zpages extension, but I don't think today it is designed to accept external handlers. Or we could just add expvar page to zpages extension (open-telemetry/opentelemetry-collector#11081). |
Sounds good! I can create a separate issue for this. |
@yurishkuro I did some more debugging and here's what I'm seeing:
When I remove the {
"strategyType": 0,
"operationSampling": {
"defaultSamplingProbability": 1,
"defaultLowerBoundTracesPerSecond": 0.016666666666666666,
"perOperationStrategies": [
{
"operation": "lets-go",
"probabilisticSampling": {
"samplingRate": 0.00001
}
}
],
"defaultUpperBoundTracesPerSecond": 0
}
} |
For the first point in my previous post, I think the default calculation delay was too high (2 minutes) so I lowered it to 10 seconds for the time being and I seem to be able to read some throughput values now. The second issue remains however. |
(2) I think you found a major bug likely introduced in the recent refactoring. Before the refactoring the whole adaptive sampling implementation was more like a single black box running in v1 collector. When adopting it to v2 we needed to split the processing / calculation part from the strategy serving part, since the former (aggregator + post-aggregator) needs to run as a trace processor in the OTEL pipeline and the latter (provider) needs to run as an extension in order to expose a new server for I think in the previous version the "provider" code had access to the cache that the aggregator was created when it was the leader, and therefore leader provider did not need to load the sampling from storage. Now this has changed, I don't think provider should be relying on leader-follower at all, all instances of the provider can just read the probabilities from storage. Ironically, at first I thought you were mistaken about point (2) so I started writing an implementation README (please stamp #5955), and during writing realized that we have this issue. Feel free to remove |
@yurishkuro I have a question about the calculation. I'm running tracegen as follows to generate 10 traces per second: tracegen:
image: jaegertracing/jaeger-tracegen:latest
environment:
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
command: ["-adaptive-sampling", "http://jaeger:5778/api/sampling", "-pause", "100ms", "-duration", "60m"]
depends_on:
- jaeger and here is my configuration: adaptive:
sampling_store: some_store
initial_sampling_probability: 1.0
target_samples_per_second: 1
calculation_interval: 10s
calculation_delay: 20s When I query the sampling endpoint after a minute or so, I get the following response: {
"strategyType": 0,
"operationSampling": {
"defaultSamplingProbability": 1,
"defaultLowerBoundTracesPerSecond": 0.016666666666666666,
"perOperationStrategies": [
{
"operation": "lets-go",
"probabilisticSampling": {
"samplingRate": 0.00001
}
}
],
"defaultUpperBoundTracesPerSecond": 0
}
} Why is the |
do you see this rate changing overtime at all, or it is always the same? If the same, maybe it's the default value that's being returned and the adaptive sampling part is actually not working, i.e. it's not recognizing the traffic. I create a helper extension #5986 to expose |
@yurishkuro The rate does seem to be changing over time but the calculation doesn't seem to make sense. I am hardcoding this function right now to return true for the time being - could that be why? I also wanted to ask what configuration we should use for the time being to bypass this check. Thanks for adding the extension in! I'll play around with it. |
You need to hardcode bypasses for both checks. I think it's best to implement reporting of aggregator's internal state to expvar, then it will make is a lot easier to investigate what happens. |
@yurishkuro I pushed up my working state at the moment. It looks like hardcoding the Let me know if you have any thoughts on how to proceed. I'll work on adding the expvar for the aggregator's service cache. |
We never had these tests properly setup. Now that Jaeger SDKs are deprecated we don't even know if there is a working example of adaptive sampling using OTEL SDKs. We need to set up integration tests to validate that the functionality is still working.
There are two possible ways we could validate adaptive sampling:
And even simpler alternative might be to just detect that the pre-configured default probability (which would be returned by the backend in the absence of any traffic for a given service) would change to a different calculated probability. For example, we could configure default probability as 100%, configure adaptive sampling target to be 1 trace per second, and run load generator with 10 traces per second. We would expect the calculated rate to go down to about 10% (that's the strategy 1 above), but even a change to 50% would already be an indication that adaptive sampling is working end to end. We don't need the e2e test to validate the exact calculation, since we do that in the unit tests.
The text was updated successfully, but these errors were encountered: