-
-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support distributed trace sampling #142
base: master
Are you sure you want to change the base?
Changes from all commits
b741d10
8989810
230254c
e625d6e
49bbdfd
0269408
6ae8458
130ea6e
71eddca
fb3db25
2c7253f
30a7d11
8b36101
004fa12
a07a0fa
2a0b286
605d3a0
99b24f0
6557fd0
0d1dd2c
9b8bd5a
f17ae74
fa41b1c
25b08d0
ef04a4e
734eb1f
a2fd94e
ab8b663
a40b1cb
80a7ce2
7174fcb
fb29312
06be739
f998617
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
--- | ||
name: CI | ||
on: push | ||
|
||
jobs: | ||
elixir: | ||
name: Elixir Tests | ||
runs-on: ubuntu-20.04 | ||
env: | ||
MIX_ENV: test | ||
strategy: | ||
matrix: | ||
elixir: ['1.11.4', '1.13.4'] | ||
otp: ['24.3'] | ||
steps: | ||
- name: Cancel previous runs | ||
uses: styfle/[email protected] | ||
with: | ||
access_token: ${{ github.token }} | ||
|
||
- name: Checkout code | ||
uses: actions/checkout@v2 | ||
|
||
- name: Setup Elixir | ||
uses: erlef/setup-beam@v1 | ||
with: | ||
otp-version: ${{ matrix.otp }} | ||
elixir-version: ${{ matrix.elixir }} | ||
|
||
- name: Get deps cache | ||
uses: actions/cache@v2 | ||
with: | ||
path: deps/ | ||
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }} | ||
|
||
- name: Get build cache | ||
uses: actions/cache@v2 | ||
with: | ||
path: _build/test/ | ||
key: build-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }} | ||
|
||
- name: Install deps | ||
run: | | ||
mix local.rebar --force | ||
mix local.hex --force | ||
mix deps.get | ||
|
||
- name: Compile code | ||
run: | | ||
# Use of @deprecated causes compile warnings | ||
# mix compile --warnings-as-errors | ||
mix compile | ||
|
||
- name: Run tests | ||
run: | | ||
mix test | ||
mix format --check-formatted | ||
mix coveralls | ||
mix inch.report | ||
|
||
# - name: Publish unit test results to GitHub | ||
# uses: EnricoMi/publish-unit-test-result-action@v2 | ||
# if: always() # always run even if tests fail | ||
# with: | ||
# junit_files: _build/test/lib/*/test-junit-report.xml |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,18 @@ See [Conventional Commits](Https://conventionalcommits.org) for commit guideline | |
|
||
<!-- changelog --> | ||
|
||
### Breaking Changes: | ||
|
||
* Don't set default priority to support distributed traces and sampling | ||
* Use new `Config` instead of `Mix.Config` in `config` files | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this is a breaking change at all in terms of people using Spandex, right? I just want to be sure I'm not missing something there. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The effect of not defaulting sampling to 1 is that the sampling metric ( Failure to set sampling by default is unlikely to cause problems. On the other hand, setting it by default makes it difficult to actually use distributed sampling, resulting in higher Datadog bills. |
||
* Update libraries | ||
|
||
### Bug Fixes: | ||
* Apply fix for interpolated string span and trace names from PR #136 | ||
|
||
### Features: | ||
* Add functions to get and set trace priority | ||
|
||
## [3.1.0](https://github.com/spandex-project/spandex/compare/3.0.3...3.1.0) (2021-10-23) | ||
|
||
* Encode logger metadata as string. by @aselder in https://github.com/spandex-project/spandex/pull/127 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,12 @@ | ||
# This file is responsible for configuring your application | ||
# and its dependencies with the aid of the Mix.Config module. | ||
use Mix.Config | ||
# and its dependencies with the aid of the Config module. | ||
# | ||
# This configuration file is loaded before any dependency and | ||
# is restricted to this project. | ||
|
||
import_config "#{Mix.env()}.exs" | ||
# General application configuration | ||
import Config | ||
|
||
# Import environment specific config. This must remain at the bottom | ||
# of this file so it overrides the configuration defined above. | ||
import_config "#{config_env()}.exs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if you have any documentation references from Datadog about not deciding a value / sending out a sampling decision downstream unless one is received upstream or decided in the application. Shouldn't we always default to choose something, as opposed to leaving it
nil
? I like the idea of adding a probabilistic sampler instead of having it always hard-coded, but I'm curious if there's some reason that it makes sense for it to ever benil
, when there are already provisions for auto/manual keep/reject.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Datdog docs are here: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/
That is not super clear, however. I gleaned most of my knowledge by reading the Datadog client library for Ruby and doing things the same way.
The theory is this:
_sampling_priority_v1
when uploading the trace, and it communicates that decision to downstream processes by setting thex-datagog-sampling-priority
HTTP header. Downstream processes will then consider the sampling decision already made and preserve it.To answer your question, if we don't set a priority, then the trace is still sent to Datadog, and Datadog will ingest it. Setting it to 1 by default means that we are saying that a decision has been made to sample this trace, when no decision has actually been made. This can hinder a downstream process from making the decision to trace or not. For example, it's pretty common for the trace to be started in e.g. a load balancer, but it doesn't make the decision about sampling. By leaving the decision unmade, something else can handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the details here! I did read through those Datadog docs, but as you said, it's not really written for implementers, but just for general APM users to understand how it works and not all the details of how to implement it.
I can definitely see your point about not wanting to have a load-balancer make the sampling decision, but in the context of Spandex, wouldn't an Elixir application always want to make a decision one way or the other? If I understand correctly, there are a few possible scenarios:
0
or1
based on a configurable probability, but that's not currently built into Spandex. Since Spandex has made a decision, that decision will be propagated to downstream services and also to the Datadog Agent. Since the decision is passed to the Datadog Agent, it's no longer possible to configure the sample rate in Datadog's Agent for all services because the service is configuring it_sampling_priority_v1
and related metadata in the first place. I'll try to do some more experimentation there to see how it behaves if you never send any spans with this metadata, but I currently don't have access to a Datadog testing environment because I'm waiting for them to renew the trial account that I use for development.RateSampler
like you've already proposed to add tospandex_datadog
. It might make sense to add this directly tospandex
instead... I'd have to think about that. This would allow for configurable probabilistic sampling per service (probably at the level of aSpandex.Tracer
instance). The benefit here is that it's an additive feature to what exists today, and it could be as simple as always using theRateSampler
but having it default to 100%, allowing the user to configure it to whatever smaller rate they want.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The background is that we are spending crazy amounts on Datadog and need to implement sampling.
We need all decisions about sampling and span analytics to be explicit. The current defaults force everything on.
Using
nil
by default for priority allows us to distinguish when priority has been set by an upstream header.A plug can then make a decision about whether to set the value, i.e. if it has been set upstream by a header, then use it, otherwise make a sampling decision. If it is 1 by default, then we have no way of knowing what happened.
I believe Datadog will behave reasonably if no priority is set when sending traces. The Ruby Datadog client library does not set
_sampling_priority_v1
if priority is not defined, and I think we should do the same.The same thing applies to other philosophical questions, just do what the official Datadog clients do :-)