Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support distributed trace sampling #142

Open
wants to merge 34 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
b741d10
Make priority optional
reachfh Aug 9, 2022
8989810
Use Config instead of Mix.Config
reachfh Aug 9, 2022
230254c
Upgrade dialyxir
reachfh Aug 9, 2022
e625d6e
Upgrade dialyxir
reachfh Aug 9, 2022
49bbdfd
mix format
reachfh Aug 9, 2022
0269408
Upgrade credo and inch
reachfh Aug 9, 2022
6ae8458
Upgrade libraries
reachfh Aug 9, 2022
130ea6e
PR136: Fix interpolated string span and trace names
reachfh Aug 9, 2022
71eddca
Add functions to get and set trace priority
reachfh Aug 11, 2022
fb3db25
Update CHANGELOG
reachfh Aug 20, 2022
2c7253f
Add docs
reachfh Aug 22, 2022
30a7d11
Use new names for priority constants
reachfh Aug 22, 2022
8b36101
Allow Spandex.Tracer functions to be overrided
reachfh Aug 22, 2022
004fa12
Use bind_quoted for opts
reachfh Aug 22, 2022
a07a0fa
Fix typo
reachfh Aug 22, 2022
2a0b286
Document overriding span_error
reachfh Aug 22, 2022
605d3a0
Add GitHub Action
reachfh Aug 23, 2022
99b24f0
mix format
reachfh Aug 23, 2022
6557fd0
Disable --warnings-as-errors
reachfh Aug 23, 2022
0d1dd2c
Doc
reachfh Aug 23, 2022
9b8bd5a
Fix name
reachfh Aug 23, 2022
f17ae74
Add :file and :line to test logs
reachfh Aug 23, 2022
fa41b1c
Use Elixir 1.11 test image; Run mix format check after deps.get
reachfh Aug 23, 2022
25b08d0
Require Elixir 1.11
reachfh Aug 23, 2022
ef04a4e
Update docs
reachfh Aug 24, 2022
734eb1f
Don't send coveralls report to server
reachfh Aug 24, 2022
a2fd94e
Test older versions of Elixir
reachfh Aug 24, 2022
ab8b663
Use setup-elixir instead of setup-beam
reachfh Aug 24, 2022
a40b1cb
Use later version of setup-elixir
reachfh Aug 24, 2022
80a7ce2
Specify detailed versions of Elxir and OTP
reachfh Aug 24, 2022
7174fcb
Fix variable
reachfh Aug 24, 2022
fb29312
Try setup-beam again
reachfh Aug 24, 2022
06be739
Use correct name
reachfh Aug 24, 2022
f998617
Only test on OTP 24
reachfh Aug 25, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 7 additions & 6 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,9 @@ version: 2
jobs:
build:
docker:
# specify the version here
- image: circleci/elixir:1.7
- image: cimg/elixir:1.11
environment:
MIX_ENV: test

# Specify service dependencies here if necessary
# CircleCI maintains a library of pre-built images
Expand All @@ -16,12 +17,12 @@ jobs:
working_directory: ~/spandex
steps:
- checkout

# specify any bash command here prefixed with `run: `
- run: mix format --check-formatted
- run: mix local.hex --force
- run: mix local.rebar --force
- run: mix deps.get
- run: mix format --check-formatted
- run: mix compile --warnings-as-errors
- run: mix coveralls.circle
# (ExCoveralls.ReportUploadError) Failed to upload the report to 'https://coveralls.io' (reason: status_code = 422, body = {"message":"Couldn't find a repository matching this job.","error":true}).
# - run: mix coveralls.circle
- run: mix coveralls
- run: mix inch.report
65 changes: 65 additions & 0 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
---
name: CI
on: push

jobs:
elixir:
name: Elixir Tests
runs-on: ubuntu-20.04
env:
MIX_ENV: test
strategy:
matrix:
elixir: ['1.11.4', '1.13.4']
otp: ['24.3']
steps:
- name: Cancel previous runs
uses: styfle/[email protected]
with:
access_token: ${{ github.token }}

- name: Checkout code
uses: actions/checkout@v2

- name: Setup Elixir
uses: erlef/setup-beam@v1
with:
otp-version: ${{ matrix.otp }}
elixir-version: ${{ matrix.elixir }}

- name: Get deps cache
uses: actions/cache@v2
with:
path: deps/
key: deps-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }}

- name: Get build cache
uses: actions/cache@v2
with:
path: _build/test/
key: build-${{ runner.os }}-${{ matrix.otp }}-${{ matrix.elixir }}-${{ hashFiles('**/mix.lock') }}

- name: Install deps
run: |
mix local.rebar --force
mix local.hex --force
mix deps.get

- name: Compile code
run: |
# Use of @deprecated causes compile warnings
# mix compile --warnings-as-errors
mix compile

- name: Run tests
run: |
mix test
mix format --check-formatted
mix coveralls
mix inch.report

# - name: Publish unit test results to GitHub
# uses: EnricoMi/publish-unit-test-result-action@v2
# if: always() # always run even if tests fail
# with:
# junit_files: _build/test/lib/*/test-junit-report.xml
12 changes: 12 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,18 @@ See [Conventional Commits](Https://conventionalcommits.org) for commit guideline

<!-- changelog -->

### Breaking Changes:

* Don't set default priority to support distributed traces and sampling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if you have any documentation references from Datadog about not deciding a value / sending out a sampling decision downstream unless one is received upstream or decided in the application. Shouldn't we always default to choose something, as opposed to leaving it nil? I like the idea of adding a probabilistic sampler instead of having it always hard-coded, but I'm curious if there's some reason that it makes sense for it to ever be nil, when there are already provisions for auto/manual keep/reject.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Datdog docs are here: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/
That is not super clear, however. I gleaned most of my knowledge by reading the Datadog client library for Ruby and doing things the same way.

The theory is this:

  • When we are doing "head sampling", the first process in the distributed trace makes a decision about whether this specific trace should be sampled. So it could e.g. say that only 10% of the traces should be sampled, and 10% of the time it would set priority to 1. The rest of the time it would set it to 0. That means that it has made a decision. It passes that decision to Datadog by setting _sampling_priority_v1 when uploading the trace, and it communicates that decision to downstream processes by setting the x-datagog-sampling-priority HTTP header. Downstream processes will then consider the sampling decision already made and preserve it.
  • If an error occurs in a trace that was otherwise not sampled (priority = 0), then the app can select it for tracing after the fact, by setting priority = 2. The Datadog model is that it receives all of the traces, then makes a decision about which ones to index or retain long term. So it can still include the complete trace across all apps, avoiding broken traces.
  • This model is different from throwing away traces when sampling, i.e. deciding to not do tracing at all or not send the trace to Datadog. That reduces the overhead of tracing, but results in broken traces in a distributed environment and metrics based on counting spans are incorrect.

To answer your question, if we don't set a priority, then the trace is still sent to Datadog, and Datadog will ingest it. Setting it to 1 by default means that we are saying that a decision has been made to sample this trace, when no decision has actually been made. This can hinder a downstream process from making the decision to trace or not. For example, it's pretty common for the trace to be started in e.g. a load balancer, but it doesn't make the decision about sampling. By leaving the decision unmade, something else can handle it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the details here! I did read through those Datadog docs, but as you said, it's not really written for implementers, but just for general APM users to understand how it works and not all the details of how to implement it.

I can definitely see your point about not wanting to have a load-balancer make the sampling decision, but in the context of Spandex, wouldn't an Elixir application always want to make a decision one way or the other? If I understand correctly, there are a few possible scenarios:

  1. (Current situation) Spandex defaults to sampling priority 1 for all traces. It would be pretty easy for someone to make a mechanism to randomly choose priority 0 or 1 based on a configurable probability, but that's not currently built into Spandex. Since Spandex has made a decision, that decision will be propagated to downstream services and also to the Datadog Agent. Since the decision is passed to the Datadog Agent, it's no longer possible to configure the sample rate in Datadog's Agent for all services because the service is configuring it
  2. (Proposed change in this PR) Spandex doesn't make any decision by default, leaving it up to the Datadog Agent configuration. It doesn't propagate a decision to downstream services or to the Datadog Agent unless a decision is made somehow in application code, or a decision is passed in from upstream. It's still possible to insert a mechanism to make the decision, as described in scenario 1 above, e.g. via a Plug. The down-side of this scenario is that if someone upgrades their Spandex deps without adding a decision-making mechanism anywhere, I believe they would lose access to the trace analytics feature in Datadog altogether, and they would probably not know why. I haven't confirmed this, but I believe that's the reason we started to include the _sampling_priority_v1 and related metadata in the first place. I'll try to do some more experimentation there to see how it behaves if you never send any spans with this metadata, but I currently don't have access to a Datadog testing environment because I'm waiting for them to renew the trial account that I use for development.
  3. (Alternate solution) Spandex always makes a sampling decision, but defaults to sampling priority 1 for all traces, for backward-compatibility, but offers a new RateSampler like you've already proposed to add to spandex_datadog. It might make sense to add this directly to spandex instead... I'd have to think about that. This would allow for configurable probabilistic sampling per service (probably at the level of a Spandex.Tracer instance). The benefit here is that it's an additive feature to what exists today, and it could be as simple as always using the RateSampler but having it default to 100%, allowing the user to configure it to whatever smaller rate they want.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The background is that we are spending crazy amounts on Datadog and need to implement sampling.
We need all decisions about sampling and span analytics to be explicit. The current defaults force everything on.

Using nil by default for priority allows us to distinguish when priority has been set by an upstream header.
A plug can then make a decision about whether to set the value, i.e. if it has been set upstream by a header, then use it, otherwise make a sampling decision. If it is 1 by default, then we have no way of knowing what happened.

I believe Datadog will behave reasonably if no priority is set when sending traces. The Ruby Datadog client library does not set _sampling_priority_v1 if priority is not defined, and I think we should do the same.
The same thing applies to other philosophical questions, just do what the official Datadog clients do :-)

* Use new `Config` instead of `Mix.Config` in `config` files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a breaking change at all in terms of people using Spandex, right? I just want to be sure I'm not missing something there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of not defaulting sampling to 1 is that the sampling metric (_sampling_priority_v1) won't be set when sending traces to Datadog. If it's not set, then Datadog will do whatever its default is, instead of considering the trace to be selected for sampling. The x-datagog-sampling-priority header won't be set for downstream requests, and downstream processes won't consider the request selected for sampling.

Failure to set sampling by default is unlikely to cause problems. On the other hand, setting it by default makes it difficult to actually use distributed sampling, resulting in higher Datadog bills.

* Update libraries

### Bug Fixes:
* Apply fix for interpolated string span and trace names from PR #136

### Features:
* Add functions to get and set trace priority

## [3.1.0](https://github.com/spandex-project/spandex/compare/3.0.3...3.1.0) (2021-10-23)

* Encode logger metadata as string. by @aselder in https://github.com/spandex-project/spandex/pull/127
Expand Down
55 changes: 53 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ monitoring tool that allows you get extremely granular information about the
runtime of your system. Using distributed tracing, you can also get a view of
how requests make their way through your entire ecosystem of microservices or
applications. Currently, Spandex only supports integrating with
[datadog](https://www.datadoghq.com/), but it is built to be agnostic to what
[Datadog](https://www.datadoghq.com/), but it is built to be agnostic to what
platform you choose to view your trace data. Eventually it should support Open
Zipkin, Stackdriver, and any other trace viewer/aggregation tool you'd like to
integrate with. We are still under active development, working on moving to a
Expand All @@ -38,7 +38,7 @@ This is Datadog-specific since that's currently the only adapter.
## Adapters

* [Datadog](https://github.com/spandex-project/spandex_datadog)
* Thats it so far! If you want another adapter, it should be relatively easy to
* That's it so far! If you want another adapter, it should be relatively easy to
write! This library is in charge of handling the state management of spans,
and the adapter is just in charge of generating certain values and ultimately
sending the values to the service.
Expand Down Expand Up @@ -282,3 +282,54 @@ Check out [spandex_ecto](https://github.com/spandex-project/spandex_ecto).
## Phoenix Tracing

Check out [spandex_phoenix](https://github.com/spandex-project/spandex_phoenix).

## Sampling and Rate Limiting

When the load or cost from tracing increases, it is useful to use rate limiting
or sampling to reduce tracing. When many traces are the same, it's enough to
trace only e.g. 10% of them, reducing the bill by 90% while still preserving
the ability to troubleshoot the system. The tracing still happens, but it
may not be sent to the monitoring service, or the service may drop it or not
retain detailed information.

Spandex stores the `priority` as an integer in the top level `Trace`.

In Datadog, there are four values:
* `MANUAL_KEEP`(2) indicates that the application wants to ensure that a trace is
sampled, e.g. if there is an error
* `AUTO_KEEP` (1) indicates that a trace has been selected for sampling
* `AUTO_REJECT` (0) indicates that the trace has not been selected for sampling
* `MANUAL_REJECT` (-1) indicates that the application wants a trace to be dropped

Similarly, OpenTracing uses 0 and 1 to indicate that a trace is sampled.

In distributed tracing, multiple processes contribute to the same trace. When
sampling, the process that starts the trace can make a decision about whether
it should be sampled. It then passes that information to downstream processes
via an HTTP header.

A trace may be sampled out, i.e. priority of 0, but the application can
override the priority manually. This is usually done for requests with errors,
as they are the ones that need troubleshooting. You can also enable tracing
dynamically with a feature flag to debug a feature in production.

Spandex has functions to read and set the priority
(`Spandex.Tracer.current_priority/1` and `Spandex.Tracer.update_priority/2`).

The following code overrides the `span_error` function on the tracer to set the
priority to `MANUAL_KEEP` when there is an error on a trace:

```elixir
defmodule Foo.Tracer do
use Spandex.Tracer, otp_app: :foo

@impl Spandex.Tracer
def span_error(error, stacktrace, opts) do
super(error, stacktrace, opts)
__MODULE__.update_priority(2)
end
```

The specific details of priority and other sampling and rate limiting are specific
to the observability back end, so look to e.g.
[spandex_datadog](https://github.com/spandex-project/spandex_datadog) for details.
13 changes: 10 additions & 3 deletions config/config.exs
Original file line number Diff line number Diff line change
@@ -1,5 +1,12 @@
# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
use Mix.Config
# and its dependencies with the aid of the Config module.
#
# This configuration file is loaded before any dependency and
# is restricted to this project.

import_config "#{Mix.env()}.exs"
# General application configuration
import Config

# Import environment specific config. This must remain at the bottom
# of this file so it overrides the configuration defined above.
import_config "#{config_env()}.exs"
4 changes: 1 addition & 3 deletions config/dev.exs
Original file line number Diff line number Diff line change
@@ -1,6 +1,4 @@
# This file is responsible for configuring your application
# and its dependencies with the aid of the Mix.Config module.
use Mix.Config
import Config

config :git_ops,
mix_project: Spandex.Mixfile,
Expand Down
4 changes: 2 additions & 2 deletions config/test.exs
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
use Mix.Config
import Config

config :logger, :console,
level: :debug,
colors: [enabled: false],
format: "$time $metadata[$level] $message\n",
metadata: [:trace_id, :span_id]
metadata: [:trace_id, :span_id, :file, :line]

config :spandex, :decorators, tracer: Spandex.Test.Support.Tracer

Expand Down
4 changes: 2 additions & 2 deletions lib/span_context.ex
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,12 @@ defmodule Spandex.SpanContext do
@type t :: %__MODULE__{
trace_id: Spandex.id(),
parent_id: Spandex.id(),
priority: integer(),
priority: integer() | nil,
baggage: Keyword.t()
}

defstruct trace_id: nil,
parent_id: nil,
priority: 1,
priority: nil,
baggage: []
end
46 changes: 42 additions & 4 deletions lib/spandex.ex
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,9 @@ defmodule Spandex do
| {:error, :disabled}
| {:error, :trace_running}
| {:error, [Optimal.error()]}
def start_trace(_, :disabled), do: {:error, :disabled}
def start_trace(name, :disabled) when is_binary(name), do: {:error, :disabled}

def start_trace(name, opts) do
def start_trace(name, opts) when is_binary(name) do
strategy = opts[:strategy]

if strategy.trace_active?(opts[:trace_key]) do
Expand All @@ -55,9 +55,9 @@ defmodule Spandex do
{:ok, Span.t()}
| {:error, :disabled}
| {:error, :no_trace_context}
def start_span(_, :disabled), do: {:error, :disabled}
def start_span(name, :disabled) when is_binary(name), do: {:error, :disabled}

def start_span(name, opts) do
def start_span(name, opts) when is_binary(name) do
strategy = opts[:strategy]

case strategy.get_trace(opts[:trace_key]) do
Expand Down Expand Up @@ -104,6 +104,24 @@ defmodule Spandex do
end
end

@doc """
Update the priority of the current trace.
"""
@spec update_priority(integer(), Tracer.opts()) ::
{:ok, Trace.t()}
| {:error, :disabled}
| {:error, :no_trace_context}
| {:error, [Optimal.error()]}
def update_priority(_, :disabled), do: {:error, :disabled}

def update_priority(priority, opts) do
strategy = opts[:strategy]

with {:ok, trace} <- strategy.get_trace(opts[:trace_key]) do
strategy.put_trace(opts[:trace_key], %{trace | priority: priority})
end
end

@doc """
Updates the top-most parent span.

Expand Down Expand Up @@ -247,6 +265,26 @@ defmodule Spandex do
update_span(Keyword.put_new(opts, :error, updates))
end

@doc """
Returns the priority of the currently running trace.
"""
@spec current_priority(Tracer.opts()) :: integer() | nil
def current_priority(:disabled), do: nil

def current_priority(opts) do
strategy = opts[:strategy]

case strategy.get_trace(opts[:trace_key]) do
{:ok, %Trace{priority: priority}} ->
priority

{:error, _} ->
# TODO: Alter the return type of this interface to allow for returning
# errors from fetching the trace.
nil
end
end

@doc """
Returns the id of the currently-running trace.
"""
Expand Down
4 changes: 2 additions & 2 deletions lib/trace.ex
Original file line number Diff line number Diff line change
Expand Up @@ -10,15 +10,15 @@ defmodule Spandex.Trace do
"""
defstruct baggage: [],
id: nil,
priority: 1,
priority: nil,
spans: [],
stack: []

@typedoc @moduledoc
@type t :: %__MODULE__{
baggage: Keyword.t(),
id: Spandex.id(),
priority: integer(),
priority: integer() | nil,
spans: [Spandex.Span.t()],
stack: [Spandex.Span.t()]
}
Expand Down
Loading