Better Tracing #2032

balegas · 2024-11-22T15:08:36Z

We have set up basic tracing for Electric, but we want to keep improving it so that we become more successful at investigating issues.

We're following our intuition a bit, and we will continue to do so. Having real traces that we need to investigate helps us focus on the things that need more observability.

In some cases we're abusing the use of spans, which might lead to high volumes of traces (these are billed). We should try to replace some spans with attributes and reserve the finer details for a TRACE mode that we can activate to increase the level of detail when investigating incidents.

balegas · 2024-11-24T22:49:48Z

note #1738

balegas · 2024-11-25T09:20:52Z

in some cases we're abusing the use of spans which might lead to high volumes of traces (these are billed). We should try to replace some traces for ...

We want to be able to turn on this level of detail if we need to investigate an incident, but during normal execution we should zoom out and move more of the details into span attributes. I'm not sure if there is an established pattern for doing that with OTEL; otherwise we could put the detailed spans behind a flag.
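A minimal sketch of what "behind a flag" could look like, kept deliberately small. Everything here is hypothetical and not part of Electric: the `DetailSpan` module, the `detailed?` flag, and the `tracer` map of two functions (injected so the sketch runs without the OpenTelemetry dependency). In the real code those two functions would presumably wrap the existing span helper and an attribute setter.

```elixir
defmodule DetailSpan do
  # Hypothetical helper, not part of Electric.
  # `tracer` is a map with two functions (an assumption made for this sketch):
  #   with_span: (name, attrs, fun) -> result of fun
  #   set_attributes: (attrs) -> any

  # Runs `fun`, recording `attrs` either on a dedicated span (TRACE mode)
  # or as attributes on whatever span is already open (normal mode).
  def run(name, attrs, fun, tracer, detailed?) do
    if detailed? do
      # TRACE mode: pay for a dedicated child span.
      tracer.with_span.(name, attrs, fun)
    else
      # Normal mode: fold the data into the enclosing span and run the work.
      tracer.set_attributes.(namespace(name, attrs))
      fun.()
    end
  end

  # Prefix each key with the would-be span name so attributes coming from
  # different steps do not collide on the parent span.
  def namespace(name, attrs) do
    Map.new(attrs, fn {key, value} -> {"#{name}.#{key}", value} end)
  end
end
```

The call site stays the same in both modes, so the flag only changes how much is exported, not how the code is structured.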

For the top-level number of transactions we handle, the developer can set up sampling.

robacourt · 2024-11-25T10:32:21Z

We should try to replace some spans with attributes and reserve the finer details for a TRACE mode that we can activate to increase the level of detail when investigating incidents

Beware of making the code complicated in order to achieve this!

A great way to reduce the number of traces is sampling: https://opentelemetry.io/docs/concepts/sampling/ It can reduce traffic 10- to 100-fold without losing detail in the traces that are kept.
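For reference, a sketch of how head sampling is typically configured with the built-in samplers of the Erlang/Elixir OpenTelemetry SDK. The 0.1 ratio is only an example value, not a recommendation from this thread.

```elixir
import Config

# Example ratio (an assumption): keep roughly 10% of new traces.
# `parent_based` makes child spans follow their parent's decision,
# so a sampled trace is exported whole rather than in fragments.
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.1}}}
```

This applies one ratio to every root span, which is the simplest option but does not distinguish between span types.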

Part of #2032.

I removed 2 spans that seem redundant:

- `shape_write.log_collector.handle_txn`: this span wraps the `handle_transaction` function. However, there is already a span `pg_txn.replication_client.transaction_received` that in fact calls into `handle_transaction`.
- `shape_write.log_collector.handle_relation`: this span wraps the handling of relation messages. Similarly to transactions, there is a `pg_txn.replication_client.relation_received` span that ends up calling into the `handle_relation` function.

Here are the relevant code snippets:

`Electric.Postgres.ReplicationClient`:

```ex
{m, f, args} = state.transaction_received

OpenTelemetry.with_span(
  "pg_txn.replication_client.transaction_received",
  [num_changes: length(txn.changes), num_relations: MapSet.size(txn.affected_relations)],
  fn -> apply(m, f, [txn | args]) end
)
```

The call to `apply` is a chain of calls that eventually ends up calling `handle_transaction` (and does nothing more):

`Electric.StackSupervisor`:

```ex
transaction_received:
  {Electric.Replication.ShapeLogCollector, :store_transaction, [shape_log_collector]}
```

`Electric.Replication.ShapeLogCollector.ex`:

```ex
def store_transaction(%Transaction{} = txn, server) do
  ot_span_ctx = OpenTelemetry.get_current_context()
  GenStage.call(server, {:new_txn, txn, ot_span_ctx}, :infinity)
end

def handle_call({:new_txn, %Transaction{xid: xid, lsn: lsn} = txn, ot_span_ctx}, from, state) do
  OpenTelemetry.set_current_context(ot_span_ctx)
  Logger.info("Received transaction #{xid} from Postgres at #{lsn}")
  Logger.debug(fn -> "Txn received in ShapeLogCollector: #{inspect(txn)}" end)

  OpenTelemetry.with_span("shape_write.log_collector.handle_txn", [], fn ->
    handle_transaction(txn, from, state)
  end)
end
```

### Question

I removed the spans from `Electric.Replication.ShapeLogCollector`. Another option would be to remove the spans from `Electric.Postgres.ReplicationClient` instead. Any preference here?

Part of #2032.

We noticed that most spans are descendants of the `pg_txn.replication_client.process_x_log_data` root span. Therefore, we decided to only sample a portion of those spans. This PR introduces a custom sampler that works as follows:

- Samples a configurable ratio of `pg_txn.replication_client.process_x_log_data` root spans
- Samples all other root spans
- Child spans are sampled if their parent is sampled

### Problem

We would like to sample all errors. To do this, we need to make the sampling decision at the end of the span, when we have all attributes and events, because errors are recorded using an "exception" event. This is known as tail sampling. However, the Erlang opentelemetry library only seems to support head sampling:

> Sampling is performed at span creation time by the Sampler configured on the Tracer

See https://docs.honeycomb.io/manage-data-volume/sample/techniques/ if you're not familiar with head vs tail sampling.

EDIT: solving this problem may require using Honeycomb's "Refinery" mechanism. Electric would sample all traces and we would set up Refinery with custom tail sampling logic. However, this requires extra infrastructure to set up.
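The three sampling rules described above can be sketched as a pure decision function. This is an illustration only, not the sampler from the PR: the `SamplerSketch` module, the `:none | :sampled | :not_sampled` parent encoding, and the modulo-based ratio check are assumptions made to keep the example free of dependencies. A real sampler would plug into the SDK's sampler interface and use its own trace-ID arithmetic.

```elixir
defmodule SamplerSketch do
  # Illustrative only; not the sampler implemented in the PR.
  @ratio_root "pg_txn.replication_client.process_x_log_data"
  @buckets 1_000_000

  # parent is :none for root spans, otherwise :sampled or :not_sampled.
  # Returns :record_and_sample or :drop.

  # Child spans follow their parent's decision.
  def decide(_name, :sampled, _trace_id, _ratio), do: :record_and_sample
  def decide(_name, :not_sampled, _trace_id, _ratio), do: :drop

  # The high-volume root span is sampled at a configurable ratio.
  # Deriving the decision from the trace ID keeps it deterministic per trace.
  def decide(@ratio_root, :none, trace_id, ratio) do
    if rem(trace_id, @buckets) < trunc(ratio * @buckets) do
      :record_and_sample
    else
      :drop
    end
  end

  # Every other root span is always sampled.
  def decide(_name, :none, _trace_id, _ratio), do: :record_and_sample
end
```

Because the decision is made when the span starts, it cannot look at events recorded later, which is why this approach alone cannot guarantee that every error is kept.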

KyleAMathews · 2024-12-12T02:05:50Z

@balegas is more work needed here?

balegas · 2024-12-16T14:17:22Z

@icehaunter let's add the Electric metrics to Honeycomb.

balegas added this to the Production Readiness milestone Nov 22, 2024

kevin-dp mentioned this issue Nov 25, 2024

chore (sync service): remove redundant spans #2037

Merged

balegas added the tracing label Nov 26, 2024

kevin-dp mentioned this issue Nov 27, 2024

chore (sync service): probablistic sampling of transaction spans #2054

Merged

balegas assigned icehaunter Dec 16, 2024
