-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to OpenTelemetry tracing library #1824
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1824 +/- ##
==========================================
- Coverage 77.70% 77.70% -0.01%
==========================================
Files 294 294
Lines 62210 62200 -10
==========================================
- Hits 48340 48330 -10
Misses 12147 12147
Partials 1723 1723
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Cadence Benchstat comparisonThis branch with compared with the base branch onflow:master commit 226a46f Collapsed results for better readability
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with tracing, which was added by Ramtin.
@ramtinms Could you please have a look here?
@SaveTheRbtz Could you please benchmark locally and see if this change has a significant impact? |
@turbolent FYI I think there are two schools of thought about this:
I did some timing of the old Jaeger instrumentation here [1] and discovered that surprisingly it's even slower than using zerolog just for the stats collection part which presumably doesn't actually send off any data to the Jaeger server. FYI I also suggest using a different form of instrumentation which only records the minimum info -- e.g. timestamp and context -- during the time critical code, into a simple memory buffer. Then at the end of the time critical code, the recorded info can be processed asynchronously and even sent to multiple places concurrently, e.g. zlog, prometheus, and Jaeger. I show a working PoC on dummy code here [1] too. This method is the absolute fastest way to record the minimum info and in tests only spends about 50 nanoseconds per record event. This means if we wanted to, we can record everything without sampling and it's still much faster than sampling with Jaeger. Why? In tests then just recording in the time critical code is about 50 times fasters than logging, which is already faster than Jaeger. Therefore if Jaeger is sampling every 16th TX, because this other technique is 50 times faster, it's possible to sample all TXs and still be faster. This also means that we can add almost arbitrary extra timing instrumentation and still sample all TXs and still be faster... |
@turbolent on the cadence side perf impact is negligible since it only operates on (already escaping) data structures. Maybe even slightly faster on the newer version of the code given that it does not have a separate PS. Flow tracing microbenchmarks are here: onflow/flow-go#2806 (i'm missing a second reviewer there =))
It is quite hard to get a clean run on my laptop and even on a linux server (likely due to Turbo boost), but at least it shows that allocations did not change.
|
@simonhf @SaveTheRbtz Thank you for the info. Looks good! |
@@ -1,10 +1,8 @@ | |||
github.com/onflow/cadence () | |||
|
|||
github.com/davecgh/go-spew/spew (file) | |||
github.com/onflow/cadence/runtime/interpreter () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this expected ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess no entry is equivalent to ()
? @fxamacker do you maybe know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great to me, thanks for using attributes instead of events (aka logs) here, makes more sense
2823: [All] OpenTracing to OpenTelemetry migration r=SaveTheRbtz a=SaveTheRbtz Issue: #2796 OpenTracing spec is deprecated: opentracing/specification#163. Switching to the supported OpenTelemetry implementation. Note that new tracing code would honor [OTEL env vars](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.12.0/specification/protocol/exporter.md) instead of JAEGER ones, e.g.: `JAEGER_ENDPOINT` -> `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, `OTEL_EXPORTER_OTLP_TRACES_INSECURE`, etc (cc: `@sjonpaulbrown` `@haroldsphinx)` New API is slightly slower and consumes more memory but does substantially less allocs (~ -50%): ``` name old time/op new time/op delta StartSpanFromParent-8 485ns ± 6% 639ns ± 1% +31.71% (p=0.000 n=10+9) StartTransactionSpan/cacheHit-8 993ns ± 4% 927ns ± 3% -6.69% (p=0.000 n=10+10) StartTransactionSpan/cacheMiss-8 2.24µs ± 1% 2.19µs ± 1% -2.42% (p=0.000 n=10+10) name old alloc/op new alloc/op delta StartSpanFromParent-8 577B ± 0% 864B ± 0% +49.74% (p=0.000 n=10+10) StartTransactionSpan/cacheHit-8 776B ± 1% 944B ± 0% +21.65% (p=0.000 n=10+8) StartTransactionSpan/cacheMiss-8 2.16kB ± 1% 2.39kB ± 0% +10.62% (p=0.000 n=10+10) name old allocs/op new allocs/op delta StartSpanFromParent-8 7.00 ± 0% 4.00 ± 0% -42.86% (p=0.000 n=10+10) StartTransactionSpan/cacheHit-8 10.3 ± 7% 6.0 ± 0% -41.75% (p=0.000 n=10+10) StartTransactionSpan/cacheMiss-8 35.3 ± 2% 18.0 ± 0% -49.01% (p=0.000 n=10+10) ``` An example of both extensive tracing and cadence tracing enabled: <img width="1459" alt="Screen Shot 2022-07-25 at 10 21 07 PM" src="https://user-images.githubusercontent.com/169976/180929610-edbfdc54-e604-401f-aea5-501af2f861b2.png"> - [x] test against localnet's Tempo instance. - [x] noop tracer. - [x] log tracer. - [x] merge cadence bits onflow/cadence#1824. - [x] merge onflow/flow-core-contracts#298. - [x] update cadence. Co-authored-by: Alexey Ivanov <[email protected]>
Issue: onflow/flow-go#2796
Description
This diff switches deprecated OpenTracing-based library with standardized OpenTelemetry API. This is also needed since we are migrating Flow production tracing to OTEL Collector that prefers native OTEL formats and libraries.
OpenTelemetry says the following in their docs:
PS. While here, regenerated
go.cap
though I have no idea what i'm doing there.PPS. This commit will break flow-go interfaces, so it needs to be coupled with a similar diff there. (Not sure if the "Breaking change" label should be used for this or this is only for user-visible changes.)
master
branchFiles changed
in the Github PR explorer