-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(instrumentation-runtime): initial version #1689
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1689 +/- ##
==========================================
+ Coverage 91.72% 91.80% +0.08%
==========================================
Files 139 141 +2
Lines 7128 7191 +63
Branches 1433 1435 +2
==========================================
+ Hits 6538 6602 +64
+ Misses 590 589 -1
|
baac6e1
to
bfebf77
Compare
plugins/node/opentelemetry-instrumentation-runtime/src/instrumentation.ts
Outdated
Show resolved
Hide resolved
Another thing you could do is rename the directory:
|
/** Function for adding custom metric attributes on all recorded metrics */ | ||
customMetricAttributes?: () => Attributes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config option can be added to README as well under ### Runtime Instrumentation Options
. probably forgotten :)
*/ | ||
|
||
export * from './instrumentation'; | ||
export * from './types'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also export ./enums/AttributeNames
? to make it available to potential consumers as part of the public API?
* @default 10 | ||
*/ | ||
monitorEventLoopDelayResolution?: number; | ||
/** Function for adding custom metric attributes on all recorded metrics */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with metrics practices. was wondering what kind of attributes you think are useful here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everyone has their own use case, some people have expressed interest here: open-telemetry/opentelemetry-js#4088
For me, I need to add pod role (stable / canary) which is not known at boot time of the app, it needs to come during request phase when app is already started, as this role can change during lifecycle of the app (a canary can be promoted to stable) which means this cannot be done as environment variable or in prometheus exporter.
this.eventLoopDelayHistogram = monitorEventLoopDelay({ | ||
resolution: config.monitorEventLoopDelayResolution, | ||
}); | ||
this.eventLoopDelayHistogram.enable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done in the constructor, thus will always do work even if the instrumentation is installed but disabled (as might be the case with the auto-instrumentation
package.
I wonder if it might be more appropriate to use the InstrumentationBase
functions:
enable(): void;
disable(): void;
to start and stop data recording
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Context: I am personally not a fan of the auto-instrumentation
package and have no desire of adding this plugin there. What I am a fan of is that people hand pick the plugins that make sense to their app.
And if they pick them, they should be active. I dont want them to pass in a "disabled" or "enabled" flag. That's why current implementation makes sense to me: You add it, you use it. And that is how it works now.
So my question is with this enabled
/ disabled
API, how does it work? You add a plugin and somehow have to "enable" it as well, or pass in an "enabled" flag? 🤷 It makes me unsure how it would work if I made this change.
Can I suggest that if someone want to add this plugin to "auto instrumentation" that they also do any update required in this plugin? I would rather skip it unless it also makes sense for this package in isolation in this initial version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()
/disable()
, the registerInstrumentations()
function also uses them - this function returns a function to call disable()
all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay*
metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔
|
||
observable.observe( | ||
eventLoopDelayMinGauge, | ||
this.eventLoopDelayHistogram.min / 1e6, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps introduce a nanoSecondToMilliSecond
function or constant to make the conversion units explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, why not.
|
||
private _getCustomMetricAttributes(): Attributes { | ||
return this.config.customMetricAttributes | ||
? this.config.customMetricAttributes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a common practice for instrumentations in this repo to assume that the user-provided hook might throw, and protect against it with try/catch
and proper diag message.
While not very likely to happen, it could be nice to keep consistency with other packages in this repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely!
plugins/node/instrumentation-runtime/src/enums/AttributeNames.ts
Outdated
Show resolved
Hide resolved
plugins/node/instrumentation-runtime/src/enums/AttributeNames.ts
Outdated
Show resolved
Hide resolved
plugins/node/instrumentation-runtime/src/enums/AttributeNames.ts
Outdated
Show resolved
Hide resolved
916917a
to
d5099b3
Compare
…to instrumentation
d5099b3
to
8b7966f
Compare
@pichlermarc @blumamir Are we happy? 🤔 |
Similarly, mean/standard deviation/PXX values are more and more likely to stay the same after the process has been running for a while and will be less useful to monitor the longer the process has been running. @klippx What are your thoughts on adding an extra option to this instrumentation that, if set to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more thoughts, I think this may be a bit trickier than I initially thought. 🤔
this.eventLoopDelayHistogram = monitorEventLoopDelay({ | ||
resolution: config.monitorEventLoopDelayResolution, | ||
}); | ||
this.eventLoopDelayHistogram.enable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()
/disable()
, the registerInstrumentations()
function also uses them - this function returns a function to call disable()
all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.
GaugeNames.NODE_EVENT_LOOP_UTILIZATION, | ||
{ | ||
description: 'The percentage utilization of the event loop.', | ||
unit: 'percent', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to https://github.com/open-telemetry/semantic-conventions/blob/ea50a0d2e13f215047b897133e29fd3efad093b6/docs/general/metrics.md?plain=1#L199-L200 we need to do
unit: 'percent', | |
unit: '%', |
but it's more likely we'll need to
unit: 'percent', | |
unit: 1, |
and keep the value range [0.0,1.0] as returned by ELU.utilization
ELU.utilization * 100, | ||
attributes | ||
); | ||
observable.observe(loopIdleGauge, ELU.idle, attributes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how useful adding idle time here actually is; consider the following:
- metrics are observed when the
MetricReader
collects the data - the interval in which the
MetricReader
collects that data may not be fixed or the same across all apps that report this metric
IIUC this would mean that if idle time is reported, it will not be comparable between processes (as the interval may be different) and won't be comparable form one data-point to the next as the the interval may vary there as well.
If my assumption holds, then this also holds for active-time.
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay', | ||
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min', | ||
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max', | ||
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean', | ||
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev', | ||
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50', | ||
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95', | ||
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99', | ||
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization', | ||
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle', | ||
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I've seen in the semconv, most of the time dots are preferred (edit: not sure if there's a rule, though)
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay', | |
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min', | |
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max', | |
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean', | |
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev', | |
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50', | |
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95', | |
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99', | |
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization', | |
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle', | |
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active', | |
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop.delay', | |
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop.delay.min', | |
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop.delay.max', | |
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop.delay.mean', | |
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop.delay.stddev', | |
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop.delay.p50', | |
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop.delay.p95', | |
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop.delay.p99', | |
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop.utilization', | |
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop.utilization.idle', | |
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop.utilization.active', |
this.eventLoopDelayHistogram = monitorEventLoopDelay({ | ||
resolution: config.monitorEventLoopDelayResolution, | ||
}); | ||
this.eventLoopDelayHistogram.enable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay*
metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔
GaugeNames.NODE_EVENT_LOOP_UTILIZATION_IDLE, | ||
{ | ||
description: 'The idle time utilization of event loop.', | ||
unit: 'ms', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may need to use seconds for this instead: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md#instrument-units
I have been thinking about this, and I dont really think anyone will enjoy this plugin too much. In order to be useful, we want a plugin that can make use of a real opentelemetry bucket. |
@klippx I think a few people will find this instrumentation useful. Event loop delay/utilization can supplement CPU utilization for autoscaling, for example. @pichlermarc would there still be interest in merging this PR in once the comments are addressed? If so, I'd be interested in seeing this to completion.
One of the benefits of just relaying the
I also think they won't be of much use since their individual values are not comparable between scrapes. The only reason I can think of of exposing them is to calculate the % utilization, so we can just expose the |
I think it can be very helpful. I'd perefer a PR that just starts with
I agree that this is better in terms of performance. I'm not completely opposed to having them, I'd just rather have these metrics disabled by default (opt-in) as opposed to on-by-default. 🙂 |
@pichlermarc I'll cook up something with just |
@pichlermarc FYI, I've created a new PR (#1902) with just |
Resurrecting the closed #1272
Which problem is this PR solving?
Prepares a module for runtime instrumentation, runtime metrics.
#1106
Short description of the changes
Original description from @mentos1386 :
I have in addition to his PR also...