Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(instrumentation-runtime): initial version #1689

Closed
wants to merge 15 commits into from

Conversation

klippx
Copy link

@klippx klippx commented Sep 19, 2023

Resurrecting the closed #1272

Which problem is this PR solving?

Prepares a module for runtime instrumentation, runtime metrics.

#1106

Short description of the changes

Original description from @mentos1386 :

I have created a new module for runtime instrumentation and added initial example of "event loop lag" metrics.
I would refrain from adding more metrics in this PR to not block it for too long and add new metrics in subsequent PR.
The "event loop lag" metrics are inspired by https://github.com/marcbachmann/opentelemetry-node-metrics which is inspired by https://github.com/siimon/prom-client.

I have in addition to his PR also...

  • updated it according to code review comments by @Ugzuzg
  • added event loop utilization metrics

@klippx klippx requested a review from a team September 19, 2023 13:50
@codecov
Copy link

codecov bot commented Sep 19, 2023

Codecov Report

Merging #1689 (0b7aedc) into main (a8c225d) will increase coverage by 0.08%.
The diff coverage is 96.96%.

❗ Current head 0b7aedc differs from pull request most recent head 8b7966f. Consider uploading reports for the commit 8b7966f to get more accurate results

@@            Coverage Diff             @@
##             main    #1689      +/-   ##
==========================================
+ Coverage   91.72%   91.80%   +0.08%     
==========================================
  Files         139      141       +2     
  Lines        7128     7191      +63     
  Branches     1433     1435       +2     
==========================================
+ Hits         6538     6602      +64     
+ Misses        590      589       -1     
Files Coverage Δ
...nstrumentation-runtime/src/enums/AttributeNames.ts 100.00% <100.00%> (ø)
...ode/instrumentation-runtime/src/instrumentation.ts 96.29% <96.29%> (ø)

... and 7 files with indirect coverage changes

@klippx klippx force-pushed the feat/node-event-loop-lag branch from baac6e1 to bfebf77 Compare September 19, 2023 15:40
@klippx klippx requested a review from Ugzuzg September 20, 2023 14:47
@Ugzuzg
Copy link
Contributor

Ugzuzg commented Sep 20, 2023

Another thing you could do is rename the directory:

The directory name opentelemetry-instrumentation-x is deprecated and we tend to add new instrumentation to a directory named instrumentation-x. I think it will be best if you can rename the folder to instrumentation-cucumber instead of opentelemetry-instrumentation-cucumber
-- #1252 (review)

@klippx klippx changed the title feat(opentelemetry-instrumentation-runtime): initial version feat(instrumentation-runtime): initial version Sep 21, 2023
plugins/node/instrumentation-runtime/package.json Outdated Show resolved Hide resolved
Comment on lines +27 to +28
/** Function for adding custom metric attributes on all recorded metrics */
customMetricAttributes?: () => Attributes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config option can be added to README as well under ### Runtime Instrumentation Options. probably forgotten :)

*/

export * from './instrumentation';
export * from './types';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also export ./enums/AttributeNames? to make it available to potential consumers as part of the public API?

* @default 10
*/
monitorEventLoopDelayResolution?: number;
/** Function for adding custom metric attributes on all recorded metrics */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with metrics practices. was wondering what kind of attributes you think are useful here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everyone has their own use case, some people have expressed interest here: open-telemetry/opentelemetry-js#4088

For me, I need to add pod role (stable / canary) which is not known at boot time of the app, it needs to come during request phase when app is already started, as this role can change during lifecycle of the app (a canary can be promoted to stable) which means this cannot be done as environment variable or in prometheus exporter.

this.eventLoopDelayHistogram = monitorEventLoopDelay({
resolution: config.monitorEventLoopDelayResolution,
});
this.eventLoopDelayHistogram.enable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in the constructor, thus will always do work even if the instrumentation is installed but disabled (as might be the case with the auto-instrumentation package.

I wonder if it might be more appropriate to use the InstrumentationBase functions:

    enable(): void;
    disable(): void;

to start and stop data recording

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context: I am personally not a fan of the auto-instrumentation package and have no desire of adding this plugin there. What I am a fan of is that people hand pick the plugins that make sense to their app.

And if they pick them, they should be active. I dont want them to pass in a "disabled" or "enabled" flag. That's why current implementation makes sense to me: You add it, you use it. And that is how it works now.

So my question is with this enabled / disabled API, how does it work? You add a plugin and somehow have to "enable" it as well, or pass in an "enabled" flag? 🤷 It makes me unsure how it would work if I made this change.

Can I suggest that if someone want to add this plugin to "auto instrumentation" that they also do any update required in this plugin? I would rather skip it unless it also makes sense for this package in isolation in this initial version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()/disable(), the registerInstrumentations() function also uses them - this function returns a function to call disable() all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay* metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔


observable.observe(
eventLoopDelayMinGauge,
this.eventLoopDelayHistogram.min / 1e6,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps introduce a nanoSecondToMilliSecond function or constant to make the conversion units explicit?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, why not.


private _getCustomMetricAttributes(): Attributes {
return this.config.customMetricAttributes
? this.config.customMetricAttributes()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a common practice for instrumentations in this repo to assume that the user-provided hook might throw, and protect against it with try/catch and proper diag message.

While not very likely to happen, it could be nice to keep consistency with other packages in this repo

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely!

@klippx klippx force-pushed the feat/node-event-loop-lag branch 2 times, most recently from 916917a to d5099b3 Compare October 3, 2023 15:28
@klippx klippx requested review from blumamir and pichlermarc October 3, 2023 15:29
@klippx klippx force-pushed the feat/node-event-loop-lag branch from d5099b3 to 8b7966f Compare October 11, 2023 17:09
@klippx
Copy link
Author

klippx commented Oct 11, 2023

@pichlermarc @blumamir Are we happy? 🤔

@d4nyll
Copy link
Contributor

d4nyll commented Oct 16, 2023

eventLoopDelayHistogram.max will show the maximum event loop delay for the entire duration of the Node.js process. When there's a large 'spike' in the event loop delay, eventLoopDelayHistogram.max will increase, but subsequent spikes in delay won't show up on this metric unless it is even slower than all previous spikes.

Similarly, mean/standard deviation/PXX values are more and more likely to stay the same after the process has been running for a while and will be less useful to monitor the longer the process has been running.

@klippx What are your thoughts on adding an extra option to this instrumentation that, if set to true, will run histogram.reset() on every collection (inside addBatchObservableCallback), such that the exported statistics are only for the time since the last collection?

Copy link
Member

@pichlermarc pichlermarc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more thoughts, I think this may be a bit trickier than I initially thought. 🤔

this.eventLoopDelayHistogram = monitorEventLoopDelay({
resolution: config.monitorEventLoopDelayResolution,
});
this.eventLoopDelayHistogram.enable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()/disable(), the registerInstrumentations() function also uses them - this function returns a function to call disable() all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.

GaugeNames.NODE_EVENT_LOOP_UTILIZATION,
{
description: 'The percentage utilization of the event loop.',
unit: 'percent',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to https://github.com/open-telemetry/semantic-conventions/blob/ea50a0d2e13f215047b897133e29fd3efad093b6/docs/general/metrics.md?plain=1#L199-L200 we need to do

Suggested change
unit: 'percent',
unit: '%',

but it's more likely we'll need to

Suggested change
unit: 'percent',
unit: 1,

and keep the value range [0.0,1.0] as returned by ELU.utilization

ELU.utilization * 100,
attributes
);
observable.observe(loopIdleGauge, ELU.idle, attributes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how useful adding idle time here actually is; consider the following:

  • metrics are observed when the MetricReader collects the data
  • the interval in which the MetricReader collects that data may not be fixed or the same across all apps that report this metric

IIUC this would mean that if idle time is reported, it will not be comparable between processes (as the interval may be different) and won't be comparable form one data-point to the next as the the interval may vary there as well.
If my assumption holds, then this also holds for active-time.

Comment on lines +21 to +31
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay',
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min',
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max',
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean',
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev',
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50',
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95',
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99',
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization',
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle',
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active',
Copy link
Member

@pichlermarc pichlermarc Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen in the semconv, most of the time dots are preferred (edit: not sure if there's a rule, though)

Suggested change
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay',
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min',
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max',
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean',
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev',
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50',
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95',
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99',
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization',
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle',
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active',
NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop.delay',
NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop.delay.min',
NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop.delay.max',
NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop.delay.mean',
NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop.delay.stddev',
NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop.delay.p50',
NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop.delay.p95',
NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop.delay.p99',
NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop.utilization',
NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop.utilization.idle',
NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop.utilization.active',

this.eventLoopDelayHistogram = monitorEventLoopDelay({
resolution: config.monitorEventLoopDelayResolution,
});
this.eventLoopDelayHistogram.enable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay* metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔

GaugeNames.NODE_EVENT_LOOP_UTILIZATION_IDLE,
{
description: 'The idle time utilization of event loop.',
unit: 'ms',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klippx
Copy link
Author

klippx commented Nov 7, 2023

I have been thinking about this, and I dont really think anyone will enjoy this plugin too much. In order to be useful, we want a plugin that can make use of a real opentelemetry bucket.

@klippx klippx closed this Nov 7, 2023
@d4nyll
Copy link
Contributor

d4nyll commented Nov 11, 2023

@klippx I think a few people will find this instrumentation useful. Event loop delay/utilization can supplement CPU utilization for autoscaling, for example.

@pichlermarc would there still be interest in merging this PR in once the comments are addressed? If so, I'd be interested in seeing this to completion.

exporting proper (exponential) histograms

One of the benefits of just relaying the perf_hooks is that you get that for free without having to introduce additional overhead in capturing raw measurements and dynamically generating 'custom' histograms. Perhaps we should make it explicit to users that they should not view these nodejs.event_loop.delay* metrics as histograms but as what they are - gauges that simply relay what Node.js exposes.

idle time / active time

I also think they won't be of much use since their individual values are not comparable between scrapes. The only reason I can think of of exposing them is to calculate the % utilization, so we can just expose the nodejs.event_loop.utilization metric only.

@pichlermarc
Copy link
Member

@pichlermarc would there still be interest in merging this PR in once the comments are addressed? If so, I'd be interested in seeing this to completion.

I think it can be very helpful. I'd perefer a PR that just starts with nodejs.event_loop.utilization in the new package, and then we can add the other metrics in a follow-up - I'd expect it to be way lower friction this way. Just having the utilization gauge should be uncontroversial 🙂

exporting proper (exponential) histograms

One of the benefits of just relaying the perf_hooks is that you get that for free without having to introduce additional overhead in capturing raw measurements and dynamically generating 'custom' histograms. Perhaps we should make it explicit to users that they should not view these nodejs.event_loop.delay* metrics as histograms but as what they are - gauges that simply relay what Node.js exposes.

I agree that this is better in terms of performance. I'm not completely opposed to having them, I'd just rather have these metrics disabled by default (opt-in) as opposed to on-by-default. 🙂

@d4nyll
Copy link
Contributor

d4nyll commented Nov 23, 2023

@pichlermarc I'll cook up something with just nodejs.event_loop.utilization. I should have some time to work on it in December 🤞

@d4nyll
Copy link
Contributor

d4nyll commented Jan 16, 2024

@pichlermarc FYI, I've created a new PR (#1902) with just nodejs.event_loop.utilization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants