feat(instrumentation-runtime): initial version #1689

klippx · 2023-09-19T13:50:26Z

Resurrecting the closed #1272

Which problem is this PR solving?

Prepares a module for runtime instrumentation, runtime metrics.

Short description of the changes

Original description from @mentos1386 :

I have created a new module for runtime instrumentation and added initial example of "event loop lag" metrics.
I would refrain from adding more metrics in this PR to not block it for too long and add new metrics in subsequent PR.
The "event loop lag" metrics are inspired by https://github.com/marcbachmann/opentelemetry-node-metrics which is inspired by https://github.com/siimon/prom-client.

I have in addition to his PR also...

updated it according to code review comments by @Ugzuzg
added event loop utilization metrics

codecov · 2023-09-19T14:37:20Z

Codecov Report

Merging #1689 (0b7aedc) into main (a8c225d) will increase coverage by 0.08%.
The diff coverage is 96.96%.

❗ Current head 0b7aedc differs from pull request most recent head 8b7966f. Consider uploading reports for the commit 8b7966f to get more accurate results

@@            Coverage Diff             @@
##             main    #1689      +/-   ##
==========================================
+ Coverage   91.72%   91.80%   +0.08%     
==========================================
  Files         139      141       +2     
  Lines        7128     7191      +63     
  Branches     1433     1435       +2     
==========================================
+ Hits         6538     6602      +64     
+ Misses        590      589       -1

Files	Coverage Δ
...nstrumentation-runtime/src/enums/AttributeNames.ts	`100.00% <100.00%> (ø)`
...ode/instrumentation-runtime/src/instrumentation.ts	`96.29% <96.29%> (ø)`

... and 7 files with indirect coverage changes

plugins/node/opentelemetry-instrumentation-runtime/src/instrumentation.ts

Ugzuzg · 2023-09-20T14:56:41Z

Another thing you could do is rename the directory:

The directory name opentelemetry-instrumentation-x is deprecated and we tend to add new instrumentation to a directory named instrumentation-x. I think it will be best if you can rename the folder to instrumentation-cucumber instead of opentelemetry-instrumentation-cucumber
-- #1252 (review)

plugins/node/instrumentation-runtime/package.json

blumamir · 2023-09-24T05:06:09Z

plugins/node/instrumentation-runtime/src/types.ts

+  /** Function for adding custom metric attributes on all recorded metrics */
+  customMetricAttributes?: () => Attributes;


This config option can be added to README as well under ### Runtime Instrumentation Options. probably forgotten :)

blumamir · 2023-09-24T05:10:33Z

plugins/node/instrumentation-runtime/src/index.ts

+ */
+
+export * from './instrumentation';
+export * from './types';


Should we also export ./enums/AttributeNames? to make it available to potential consumers as part of the public API?

plugins/node/instrumentation-runtime/src/instrumentation.ts

blumamir · 2023-09-24T05:43:12Z

plugins/node/instrumentation-runtime/src/types.ts

+   * @default 10
+   */
+  monitorEventLoopDelayResolution?: number;
+  /** Function for adding custom metric attributes on all recorded metrics */


I am not familiar with metrics practices. was wondering what kind of attributes you think are useful here?

Everyone has their own use case, some people have expressed interest here: open-telemetry/opentelemetry-js#4088

For me, I need to add pod role (stable / canary) which is not known at boot time of the app, it needs to come during request phase when app is already started, as this role can change during lifecycle of the app (a canary can be promoted to stable) which means this cannot be done as environment variable or in prometheus exporter.

blumamir · 2023-09-24T05:45:04Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+    this.eventLoopDelayHistogram = monitorEventLoopDelay({
+      resolution: config.monitorEventLoopDelayResolution,
+    });
+    this.eventLoopDelayHistogram.enable();


This is done in the constructor, thus will always do work even if the instrumentation is installed but disabled (as might be the case with the auto-instrumentation package.

I wonder if it might be more appropriate to use the InstrumentationBase functions:

enable(): void; disable(): void;

to start and stop data recording

Context: I am personally not a fan of the auto-instrumentation package and have no desire of adding this plugin there. What I am a fan of is that people hand pick the plugins that make sense to their app.

And if they pick them, they should be active. I dont want them to pass in a "disabled" or "enabled" flag. That's why current implementation makes sense to me: You add it, you use it. And that is how it works now.

So my question is with this enabled / disabled API, how does it work? You add a plugin and somehow have to "enable" it as well, or pass in an "enabled" flag? 🤷 It makes me unsure how it would work if I made this change.

Can I suggest that if someone want to add this plugin to "auto instrumentation" that they also do any update required in this plugin? I would rather skip it unless it also makes sense for this package in isolation in this initial version.

I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()/disable(), the registerInstrumentations() function also uses them - this function returns a function to call disable() all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.

Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay* metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔

blumamir · 2023-09-24T05:52:59Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+
+        observable.observe(
+          eventLoopDelayMinGauge,
+          this.eventLoopDelayHistogram.min / 1e6,


nit: perhaps introduce a nanoSecondToMilliSecond function or constant to make the conversion units explicit?

Sure, why not.

blumamir · 2023-09-24T06:02:28Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+
+  private _getCustomMetricAttributes(): Attributes {
+    return this.config.customMetricAttributes
+      ? this.config.customMetricAttributes()


It is a common practice for instrumentations in this repo to assume that the user-provided hook might throw, and protect against it with try/catch and proper diag message.

While not very likely to happen, it could be nice to keep consistency with other packages in this repo

Definitely!

plugins/node/instrumentation-runtime/src/enums/AttributeNames.ts

plugins/node/instrumentation-runtime/src/instrumentation.ts

… metric

…to instrumentation

klippx · 2023-10-11T17:09:40Z

@pichlermarc @blumamir Are we happy? 🤔

d4nyll · 2023-10-16T10:46:58Z

eventLoopDelayHistogram.max will show the maximum event loop delay for the entire duration of the Node.js process. When there's a large 'spike' in the event loop delay, eventLoopDelayHistogram.max will increase, but subsequent spikes in delay won't show up on this metric unless it is even slower than all previous spikes.

Similarly, mean/standard deviation/PXX values are more and more likely to stay the same after the process has been running for a while and will be less useful to monitor the longer the process has been running.

@klippx What are your thoughts on adding an extra option to this instrumentation that, if set to true, will run histogram.reset() on every collection (inside addBatchObservableCallback), such that the exported statistics are only for the time since the last collection?

pichlermarc

A few more thoughts, I think this may be a bit trickier than I initially thought. 🤔

pichlermarc · 2023-10-13T06:15:13Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+    this.eventLoopDelayHistogram = monitorEventLoopDelay({
+      resolution: config.monitorEventLoopDelayResolution,
+    });
+    this.eventLoopDelayHistogram.enable();


I think adding it would be beneficial - while the auto-instrumentation package is something that uses enable()/disable(), the registerInstrumentations() function also uses them - this function returns a function to call disable() all instrumentations registered with it, which when called is supposed to stop instrumentations from emitting telemetry.

pichlermarc · 2023-10-13T06:27:43Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+      GaugeNames.NODE_EVENT_LOOP_UTILIZATION,
+      {
+        description: 'The percentage utilization of the event loop.',
+        unit: 'percent',


according to https://github.com/open-telemetry/semantic-conventions/blob/ea50a0d2e13f215047b897133e29fd3efad093b6/docs/general/metrics.md?plain=1#L199-L200 we need to do

Suggested change

unit: 'percent',

unit: '%',

but it's more likely we'll need to

Suggested change

unit: 'percent',

unit: 1,

and keep the value range [0.0,1.0] as returned by ELU.utilization

pichlermarc · 2023-10-13T06:47:14Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+          ELU.utilization * 100,
+          attributes
+        );
+        observable.observe(loopIdleGauge, ELU.idle, attributes);


I wonder how useful adding idle time here actually is; consider the following:

metrics are observed when the MetricReader collects the data

the interval in which the MetricReader collects that data may not be fixed or the same across all apps that report this metric

IIUC this would mean that if idle time is reported, it will not be comparable between processes (as the interval may be different) and won't be comparable form one data-point to the next as the the interval may vary there as well.
If my assumption holds, then this also holds for active-time.

pichlermarc · 2023-10-13T06:50:14Z

plugins/node/instrumentation-runtime/src/enums/GaugeNames.ts

+  NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay',
+  NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min',
+  NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max',
+  NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean',
+  NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev',
+  NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50',
+  NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95',
+  NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99',
+  NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization',
+  NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle',
+  NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active',


From what I've seen in the semconv, most of the time dots are preferred (edit: not sure if there's a rule, though)

Suggested change

NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop_delay',

NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop_delay.min',

NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop_delay.max',

NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop_delay.mean',

NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop_delay.stddev',

NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop_delay.p50',

NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop_delay.p95',

NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop_delay.p99',

NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop_utilization',

NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop_utilization.idle',

NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop_utilization.active',

NODE_EVENT_LOOP_DELAY = 'nodejs.event_loop.delay',

NODE_EVENT_LOOP_DELAY_MIN = 'nodejs.event_loop.delay.min',

NODE_EVENT_LOOP_DELAY_MAX = 'nodejs.event_loop.delay.max',

NODE_EVENT_LOOP_DELAY_MEAN = 'nodejs.event_loop.delay.mean',

NODE_EVENT_LOOP_DELAY_STDDEV = 'nodejs.event_loop.delay.stddev',

NODE_EVENT_LOOP_DELAY_P50 = 'nodejs.event_loop.delay.p50',

NODE_EVENT_LOOP_DELAY_P95 = 'nodejs.event_loop.delay.p95',

NODE_EVENT_LOOP_DELAY_P99 = 'nodejs.event_loop.delay.p99',

NODE_EVENT_LOOP_UTILIZATION = 'nodejs.event_loop.utilization',

NODE_EVENT_LOOP_UTILIZATION_IDLE = 'nodejs.event_loop.utilization.idle',

NODE_EVENT_LOOP_UTILIZATION_ACTIVE = 'nodejs.event_loop.utilization.active',

pichlermarc · 2023-10-13T06:54:30Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+    this.eventLoopDelayHistogram = monitorEventLoopDelay({
+      resolution: config.monitorEventLoopDelayResolution,
+    });
+    this.eventLoopDelayHistogram.enable();


Some further thoughts: I'm really worried about users depending on nodejs.event_loop_delay* metrics in their dashboards, as they don't really conform to how OTel is supposed to work. I guess in the long run, if these are metrics that we want exported, I don't see any way around instrumenting the native Node.js code (maybe using OTel C++?) and exporting proper (exponential) histograms from there. I wonder if there's something we could do to make people aware that these may not stick around (like making them opt-in via configuration). 🤔

pichlermarc · 2023-10-16T12:26:44Z

plugins/node/instrumentation-runtime/src/instrumentation.ts

+      GaugeNames.NODE_EVENT_LOOP_UTILIZATION_IDLE,
+      {
+        description: 'The idle time utilization of event loop.',
+        unit: 'ms',


I think we may need to use seconds for this instead: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md#instrument-units

klippx · 2023-11-07T14:18:14Z

I have been thinking about this, and I dont really think anyone will enjoy this plugin too much. In order to be useful, we want a plugin that can make use of a real opentelemetry bucket.

d4nyll · 2023-11-11T15:50:31Z

@klippx I think a few people will find this instrumentation useful. Event loop delay/utilization can supplement CPU utilization for autoscaling, for example.

@pichlermarc would there still be interest in merging this PR in once the comments are addressed? If so, I'd be interested in seeing this to completion.

exporting proper (exponential) histograms

One of the benefits of just relaying the perf_hooks is that you get that for free without having to introduce additional overhead in capturing raw measurements and dynamically generating 'custom' histograms. Perhaps we should make it explicit to users that they should not view these nodejs.event_loop.delay* metrics as histograms but as what they are - gauges that simply relay what Node.js exposes.

idle time / active time

I also think they won't be of much use since their individual values are not comparable between scrapes. The only reason I can think of of exposing them is to calculate the % utilization, so we can just expose the nodejs.event_loop.utilization metric only.

pichlermarc · 2023-11-13T10:47:43Z

@pichlermarc would there still be interest in merging this PR in once the comments are addressed? If so, I'd be interested in seeing this to completion.

I think it can be very helpful. I'd perefer a PR that just starts with nodejs.event_loop.utilization in the new package, and then we can add the other metrics in a follow-up - I'd expect it to be way lower friction this way. Just having the utilization gauge should be uncontroversial 🙂

exporting proper (exponential) histograms

One of the benefits of just relaying the perf_hooks is that you get that for free without having to introduce additional overhead in capturing raw measurements and dynamically generating 'custom' histograms. Perhaps we should make it explicit to users that they should not view these nodejs.event_loop.delay* metrics as histograms but as what they are - gauges that simply relay what Node.js exposes.

I agree that this is better in terms of performance. I'm not completely opposed to having them, I'd just rather have these metrics disabled by default (opt-in) as opposed to on-by-default. 🙂

d4nyll · 2023-11-23T21:23:36Z

@pichlermarc I'll cook up something with just nodejs.event_loop.utilization. I should have some time to work on it in December 🤞

d4nyll · 2024-01-16T12:15:21Z

@pichlermarc FYI, I've created a new PR (#1902) with just nodejs.event_loop.utilization

klippx requested a review from a team September 19, 2023 13:50

klippx force-pushed the feat/node-event-loop-lag branch from baac6e1 to bfebf77 Compare September 19, 2023 15:40

klippx commented Sep 19, 2023

View reviewed changes

plugins/node/opentelemetry-instrumentation-runtime/src/instrumentation.ts Outdated Show resolved Hide resolved

klippx requested a review from Ugzuzg September 20, 2023 14:47

github-actions bot assigned klippx Sep 21, 2023

klippx changed the title ~~feat(opentelemetry-instrumentation-runtime): initial version~~ feat(instrumentation-runtime): initial version Sep 21, 2023

blumamir reviewed Sep 25, 2023

View reviewed changes

pichlermarc reviewed Sep 27, 2023

View reviewed changes

klippx force-pushed the feat/node-event-loop-lag branch 2 times, most recently from 916917a to d5099b3 Compare October 3, 2023 15:28

klippx requested review from blumamir and pichlermarc October 3, 2023 15:29

mentos1386 and others added 15 commits October 11, 2023 13:09

feat(opentelemetry-instrumentation-runtime): initial version

5481293

chore: update release-please-config

d369bec

test(opentelemetry-instrumentation-runtime): basic test

753c934

Use _updateMetricInstruments API to bootstrap instrument

2bb666b

Add event loop utilization metrics

db89cdc

Add lint config and fix lint issue

291ef8a

Update README

6dc55cb

Update default values, and add descriptions to them

e7a9370

Code review changes

2ddd879

Update package.json to be in line with other node/plugins

e4d4612

Add customMetricAttributes hook to tweak what attributes are added to…

530f43a

… metric

Rename directory

bd1e529

Add to component owners

981480c

Update README to remove sections that this package is part of node au…

2110967

…to instrumentation

Code review

8b7966f

klippx force-pushed the feat/node-event-loop-lag branch from d5099b3 to 8b7966f Compare October 11, 2023 17:09

pichlermarc reviewed Oct 16, 2023

View reviewed changes

klippx closed this Nov 7, 2023

d4nyll mentioned this pull request Jan 16, 2024

feat(perf_hooks): add node instrumentation perf_hooks #1902

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(instrumentation-runtime): initial version #1689

feat(instrumentation-runtime): initial version #1689

klippx commented Sep 19, 2023

codecov bot commented Sep 19, 2023 •

edited

Loading

Ugzuzg commented Sep 20, 2023 •

edited

Loading

blumamir Sep 24, 2023

blumamir Sep 24, 2023

blumamir Sep 24, 2023

klippx Oct 1, 2023

blumamir Sep 24, 2023

klippx Oct 1, 2023

pichlermarc Oct 13, 2023

pichlermarc Oct 13, 2023

blumamir Sep 24, 2023

klippx Oct 1, 2023

blumamir Sep 24, 2023

klippx Oct 1, 2023

klippx commented Oct 11, 2023

d4nyll commented Oct 16, 2023 •

edited

Loading

pichlermarc left a comment

pichlermarc Oct 13, 2023

pichlermarc Oct 13, 2023

pichlermarc Oct 13, 2023

pichlermarc Oct 13, 2023 •

edited

Loading

pichlermarc Oct 13, 2023

pichlermarc Oct 16, 2023

klippx commented Nov 7, 2023

d4nyll commented Nov 11, 2023

pichlermarc commented Nov 13, 2023

d4nyll commented Nov 23, 2023

d4nyll commented Jan 16, 2024

		/** Function for adding custom metric attributes on all recorded metrics */
		customMetricAttributes?: () => Attributes;

feat(instrumentation-runtime): initial version #1689

feat(instrumentation-runtime): initial version #1689

Conversation

klippx commented Sep 19, 2023

Which problem is this PR solving?

Short description of the changes

codecov bot commented Sep 19, 2023 • edited Loading

Codecov Report

Ugzuzg commented Sep 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klippx commented Oct 11, 2023

d4nyll commented Oct 16, 2023 • edited Loading

pichlermarc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pichlermarc Oct 13, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

klippx commented Nov 7, 2023

d4nyll commented Nov 11, 2023

pichlermarc commented Nov 13, 2023

d4nyll commented Nov 23, 2023

d4nyll commented Jan 16, 2024

codecov bot commented Sep 19, 2023 •

edited

Loading

Ugzuzg commented Sep 20, 2023 •

edited

Loading

d4nyll commented Oct 16, 2023 •

edited

Loading

pichlermarc Oct 13, 2023 •

edited

Loading