NC | NSFS | Wrap with `try-catch` Prometheus Reporting `start_server` #8559

shirady · 2024-11-27T12:05:53Z

Explain the changes

The function clusterMetrics might throw an error (with an error message Operation timed out.), in case it fails, there is no try-catch block, and will end with uncaughtException. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed with undefined metrics variable). Also added a json response with the details of the error in case it happens.
I noticed that the call prom_reporting.start_server(metrics_port, true); in fork_utils.js did not have await and added it - had to change the function signature of start_workers to have async and update its JSDoc @returns part, and also separate the call and the condition in endpoint.js.
Add a comment in the endpoint main for developers regarding implementation (running on the main process, running with multiple forks).
Change the handing under the function gather_metrics so we will print the json error object that we returned from the server.
Change the printing of "_create_nsfs_report: nsfs_report" so we can see the object (we used to see: "core.server.analytic_services.prometheus_reporting:: _create_nsfs_report: nsfs_report [object Object]").
Rename a function (we had a minor typo) from nsfs_io_state_handler to nsfs_io_stats_handler.
Change the order inside the function start_server in prometheus_reporting.js and set the if (req.url === '/metrics/nsfs_stats') { because it doesn't use metrics.

Issues:

As a part of the investigation issue DFBUGS-907 in version 5.15 we saw that the noobaa service exited and restarted with a PANIC error:

[ERROR] CONSOLE:: PANIC: process uncaughtException Error: Operation timed out. at Timeout._onTimeout
(/usr/local/noobaa-core/node_modules/prom-client/lib/cluster.js:59:18) at listOnTimeout (node:internal/timers:573:17) at process.processTimers (node:internal/timers:514:7)

The node is configured with forks. the error comes from cluster.js line 59: const err = new Error('Operation timed out.'); which is part of function clusterMetrics. clusterMetrics is called only in prometheus_reporting.js in start_server if the condition fork_enabled is true. prom_reporting.start_server is called from the endpoint with fork_enabled false (so it's not relevant), and in the fork_utils.js with fork_enabled true.

GAP - I'm not sure in which cases a node might face this PANIC error (before this suggested change) - this was the first time I saw that I already run long tests on GPFS machine with forks... we have opened an issue in prom-client (see comment).

Testing Instructions:

To make sure that we didn't harm the metrics in NC please run:

sudo node src/cmd/nsfs --debug 5 --forks 2 start the NSFS server with multiple forks.
sudo node src/cmd/manage_nsfs diagnose metrics see that you get an output (includes 'MetricsStatus', see example here).
The error was due to timeout between the forks, to see the thrown error you can do one of the options:
(1) (quick) we can add an error Throw new Error (SDSD error) after metrics = await aggregatorRegistry.clusterMetrics(); and restart the server (ctrl + c and rerun it).
(2) (demonstrates the timeout) add a sync wait in the code that would create a delay longer than 5 milliseconds: in fork_utils under nsfs_io_stats_handler add this part:

    const ms = 6000;
    const start = Date.now();
    let now = start;
    while (now - start < ms) {
        now = Date.now();
    }

and restart the server (ctrl + c and rerun it).
4. Fetch the response with sudo node src/cmd/manage_nsfs diagnose metricsand also from another tab withcurl -s http://127.0.0.1:7004/metrics/nsfs_stats | jq .`, expect to see:

{
"error": "Internal server error - timeout",
"message": "Looks like the server is taking a long time to respond (Could not get the metrics)"
}

Doc added/updated
Tests added

src/endpoint/endpoint.js

src/cmd/nsfs.js

src/util/fork_utils.js

src/endpoint/endpoint.js

src/server/analytic_services/prometheus_reporting.js

@returns

1. The function clusterMetrics might throw an error (with an error message Operation timed out.), in case it fails, there is no try-catch block, and will end with uncaughtException. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed with undefined metrics variable). Also added a json response with the details of the error in case it happens. 2. I noticed that the call prom_reporting.start_server(metrics_port, true); in fork_utils.js did not have await and added it - had to change the function signature of start_workers to have async and update its JSDoc @returns part, and also separate the call and the condition in endpoint.js. 3. Add a comment in the endpoint main for developers regarding implementation (running on the main process, running with multiple forks). 4. Change the handing under the function gather_metrics so we will print the json error object that we returned from the server. 5. Change the printing of "_create_nsfs_report: nsfs_report" so we can see the object (we used to see: "core.server.analytic_services.prometheus_reporting:: _create_nsfs_report: nsfs_report [object Object]"). 6. Rename a function (we had a minor typo) from nsfs_io_state_handler to nsfs_io_stats_handler. 7. Change the order inside the function start_server in prometheus_reporting.js and set the if (req.url === '/metrics/nsfs_stats') { because it doesn't use metrics. Signed-off-by: shirady <[email protected]>

pull-request-size bot added the size/S label Nov 27, 2024

shirady requested review from nadavMiz, romayalon and naveenpaul1 November 27, 2024 12:37

shirady self-assigned this Nov 27, 2024

naveenpaul1 reviewed Nov 27, 2024

View reviewed changes

src/endpoint/endpoint.js Outdated Show resolved Hide resolved

shirady requested a review from naveenpaul1 November 27, 2024 13:43

naveenpaul1 approved these changes Nov 28, 2024

View reviewed changes

shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch 2 times, most recently from 14f0627 to 7c82e83 Compare November 28, 2024 06:41

romayalon requested changes Nov 28, 2024

View reviewed changes

src/cmd/nsfs.js Outdated Show resolved Hide resolved

src/util/fork_utils.js Show resolved Hide resolved

shirady changed the title ~~NC | NSFS | Disable Prometheus Reporting~~ NC | NSFS | Wrap with try-catch Prometheus Reporting start_server Nov 28, 2024

shirady requested a review from romayalon November 28, 2024 10:04

romayalon reviewed Nov 28, 2024

View reviewed changes

src/endpoint/endpoint.js Outdated Show resolved Hide resolved

romayalon reviewed Nov 28, 2024

View reviewed changes

src/server/analytic_services/prometheus_reporting.js Show resolved Hide resolved

shirady mentioned this pull request Nov 28, 2024

NC | NSFS | Add Events #8547

Open

pull-request-size bot added size/M and removed size/S labels Dec 1, 2024

shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch 2 times, most recently from 8ad7342 to 28ec0ea Compare December 1, 2024 08:38

shirady requested review from naveenpaul1 and romayalon December 1, 2024 08:47

shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch from 28ec0ea to 9ec2dd9 Compare December 2, 2024 16:32

romayalon approved these changes Dec 3, 2024

View reviewed changes

shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch from 9ec2dd9 to 09c4c45 Compare December 3, 2024 12:48

shirady merged commit c03e592 into noobaa:master Dec 4, 2024
10 checks passed

shirady deleted the nsfs-nc-disable-Prometheus-server-start branch December 4, 2024 06:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NC | NSFS | Wrap with `try-catch` Prometheus Reporting `start_server` #8559

NC | NSFS | Wrap with `try-catch` Prometheus Reporting `start_server` #8559

shirady commented Nov 27, 2024 •

edited

Loading

NC | NSFS | Wrap with try-catch Prometheus Reporting start_server #8559

NC | NSFS | Wrap with try-catch Prometheus Reporting start_server #8559

Conversation

shirady commented Nov 27, 2024 • edited Loading

Explain the changes

Issues:

Testing Instructions:

NC | NSFS | Wrap with `try-catch` Prometheus Reporting `start_server` #8559

NC | NSFS | Wrap with `try-catch` Prometheus Reporting `start_server` #8559

shirady commented Nov 27, 2024 •

edited

Loading