Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NC | NSFS | Wrap with try-catch Prometheus Reporting start_server #8559

Merged

Conversation

shirady
Copy link
Contributor

@shirady shirady commented Nov 27, 2024

Explain the changes

  1. The function clusterMetrics might throw an error (with an error message Operation timed out.), in case it fails, there is no try-catch block, and will end with uncaughtException. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed with undefined metrics variable). Also added a json response with the details of the error in case it happens.
  2. I noticed that the call prom_reporting.start_server(metrics_port, true); in fork_utils.js did not have await and added it - had to change the function signature of start_workers to have async and update its JSDoc @returns part, and also separate the call and the condition in endpoint.js.
  3. Add a comment in the endpoint main for developers regarding implementation (running on the main process, running with multiple forks).
  4. Change the handing under the function gather_metrics so we will print the json error object that we returned from the server.
  5. Change the printing of "_create_nsfs_report: nsfs_report" so we can see the object (we used to see: "core.server.analytic_services.prometheus_reporting:: _create_nsfs_report: nsfs_report [object Object]").
  6. Rename a function (we had a minor typo) from nsfs_io_state_handler to nsfs_io_stats_handler.
  7. Change the order inside the function start_server in prometheus_reporting.js and set the if (req.url === '/metrics/nsfs_stats') { because it doesn't use metrics.

Issues:

  1. As a part of the investigation issue DFBUGS-907 in version 5.15 we saw that the noobaa service exited and restarted with a PANIC error:

[ERROR] CONSOLE:: PANIC: process uncaughtException Error: Operation timed out. at Timeout._onTimeout
(/usr/local/noobaa-core/node_modules/prom-client/lib/cluster.js:59:18) at listOnTimeout (node:internal/timers:573:17) at process.processTimers (node:internal/timers:514:7)

The node is configured with forks. the error comes from cluster.js line 59: const err = new Error('Operation timed out.'); which is part of function clusterMetrics. clusterMetrics is called only in prometheus_reporting.js in start_server if the condition fork_enabled is true. prom_reporting.start_server is called from the endpoint with fork_enabled false (so it's not relevant), and in the fork_utils.js with fork_enabled true.

GAP - I'm not sure in which cases a node might face this PANIC error (before this suggested change) - this was the first time I saw that I already run long tests on GPFS machine with forks... we have opened an issue in prom-client (see comment).

Testing Instructions:

To make sure that we didn't harm the metrics in NC please run:

  1. sudo node src/cmd/nsfs --debug 5 --forks 2 start the NSFS server with multiple forks.
  2. sudo node src/cmd/manage_nsfs diagnose metrics see that you get an output (includes 'MetricsStatus', see example here).
  3. The error was due to timeout between the forks, to see the thrown error you can do one of the options:
    (1) (quick) we can add an error Throw new Error (SDSD error) after metrics = await aggregatorRegistry.clusterMetrics(); and restart the server (ctrl + c and rerun it).
    (2) (demonstrates the timeout) add a sync wait in the code that would create a delay longer than 5 milliseconds: in fork_utils under nsfs_io_stats_handler add this part:
    const ms = 6000;
    const start = Date.now();
    let now = start;
    while (now - start < ms) {
        now = Date.now();
    }

and restart the server (ctrl + c and rerun it).
4. Fetch the response with sudo node src/cmd/manage_nsfs diagnose metricsand also from another tab withcurl -s http://127.0.0.1:7004/metrics/nsfs_stats | jq .`, expect to see:

{
"error": "Internal server error - timeout",
"message": "Looks like the server is taking a long time to respond (Could not get the metrics)"
}

  • Doc added/updated
  • Tests added

@shirady shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch 2 times, most recently from 14f0627 to 7c82e83 Compare November 28, 2024 06:41
src/cmd/nsfs.js Outdated Show resolved Hide resolved
src/util/fork_utils.js Show resolved Hide resolved
@shirady shirady changed the title NC | NSFS | Disable Prometheus Reporting NC | NSFS | Wrap with try-catch Prometheus Reporting start_server Nov 28, 2024
src/endpoint/endpoint.js Outdated Show resolved Hide resolved
@pull-request-size pull-request-size bot added size/M and removed size/S labels Dec 1, 2024
@shirady shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch 2 times, most recently from 8ad7342 to 28ec0ea Compare December 1, 2024 08:38
@shirady shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch from 28ec0ea to 9ec2dd9 Compare December 2, 2024 16:32
1. The function clusterMetrics might throw an error (with an error message Operation timed out.), in case it fails, there is no try-catch block, and will end with uncaughtException. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed with undefined metrics variable). Also added a json response with the details of the error in case it happens.
2. I noticed that the call prom_reporting.start_server(metrics_port, true); in fork_utils.js did not have await and added it - had to change the function signature of start_workers to have async and update its JSDoc @returns part, and also separate the call and the condition in endpoint.js.
3. Add a comment in the endpoint main for developers regarding implementation (running on the main process, running with multiple forks).
4. Change the handing under the function gather_metrics so we will print the json error object that we returned from the server.
5. Change the printing of "_create_nsfs_report: nsfs_report" so we can see the object (we used to see: "core.server.analytic_services.prometheus_reporting:: _create_nsfs_report: nsfs_report [object Object]").
6. Rename a function (we had a minor typo) from nsfs_io_state_handler to nsfs_io_stats_handler.
7. Change the order inside the function start_server in prometheus_reporting.js and set the if (req.url === '/metrics/nsfs_stats') { because it doesn't use metrics.

Signed-off-by: shirady <[email protected]>
@shirady shirady force-pushed the nsfs-nc-disable-Prometheus-server-start branch from 9ec2dd9 to 09c4c45 Compare December 3, 2024 12:48
@shirady shirady merged commit c03e592 into noobaa:master Dec 4, 2024
10 checks passed
@shirady shirady deleted the nsfs-nc-disable-Prometheus-server-start branch December 4, 2024 06:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants