-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NC | NSFS | Wrap with try-catch
Prometheus Reporting start_server
#8559
Merged
shirady
merged 1 commit into
noobaa:master
from
shirady:nsfs-nc-disable-Prometheus-server-start
Dec 4, 2024
Merged
NC | NSFS | Wrap with try-catch
Prometheus Reporting start_server
#8559
shirady
merged 1 commit into
noobaa:master
from
shirady:nsfs-nc-disable-Prometheus-server-start
Dec 4, 2024
+46
−15
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
naveenpaul1
reviewed
Nov 27, 2024
naveenpaul1
approved these changes
Nov 28, 2024
shirady
force-pushed
the
nsfs-nc-disable-Prometheus-server-start
branch
2 times, most recently
from
November 28, 2024 06:41
14f0627
to
7c82e83
Compare
romayalon
requested changes
Nov 28, 2024
shirady
changed the title
NC | NSFS | Disable Prometheus Reporting
NC | NSFS | Wrap with Nov 28, 2024
try-catch
Prometheus Reporting start_server
romayalon
reviewed
Nov 28, 2024
romayalon
reviewed
Nov 28, 2024
shirady
force-pushed
the
nsfs-nc-disable-Prometheus-server-start
branch
2 times, most recently
from
December 1, 2024 08:38
8ad7342
to
28ec0ea
Compare
shirady
force-pushed
the
nsfs-nc-disable-Prometheus-server-start
branch
from
December 2, 2024 16:32
28ec0ea
to
9ec2dd9
Compare
romayalon
approved these changes
Dec 3, 2024
1. The function clusterMetrics might throw an error (with an error message Operation timed out.), in case it fails, there is no try-catch block, and will end with uncaughtException. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed with undefined metrics variable). Also added a json response with the details of the error in case it happens. 2. I noticed that the call prom_reporting.start_server(metrics_port, true); in fork_utils.js did not have await and added it - had to change the function signature of start_workers to have async and update its JSDoc @returns part, and also separate the call and the condition in endpoint.js. 3. Add a comment in the endpoint main for developers regarding implementation (running on the main process, running with multiple forks). 4. Change the handing under the function gather_metrics so we will print the json error object that we returned from the server. 5. Change the printing of "_create_nsfs_report: nsfs_report" so we can see the object (we used to see: "core.server.analytic_services.prometheus_reporting:: _create_nsfs_report: nsfs_report [object Object]"). 6. Rename a function (we had a minor typo) from nsfs_io_state_handler to nsfs_io_stats_handler. 7. Change the order inside the function start_server in prometheus_reporting.js and set the if (req.url === '/metrics/nsfs_stats') { because it doesn't use metrics. Signed-off-by: shirady <[email protected]>
shirady
force-pushed
the
nsfs-nc-disable-Prometheus-server-start
branch
from
December 3, 2024 12:48
9ec2dd9
to
09c4c45
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Explain the changes
clusterMetrics
might throw an error (with an error messageOperation timed out.
), in case it fails, there is no try-catch block, and will end withuncaughtException
. To avoid that we wrap it with try-catch block, log an error message and return from the function (we can't proceed withundefined
metrics
variable). Also added a json response with the details of the error in case it happens.prom_reporting.start_server(metrics_port, true);
infork_utils.js
did not haveawait
and added it - had to change the function signature ofstart_workers
to haveasync
and update its JSDoc@returns
part, and also separate the call and the condition inendpoint.js
.endpoint
main for developers regarding implementation (running on the main process, running with multiple forks).gather_metrics
so we will print the json error object that we returned from the server.nsfs_io_state_handler
tonsfs_io_stats_handler
.start_server
inprometheus_reporting.js
and set theif (req.url === '/metrics/nsfs_stats') {
because it doesn't usemetrics
.Issues:
The node is configured with forks. the error comes from
cluster.js
line 59:const err = new Error('Operation timed out.');
which is part of functionclusterMetrics
.clusterMetrics
is called only inprometheus_reporting.js
instart_server
if the conditionfork_enabled
istrue
.prom_reporting.start_server
is called from theendpoint
withfork_enabled
false
(so it's not relevant), and in thefork_utils.js
withfork_enabled
true
.GAP - I'm not sure in which cases a node might face this PANIC error (before this suggested change) - this was the first time I saw that I already run long tests on GPFS machine with forks... we have opened an issue in prom-client (see comment).
Testing Instructions:
To make sure that we didn't harm the metrics in NC please run:
sudo node src/cmd/nsfs --debug 5 --forks 2
start the NSFS server with multiple forks.sudo node src/cmd/manage_nsfs diagnose metrics
see that you get an output (includes 'MetricsStatus', see example here).(1) (quick) we can add an error
Throw new Error (SDSD error)
aftermetrics = await aggregatorRegistry.clusterMetrics();
and restart the server (ctrl + c and rerun it).(2) (demonstrates the timeout) add a sync wait in the code that would create a delay longer than 5 milliseconds: in
fork_utils
undernsfs_io_stats_handler
add this part:and restart the server (ctrl + c and rerun it).
4. Fetch the response with sudo node src/cmd/manage_nsfs diagnose metrics
and also from another tab with
curl -s http://127.0.0.1:7004/metrics/nsfs_stats | jq .`, expect to see: