Segmentation fault while using node_exporter 1.14.3.0 or 1.12.1.0 #16
Comments
I noticed that just before the segmentation fault we see this error. Note that we are using Veritas DMP to manage SAN paths. |
Hi, could you see if you can find the location of the segfault? You may see line numbers in the 'errpt -a' output, or you could enable core dumps and take a look using a debugger like gdb. I can give you more details if needed, and if you are able to share the core dump I could take a look myself. Did this work for you before an upgrade to 1.12.1.0, or is this a new deployment? Does the number of disks/adapters change frequently? Does this happen at startup of the exporter, or does it run for some time before it happens? Thor |
The core gets generated under the root (/) directory.
root@or1xx003[/]# ls -l core
The behavior is the same with the 1.12 or 1.14 node exporter agent.
root@or1xx003[/]# lslpp -l|grep -i node
The number of disks and adapters does not change frequently; they remain pretty static. Whether the node exporter is run with or without arguments, it crashes after 1-2 minutes, leaving the error "Error calling perfstat_diskpath: Invalid argument" on the console. I can share the core file; let me know where I can upload it for you. The majority of our systems run Veritas Cluster, Veritas Volume Manager, and Veritas Dynamic Multi-Pathing (VxDMP), and that is where this issue is observed. Thank you for the quick reply. Regards |
Could you please gzip the core and attach it to this case? |
I noticed some text in the core stating that root login is disabled. We have disabled direct root login on all AIX systems. After enabling direct root login in sshd_config, I see the following in errpt.
root@or1xxx[/]# errpt
root@or1xxx[/]# errpt -a -j A924A5FC
LABEL: CORE_DUMP
Date/Time: Fri May 27 00:41:08 2022
Description
Probable Causes
User Causes
Failure Causes
Detail Data
PROCESSOR ID
ADDITIONAL INFORMATION
Symptom Data
|
Do you have access to gdb on AIX? If so, could you copy the core and the node_exporter_aix binary to a server with gdb available and run 'gdb node_exporter_aix core'. When in gdb, please run 'where' to get a stack trace from the core file. You could also try to run node_exporter_aix with one module enabled at a time, to try to zero in on where the issue is. Also, what version of AIX are you running where it is crashing? (oslevel -s) |
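The "one module at a time" suggestion above could be scripted roughly as follows. This is only a sketch: the flag letters, binary path, and port are copied from the command line reported in this issue (`/usr/local/bin/node_exporter_aix -acMdiPf -p 10051`), the `one_flag_commands` helper is hypothetical, and it assumes the exporter accepts each flag on its own. It only prints the command lines, so nothing is started until you run one of them yourself.

```shell
#!/bin/sh
# Hypothetical helper: print one exporter command line per collector flag,
# so each flag can be tried on its own to see which one triggers the crash.
# Path, port and flag letters are taken from the command line in this issue;
# the assumption that each flag works in isolation is not verified here.
EXPORTER=${EXPORTER:-/usr/local/bin/node_exporter_aix}
PORT=${PORT:-10051}

one_flag_commands() {
    flags=$1
    while [ -n "$flags" ]; do
        rest=${flags#?}          # everything after the first character
        flag=${flags%"$rest"}    # the first character only
        printf '%s -%s -p %s\n' "$EXPORTER" "$flag" "$PORT"
        flags=$rest
    done
}

# Print the seven single-flag command lines for the reported flag set.
one_flag_commands acMdiPf
```

Since the crash was reported to appear after 1-2 minutes, each single-flag run probably needs to be left running for at least a few minutes before moving on to the next flag.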
I have installed gdb and got the following. I ran gdb on other cores as well, and they all state "Program terminated with signal SIGSEGV". Let me know if you need more details from the core.
root@or1xxx001# /opt/freeware/bin/gdb aix_node_exporter core
I recently deployed 1.12.1.0 across all AIX LPARs. Do you recommend upgrading to 1.14.3.0? If I exclude (D) and (d) from the /usr/local/bin/node_exporter_aix command line, the agent doesn't crash, but I end up losing data related to disk queues, disk timers, etc. It now seems I have to run this agent with two different sets of command-line arguments, e.g. one for AIX LPARs without VxDMP and one for AIX LPARs with VxDMP. Let me know if you need more info from the core. Thank you for all the help! Yash |
Hmmm, interesting. I was expecting the names of the functions to be displayed. Could you please try to run this version? It will output some debugging data that could help me locate the issue. Also, try to execute the exporter under gdb, then type 'run' to execute. If it crashes, run 'where' and maybe 'list' as well. I have attached the debug build of the exporter to this comment. |
We are having a similar issue, but with "stock" AIX file systems and a large number of disks. This is the program output. Under gdb it doesn't tell us much more...
For help, type "help".
Thread 2.1 received signal SIGTRAP, Trace/breakpoint trap. |
On AIX 7.2 TL5, we see a segmentation fault on 1.12.1.0. We upgraded to the latest 1.14.3.0, but the behavior remains the same.
[1]+ Segmentation fault (core dumped) /usr/local/bin/node_exporter_aix -acMdiPf -p 10051
The LPAR has many disks and 6 fiber adapters and is quite busy. Can someone help?