check_hpasm Nagios Plugin README
---------------------
This plugin checks the hardware health of HP Proliant servers with the
hpasm software installed. It uses the hpasmcli command to acquire the
condition of the system's critical components like cpus, power supplies,
temperatures, fans and memory modules. Newer versions also use SNMP.
* For instructions on installing this plugin for use with Nagios,
see below. In addition, generic instructions for the GNU toolchain
can be found in the INSTALL file.
* For major changes between releases, read the CHANGES file.
* For information on detailed changes that have been made,
read the Changelog file.
* This plugin is self-documenting. Like all plugins that comply with
the basic guidelines for development, it provides detailed help when
invoked with the '-h' or '--help' option.
You can check for the latest plugin at:
http://www.consol.de/opensource/nagios/check-hpasm
Send mail to [email protected] for assistance.
Please include the OS type and version that you are using.
Also, run the plugin with the '-v' option and provide the resulting
version information. Of course, there may be additional diagnostic information
required as well. Use good judgment.
How to "compile" the check_hpasm script.
--------------------------------------------------------
1) Run the configure script to initialize variables and create a Makefile, etc.
./configure --prefix=BASEDIRECTORY --with-nagios-user=SOMEUSER --with-nagios-group=SOMEGROUP --with-perl=PATH_TO_PERL --with-noinst-level=LEVEL --with-degrees=UNIT --enable-perfdata --enable-hpacucli
a) Replace BASEDIRECTORY with the path of the directory under which Nagios
is installed (default is '/usr/local/nagios')
b) Replace SOMEUSER with the name of a user on your system that will be
assigned permissions to the installed plugins (default is 'nagios')
c) Replace SOMEGROUP with the name of a group on your system that will be
assigned permissions to the installed plugins (default is 'nagios')
d) Replace PATH_TO_PERL with the path where a perl binary can be found.
Besides the system wide perl you might have installed a private perl
just for the nagios plugins (default is the perl in your path).
e) Replace LEVEL with one of ok, warning, critical or unknown.
If the required hpasm-rpm is not installed, the check_hpasm plugin
will exit with the level specified. If you chose ok, the message
will say "ok - .... hpasm is not installed". This is different from
the "ok - hardware working fine" if hpasm was found.
The default is to treat a missing hpasm package as ok.
f) Replace UNIT with one of celsius or fahrenheit. The hpasmcli "show temp"
prints temperatures both in units of celsius and fahrenheit. With the
--with-degrees option you can decide which units will be shown in an
alarm message.
The default is "celsius".
g) You can tell check_hpasm to output performance data by default if
you call configure with the --enable-perfdata option.
h) You can tell check_hpasm to check the raid status with the hpacucli command
if you call configure with the --enable-hpacucli option.
You need the hpacucli rpm.
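For illustration, a complete call using the defaults named in the list above could be assembled like this. The variable names are made up for this sketch, the /usr/bin/perl fallback is an assumption, and the two optional features use the --enable- spelling given in g) and h). The command is only printed, not executed, so you can review it first.

```shell
#!/bin/sh
# Defaults as described in a) to f); adjust them to your installation.
PREFIX=/usr/local/nagios
NAGIOS_USER=nagios
NAGIOS_GROUP=nagios
# "the perl in your path"; the fallback path is an assumption of this sketch
PERL_BIN=$(command -v perl || echo /usr/bin/perl)
NOINST_LEVEL=ok
DEGREES=celsius

# Assemble the command line and print it instead of running it.
CONFIGURE_CMD="./configure --prefix=$PREFIX --with-nagios-user=$NAGIOS_USER --with-nagios-group=$NAGIOS_GROUP --with-perl=$PERL_BIN --with-noinst-level=$NOINST_LEVEL --with-degrees=$DEGREES --enable-perfdata --enable-hpacucli"
printf '%s\n' "$CONFIGURE_CMD"
```

Leave out --enable-perfdata or --enable-hpacucli if you do not want the respective feature.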
2) "Compile" the plugin with the following command:
make
This will produce a "check_hpasm" script. You will also find
a "check_hpasm.pl", which you can ignore. It is the template for
the build and contains placeholders that are replaced during
the make process.
3) Install the compiled plugin script with the following command:
make install
The installation procedure will attempt to place the plugin in a
'libexec/' subdirectory in the base directory you specified with
the --prefix argument to the configure script.
4) Verify that your configuration files for Nagios contain
the correct paths to the new plugin.
5) Add this line to /etc/sudoers:
nagios ALL=NOPASSWD: /sbin/hpasmcli
or this, if you also installed the hpacucli package
nagios ALL=NOPASSWD: /sbin/hpasmcli, /usr/sbin/hpacucli
Command line parameters
-----------------------
-v, --verbose
Increased verbosity will print how check_hpasm communicates with the
hpasm daemon and which values were acquired.
-t, --timeout
The number of seconds after which the plugin will abort.
-b, --blacklist
If some components of your system are missing (most often the secondary
power supply bay is empty) and you tolerate this, blacklist the
missing/failed component to avoid false alarms.
The value for this option is a slash-separated list of components to
ignore.
Example: -b p:1,2/f:2/t:3,4/c:1/d:0-1,0-2
means: ignore power supplies #1 and #2, fan #2, temperature #3 and #4,
cpu #1 and dimms #1 and #2 in cartridge #0.
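The structure of such a blacklist value can be illustrated with a small shell sketch. This is not the plugin's own parser (check_hpasm is a Perl script); the function name parse_blacklist and the output format are made up for this example. It only shows how the slashes, colons and commas nest.

```shell
#!/bin/sh
# Hypothetical helper: split a blacklist value into one
# "type item" pair per line.
parse_blacklist() {
    bl_old_ifs=$IFS
    set -f                          # no filename globbing while splitting
    IFS=/
    for bl_group in $1; do          # groups are separated by slashes
        bl_type=${bl_group%%:*}     # text before the first colon, e.g. "p"
        bl_items=${bl_group#*:}     # text after the first colon, e.g. "1,2"
        IFS=,
        for bl_item in $bl_items; do
            printf '%s %s\n' "$bl_type" "$bl_item"
        done
    done
    set +f
    IFS=$bl_old_ifs
}

# The example value from above
parse_blacklist "p:1,2/f:2/t:3,4/c:1/d:0-1,0-2"
```

Each output line names one ignored component, for example "p 1" for power supply #1 or "d 0-2" for dimm #2 in cartridge #0.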
-c, --customthresh
Override the machine-default temperature thresholds.
Example: -c 1:60/4:80/5:50
Sets limit for temperature 1 to 60 degrees, temperature 4 to 80 degrees
and temperature 5 to 50 degrees. You get the consecutive numbers by
calling check_hpasm -v
...
checking temperatures
1 processor_zone temperature is 46 (62 max)
2 cpu#1 temperature is 43 (73 max)
3 i/o_zone temperature is 54 (68 max)
4 cpu#2 temperature is 46 (73 max)
5 power_supply_bay temperature is 38 (55 max)
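How an override replaces a machine default can be sketched in a few lines of shell. The helper effective_max is hypothetical and not part of check_hpasm; the sensor numbers and default limits are taken from the verbose output above.

```shell
#!/bin/sh
# Hypothetical helper: print the limit for sensor number $2.
# $1 is a --customthresh value, $3 the machine default for that sensor.
effective_max() {
    et_old_ifs=$IFS
    et_result=$3                    # start with the machine default
    set -f
    IFS=/
    for et_pair in $1; do           # pairs look like "4:80"
        if [ "${et_pair%%:*}" = "$2" ]; then
            et_result=${et_pair#*:} # an override for this sensor exists
        fi
    done
    set +f
    IFS=$et_old_ifs
    printf '%s\n' "$et_result"
}

effective_max "1:60/4:80/5:50" 1 62   # prints 60
effective_max "1:60/4:80/5:50" 2 73   # prints 73, there is no override for #2
```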
-p, --perfdata
Add performance data to the output even if you did not configure check_hpasm
with --enable-perfdata in step 1.
SNMP and Memory Modules
-----------------------
Older hardware does not always show valuable information when queried for
the health of memory modules. Maybe it's because older modules do not support
error checking at all.
1. no cpqHeResMemModule
---------------------------------------------------------------------------
2. collapsed cpqHeResMemModule
---------------------------------------------------------------------------
Some (older) systems do not support the cpqHeResMemModuleEntry table.
Either there is no oid under 1.3.6.1.4.1.232.6.2.14.11.1 at all,
or there is only a single oid, as in the last line of the example below.
Example:
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
                                ^-- module number
                              ^-- cartridge number (0 = system board)
                                             ^-- size
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
I compared 300 systems and found out that with
1.3.6.1.4.1.232.6.2.14.11.1.<no1>.<no2>.<no3> = <no4>
no1 is always 1
no2 is always 0
no3 is the number of memory slots (including the empty ones).
no4 is always 0. It is probably the health status of the
overall memory subsystem. I don't know.
I will implement 0 = ok, not 0 = ask compaq
cpqSiMemECCStatus provides no usable information. All my test systems
showed 0 which is an undocumented value.
function get_size(cpqHeResMemModuleEntry) will return 1.
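The way cartridge number, module number and size are read from the snmpwalk lines in the example above can be sketched like this. The function name split_mem_index is made up for this illustration and the plugin itself does this in Perl; the sample lines are copied from the example.

```shell
#!/bin/sh
# Hypothetical helper: read snmpwalk-style lines on stdin and print
# "cartridge module value" for each of them.
split_mem_index() {
    while read -r mi_oid mi_eq mi_type mi_value; do
        [ "$mi_eq" = "=" ] || continue   # skip lines that are not "oid = TYPE: value"
        mi_module=${mi_oid##*.}          # last sub-identifier
        mi_rest=${mi_oid%.*}             # oid without the last sub-identifier
        mi_cartridge=${mi_rest##*.}      # second to last sub-identifier
        printf '%s %s %s\n' "$mi_cartridge" "$mi_module" "$mi_value"
    done
}

split_mem_index <<'EOF'
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
EOF
```

For the two sample lines this prints "0 1 524288" and "0 3 0", i.e. module 1 on the system board has a size of 524288 and slot 3 is empty.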
3. cpqHeResMemModule containing crap
---------------------------------------------------------------------------
grepping for cpqSiMemBoardSize shows 4 modules
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
grepping for cpqHeResMemEntry shows one module with zero values
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.2.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.3.0.0 = ""
iso.3.6.1.4.1.232.6.2.14.11.1.4.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.5.0.0 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.6.0.0 = Hex-STRING: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
4. cpqHeResMemModuleEntry and cpqSiMemModuleEntry use different table indexes
---------------------------------------------------------------------------
cpqSiMemBoardIndex 1.3.6.1.4.1.232.2.2.4.5.1.1
cpqSiMemModuleIndex 1.3.6.1.4.1.232.2.2.4.5.1.2
cpqHeResMemBoardIndex 1.3.6.1.4.1.232.6.2.14.11.1.1
cpqHeResMemModuleIndex 1.3.6.1.4.1.232.6.2.14.11.1.2
cpqSiMemBoardIndex
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.6 = INTEGER: 0
cpqHeResMemBoardIndex
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.1 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.2 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.3 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.4 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.5 = INTEGER: 0
SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.6 = INTEGER: 0
It is not possible to use the SNMP-table-indices to identify the
corresponding he-entry. Matching is done with nested loops.
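The nested-loop idea can be sketched as follows. This is an illustration only, not the plugin's Perl code: the function name match_entries, the three-column input format and the condition values are assumptions. The point is that entries are paired by the board and module numbers stored in the column values, not by the OID suffixes.

```shell
#!/bin/sh
# "board module size" as found in the cpqSi table (sizes from the examples above)
si_entries='0 1 524288
0 2 262144'
# "board module condition" as found in the cpqHeRes table
# (the condition values are made up for this sketch)
he_entries='0 2 degraded
0 1 ok'

# Hypothetical helper: for every si entry, search all he entries for
# the one with the same board and module number.
match_entries() {
    printf '%s\n' "$1" | while read -r s_board s_module s_size; do
        printf '%s\n' "$2" | while read -r h_board h_module h_cond; do
            if [ "$s_board" = "$h_board" ] && [ "$s_module" = "$h_module" ]; then
                printf 'board %s module %s: size %s, condition %s\n' \
                    "$s_board" "$s_module" "$s_size" "$h_cond"
            fi
        done
    done
}

match_entries "$si_entries" "$he_entries"
```

The order of the he entries does not matter, which is the reason for comparing every pair instead of relying on positions.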
5. even worse: cpqHeResMemBoardIndex and cpqSiMemBoardIndex don't match
---------------------------------------------------------------------------
cpqSiMemBoardIndex
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.1 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.2 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.3 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.4 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.5 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.6 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.7 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.2.8 = INTEGER: 2
iso.3.6.1.4.1.232.2.2.4.5.1.1.3.1 = INTEGER: 3
cpqHeResMemBoardIndex
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.1 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.2 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.3 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.4 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.5 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.7 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.0.8 = INTEGER: 0
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.1 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.2 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.3 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.4 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.5 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.6 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.7 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.1.8 = INTEGER: 1
iso.3.6.1.4.1.232.6.2.14.11.1.1.2.1 = INTEGER: 2
Redundant fans
-----------------------
I saw one old server which had only half of the possible fans installed.
Fan#                             1     2      3     4      5     6
cpqHeFltTolFanPresent            yes   no     yes   no     yes   no
cpqHeFltTolFanRedundant          no    no     no    no     no    no
cpqHeFltTolFanRedundantPartner   2     1      4     3      6     5
cpqHeFltTolFanCondition          ok    other  ok    other  ok    other
cpqHeFltTolFanLocation           cpu   cpu    cpu   cpu    io    io
Normally this would result in
...
fan #1 (cpu) is not redundant
fan #2 (cpu) is not redundant
fan #3 (cpu) is not redundant
fan #4 (cpu) is not redundant
fan #5 (ioboard) is not redundant
fan #6 (ioboard) is not redundant
WARNING - fan #1 (cpu) is not redundant, fan #2 (cpu) is not redundant, fan #3 (cpu) is not redundant, fan #4 (cpu) is not redundant, fan #5 (ioboard) is not redundant, fan #6 (ioboard) is not redundant
However, it was the server owner's decision to install only one fan per location instead of fan pairs, so for him this is a false alert.
With --ignore-fan-redundancy, check_hpasm only looks at cpqHeFltTolFanCondition and ignores dependencies between two fans, so the result is:
fan 1 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 2
fan 3 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 4
fan 5 speed is normal, pctmax is 50%, location is ioboard, redundance is no, partner is 6
OK - System: 'proliant ml370 g3', ...
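The difference between the two modes can be sketched in shell. This is a simplified illustration, not the plugin's code: the function name fan_verdict is made up, and skipping empty fan bays when redundancy is ignored is an assumption inferred from the sample output above. The table rows are the values of the server described above.

```shell
#!/bin/sh
# Hypothetical helper.
# $1 present (yes/no), $2 redundant (yes/no), $3 condition,
# $4 ignore fan redundancy (yes/no)
fan_verdict() {
    if [ "$4" = "yes" ] && [ "$1" = "no" ]; then
        echo "skipped"          # empty bay, nothing to judge (assumption)
    elif [ "$4" != "yes" ] && [ "$2" = "no" ]; then
        echo "not redundant"    # default mode: a missing partner is reported
    elif [ "$3" = "ok" ]; then
        echo "ok"
    else
        echo "condition is $3"
    fi
}

# fan number, present, redundant, condition
while read -r fan present redundant condition; do
    printf 'fan %s: %s\n' "$fan" \
        "$(fan_verdict "$present" "$redundant" "$condition" yes)"
done <<'EOF'
1 yes no ok
2 no no other
3 yes no ok
4 no no other
5 yes no ok
6 no no other
EOF
```

With the last argument set to yes, fans 1, 3 and 5 come out as ok and the empty bays are skipped. With no, every fan is reported as not redundant, as in the WARNING shown above.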
An SNMP forwarding trick
------------------------
local - the host where check_hpasm runs
remote - a host from which the proliant can be reached
proliant - the server where the snmp agent runs
remote:
ssh -R6667:localhost:6667 local
socat tcp4-listen:6667,reuseaddr,fork UDP:proliant:161
local:
socat udp4-listen:161,reuseaddr,fork tcp:localhost:6667
check_hpasm --hostname 127.0.0.1
Sample data from real machines
------------------------------
hpasmcli=$(which hpasmcli)
hpacucli=$(which hpacucli)
for i in server powersupply fans temp dimm
do
  "$hpasmcli" -s "show $i" | while read -r line
  do
    printf "%s %s\n" "$i" "$line"
  done
done
if [ -x "$hpacucli" ]; then
  for i in config status
  do
    "$hpacucli" ctrl all show "$i" | while read -r line
    do
      printf "%s %s\n" "$i" "$line"
    done
  done
fi
If you think check_hpasm is not working correctly, please run the above script
and send me the output. It's also helpful to see the output of snmpwalk
snmpwalk .... 1.3.6.1.4.1.232
--
Gerhard Lausser <[email protected]>