-
Notifications
You must be signed in to change notification settings - Fork 5
/
md_monitor.man
309 lines (301 loc) · 8.52 KB
/
md_monitor.man
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
.TH "md_monitor" "8" "Thu Nov 5 2015" "md_monitor 6.6"
.de bu
.IP \(bu
..
.SH NAME
md_monitor \- MD device monitor
.SH SYNOPSIS
.B md_monitor
[\fI-d\fR|\fI--daemonize\fR]
[\fI-f \fBfile\fR|\fI--logfile \fBfile\fR]
[\fI-s\fR|\fI--syslog\fR]
[\fI-e \fBnum\fR|\fI--expires=\fBnum\fR]
[\fI-P \fBnum\fR|\fI--process-limit=\fBnum\fR]
[\fI-O \fBnum\fR|\fI--open-file-limit=\fBnum\fR]
[\fI-m\fR|\fI--fail-mirror\fR]
[\fI-o\fR|\fI--fail-disk\fR]
[\fI-r \fBnum\fR|\fI--retries=\fBnum\fR]
[\fI-p \fBprio\fR|\fI--log-priority=\fBprio\fR]
[\fI-v\fR|\fI--verbose\fR]
[\fI-y\fR|\fI--check-in-sync\fR]
[\fI-c \fBcmd\fR|\fI--command=\fBcmd\fR]
[\fI-V\fR|\fI--version\fR]
[\fI-h\fR|\fI--help\fR]
.SH DESCRIPTION
.PP
The \fBmd_monitor\fR monitors the component devices of each MD array
for I/O issues. It will update the monitored MD arrays on each status
change, setting devices to 'faulty' or re-integrate working devices.
.SH OPTIONS
.PP
\fBmd_monitor\fR recognizes the following command-line options:
.TP
\fI-c \fBcmd\fR, \fI--command=\fBcmd\fR
Send command \fBcmd\fR to daemon.
.TP
\fI-d\fR, \fI--daemonize\fR
Start \fBmd_monitor\fR in background
.TP
\fI-e \fBnum\fR, \fI--expires=\fBnum\fR
Set failfast_expires to \fBnum\fR.
.TP
\fI-f \fIfile\fR, \fI--logfile=\fBfile\fR
Write logging information into \fBfile\fR instead of stdout
.TP
\fI-h\fR, \fI--help\fR
Display md_monitor usage information.
.TP
\fI-m\fR, \fI--fail-mirror\fR
Fail and reset the entire mirror half when one device failed.
This is the default.
.TP
\fI-O \fBnum\fR, \fI--open-file-limit=\fBnum\fR
Set maximum number of open files (RLIMIT_NOFILE, see \fBgetrlimit\fR(2))
to \fBnum\fR. Default is 4096.
.TP
\fI-o\fR, \fI--fail-disk\fR
Only fail the affected disk when one device failed.
This is the opposite of \fI--fail-mirror\fR.
.TP
\fI-P \fBnum\fR, \fI--process-limit=\fBnum\fR
Set maximum number of processes (RLIMIT_NPROC, see \fBgetrlimit\fR(2))
to \fBnum\fR.
.TP
\fI-p \fBprio\fR, \fI--log-priority=\fBprio\fR
Set logging priority to \fBprio\fR.
.TP
\fI-r \fBnum\fR, \fI--retries=\fBnum\fR
Set failfast_retries to \fBnum\fR.
.TP
\fI-s\fR, \fI--syslog\fR
Write logging information to syslog.
.TP
\fI-t \fBsecs\fR, \fI--check-timeout=\fBsecs\fR
Run path checker every \fBsecs\fR seconds. Default is 1.
.TP
\fI-v\fR, \fI--verbose\fR
Increase logging priority
.TP
\fI-y\fR, \fI--check-in-sync\fR
Run path checkers for 'in_sync' devices. Without this option
path checkers will be stopped whenever a device is detected
to be 'in_sync'. They will be re-started once a device has
been marked as 'faulty' or 'timeout'.
.TP
\fI-V\fR, \fI--version\fR
Display md_monitor version information.
.SH MD_MONITOR COMMAND MODE
When specifying \fI--command\fR the \fBmd_monitor\fR program connects
to a already running \fBmd_monitor\fR program and send a pre-defined
command. The command has the following syntax:
.TP
\fIcmd\fR:\fImd\fR@\fIdev\fR
.PP
The following values for \fIcmd\fR are recognised. If not specified
otherwise, \fImd\fR needs to be the device node of an existing MD array.
.TP
\fBShutdown\fR
Shutdown \fBmd_monitor\fR; \fImd\fR argument should be /dev/console
.TP
\fBRebuildStarted\fR
Rebuild has started on array \fImd\fR.
.TP
\fBRebuildFinished\fR
Rebuild has finished on array \fImd\fR.
.TP
\fBDeviceDisappeared\fR
Array \fImd\fR has been stopped; \fBmd_monitor\fR will stop
monitoring the component devices for array \fImd\fR.
.TP
\fBNewArray\fR
Array \fImd\fR has been started. This event is ignored
by md_monitor; new arrays will be detected via uevents.
.TP
\fBFail\fR
MD detected a failure on the component device \fIdev\fR of array
\fImd\fR. \fBmd_monitor\fR will re-check the device every
\fIfailfast_expires\fR seconds.
.TP
\fBFailSpare\fR
MD detected a failure on the spare device \fIdev\fR of array
\fImd\fR. \fBmd_monitor\fR will re-check the device every
\fIfailfast_expires\fR seconds.
.TP
\fBRemove\fR
The component device \fIdev\fR has been removed
from the MD array \fImd\fR. \fBmd_monitor\fR will stop
monitoring this device.
.TP
\fBSpareActive\fR
MD has integrated the device \fIdev\fR into array
\fImd\fR. \fBmd_monitor\fR will re-start monitoring of this device
every \fIfailfast_expires\fR seconds. The check interval will be
increased for each successful check up to a maximum of
\fIfailfast_expires\fR * \fIfailfast_retries\fR seconds.
.TP
\fBArrayStatus\fR
Return the current internal status of the monitored devices.
.TP
\fBMirrorStatus\fR
Return the status of the MD component devices in abbreviated form.
Each character represents the status of the MD component device
at that position. For the possible states see the next paragraph.
.TP
\fBMonitorStatus\fR
Return the current I/O status of the monitored devices in
abbreviated form. Each character represents the I/O status
of the monitored device in abbreviated form.
.SH DEVICE STATUS DISPLAY
\fBmd_monitor\fR will be displaying state information about the
monitored devices when the CLI command \fIMirrorStatus\fR or
\fIMonitorStatus\fR is sent. Each character of the returned string
represents the state of the device at that location.
.PP
The possible states for \fIMirrorStatus\fR are:
.TP
\fB.\fR
Unknown
.TP
\fBA\fR
In_Sync
.TP
\fBW\fR
Faulty
.TP
\fBT\fR
Timeout
.TP
\fBS\fR
Spare
.TP
\fB-\fR
Removed
.TP
\fBR\fR
Recovery pending
.TP
\fBP\fR
Removal pending
.TP
\fBB\fR
Blocked
.PP
\fBR\fR and \fBP\fR are intermediate states, which are set by
\fBmd_monitor\fR whenever a command has been sent to mdadm, but no
notification has been received yet.
\fBB\fR is set when MD attempts to fail the second half of the mirror
when the first half is already failed. MD will hold off I/O to the
entire mirror until the second half is useable again.
.PP
The possible states for \fIMonitorStatus\fR are:
.TP
\fB.\fR
Unknown
.TP
\fBX\fR
MD will be stopped
.TP
\fBA\fR
I/O ok
.TP
\fBW\fR
I/O failed
.TP
\fBP\fR
I/O pending
.TP
\fBT\fR
I/O timeout
\fB-\fR
Removed
\fBS\fR
Spare
\fBR\fR
Recovery
.PP
\fBP\fR and \fBT\fR describe the same condition, ie I/O has been
stalled. The state will switch from \fBP\fR to \fBT\fR when the
timeout as set by \fIfailfast_expires\fR * \fIfailfast_retries\fR
seconds has expired. \fB-\fR, \fBS\fR, and \fBR\fR, are steps MD
takes to recover a device; first the device will be removed, then
it will be re-added as a 'spare' device, and then recovery will be
starting for re-adding the spare device into the MD array.
.SH THEORY OF OPERATION
\fBmd_monitor\fR sets up a path checker thread for each MD component
device. This path checker will issue every \fIcheck-time\fR seconds an
asynchronous I/O request to the device. It will then wait up to
\fIfailfast_expires\fR * \fIfailfast_retries\fR seconds for this I/O
to complete.
If no response has been received during that time, the monitor status
for this path is set to \fII/O timeout\fR. If the I/O completed the
monitor status for this path will be set to \fII/O ok\fR or \fII/O failed\fR,
depending on whether the I/O completed without error or not.
If the path checker has been interrupted during waiting, the monitor
status for this path will be set to \fII/O pending\fR.
After the monitor status has been updated, the path checker thread will
update the MD status for this device and invoke an action, depending on
these two states.
If \fIcheck-in-sync\fR has been specified the path checker continue to
run even for \fIin_sync\fR paths. Otherwise the path checker be stopped
when a path is marked as \fIin_sync\fR.
Path checkers will be restarted whenever a device is marked
as \fIfaulty\fR or \fItimeout\fR.
.SH MDADM INTEGRATION
\fBmd_monitor\fR listens to udev events for any device changes. It
is designed to integrate into MD via the \fI\-\-monitor\fR
functionality of \fBmdadm\fR.
.PP
To use this function \fBmdadm\fR needs to be started with
.TP
\fBmdadm --monitor --scan --program=\fImd_script\fR
.PP
where \fImd_script\fR is a bash script containing eg:
.RS 1
#!/bin/bash
.br
# MD monitor script
.br
#
.br
.br
EVENT=$1
.br
MD=$2
.br
DEV=$3
.br
.br
/sbin/md_monitor -c "${EVENT}:${MD}@${DEV}"
.br
.RE
.PP
A default \fImd_script\fR is installed at
\fR/usr/share/misc/md_notify_device.sh\fR.
.PP
It is recommended to use an \fI/etc/mdadm.conf\fR configuration file
when using \fBmd_monitor\fR to monitor MD arrays.
To enable automatic device assembly into MD arrays the configuration
file should include the lines:
.RS 1
.fC
.br
POLICY action=re-add
.br
AUTO -all
.br
.fR
.RE
.PP
It is recommended to include these line when using md_monitor.
.SH VERSIONS
This manual page documents md_monitor version 4.26.
.SH FILES
.TP
.I /usr/share/misc/md_notify_device.sh
Default \fBmd_monitor\fR script.
.TP
.I /etc/mdadm.conf
MD configuration file
.SH SEE ALSO
.IR
mdadm(8), mdadm.conf(7)