LTM shard monitoring timeouts on a per-test basis #10

tytso · 2018-06-05T00:50:47Z

The shard monitoring in the LTM currently assumes a fixed timeout of 1 hour between status updates from the child test appliance. If the last status update occurred more than an hour ago, the monitor process assumes that the test appliance has crashed or wedged, and creates a serial port dump in place of test results.
The reason for the fixed 1 hour timeout is that generic/027 and a few other tests in xfstests are very IOPS-bound and take a while (some runs take longer than 3000 seconds).

If the test appliance were more diligent in reporting the latest test being run, custom timeouts could be set for each test in xfstests, and a kernel crash would be detected much sooner. For example, generic/001 is usually quite fast to run, so if the LTM is aware that the test appliance is running generic/001, the timeout could be somewhere in the range of 20-30 seconds rather than the fixed hour.
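
A minimal sketch of the per-test lookup, assuming the monitor knows the name of the test the appliance last reported. The names `TEST_TIMEOUTS_S`, `timeout_for`, and `shard_is_wedged` are hypothetical and not part of the LTM; the 30 second value for generic/001 is taken from the range suggested above.

```python
import time

# The fixed one-hour timeout currently used for every test.
DEFAULT_TIMEOUT_S = 3600

# Hypothetical per-test timeouts in seconds. Tests that are not listed
# (e.g. generic/027) keep the default.
TEST_TIMEOUTS_S = {
    "generic/001": 30,
}


def timeout_for(test_name):
    """Return the timeout for the last reported test, or the default."""
    if test_name is None:
        return DEFAULT_TIMEOUT_S
    return TEST_TIMEOUTS_S.get(test_name, DEFAULT_TIMEOUT_S)


def shard_is_wedged(last_update_ts, current_test, now=None):
    """True if the last status update is older than the test's timeout."""
    if now is None:
        now = time.time()
    return (now - last_update_ts) > timeout_for(current_test)
```

Falling back to the one-hour default for unknown tests keeps the current behaviour for anything that has not been given its own timeout.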

Tests could also be grouped into several size categories, e.g. "xsmall", "small", "medium", "large", and "xlarge", with one timeout per category.
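
The size categories could be sketched roughly as follows. The category names come from the list above; everything else is an assumption for illustration: the timeout assigned to each category, the assignment of tests to categories, and the names `SizeClass`, `SIZE_TIMEOUT_S`, `TEST_SIZES`, and `timeout_for_size`.

```python
from enum import Enum


class SizeClass(Enum):
    XSMALL = "xsmall"
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"
    XLARGE = "xlarge"


# Hypothetical timeouts in seconds for each category. Only the xlarge
# value is grounded: it matches the existing one-hour timeout.
SIZE_TIMEOUT_S = {
    SizeClass.XSMALL: 30,
    SizeClass.SMALL: 120,
    SizeClass.MEDIUM: 600,
    SizeClass.LARGE: 1800,
    SizeClass.XLARGE: 3600,
}

# Hypothetical assignment of tests to categories.
TEST_SIZES = {
    "generic/001": SizeClass.XSMALL,
    "generic/027": SizeClass.XLARGE,
}


def timeout_for_size(test_name):
    """Look up a test's category; unclassified tests are treated as xlarge."""
    size = TEST_SIZES.get(test_name, SizeClass.XLARGE)
    return SIZE_TIMEOUT_S[size]
```

Treating unclassified tests as xlarge is the conservative choice: a test nobody has categorized yet can never time out sooner than it does today.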

To be even more sophisticated, the timeouts could be adjusted based on the number of CPUs or the size of the scratch disk of that particular test appliance, depending on whether the test is more CPU-bound or IOPS-bound.
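
One possible shape for that adjustment, as a rough sketch only. It assumes a simple inverse-proportional model: a CPU-bound test gets a shorter timeout on an appliance with more CPUs, and an IOPS-bound test gets a shorter timeout on an appliance with a larger scratch disk (on the assumption that a larger disk delivers more IOPS). The reference appliance, the floor value, and the names `Appliance` and `scaled_timeout` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appliance:
    cpus: int
    scratch_gb: int


# Hypothetical appliance shape on which the base timeouts were measured.
REFERENCE = Appliance(cpus=2, scratch_gb=10)

# Never go below this, so a very large appliance does not get a
# timeout shorter than the status reporting interval can support.
MIN_TIMEOUT_S = 10


def scaled_timeout(base_s, bound, appliance, reference=REFERENCE):
    """Scale a base timeout by the relevant resource of the appliance.

    bound is "cpu" for CPU-bound tests and "iops" for IOPS-bound tests.
    """
    if bound == "cpu":
        factor = reference.cpus / appliance.cpus
    elif bound == "iops":
        factor = reference.scratch_gb / appliance.scratch_gb
    else:
        raise ValueError(f"unknown bound: {bound!r}")
    return max(MIN_TIMEOUT_S, round(base_s * factor))
```

For example, `scaled_timeout(600, "cpu", Appliance(cpus=4, scratch_gb=10))` halves a 600 second base timeout on an appliance with twice the reference CPU count. Real scaling is unlikely to be this linear, so the factors would need to be calibrated against actual run times.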
