The shard monitoring in the LTM currently assumes a fixed timeout of 1 hour between status updates from the child test appliance. If the last status update occurred more than an hour ago, the monitor process assumes that the test appliance has crashed or wedged, and creates a serial port dump in place of test results.
The reason for setting a fixed 1 hour timeout is that generic/027 and a few other tests in xfstests are heavily IOPS-bound and take a long time to run (some runs take longer than 3000 seconds).
If the test appliance were more diligent in reporting the latest test being run, custom timeouts could be set for each test in xfstests, and a kernel crash would be detected much sooner. For example, generic/001 is usually quite fast to run, so if the LTM is aware that the test appliance is running generic/001, the timeout could be somewhere in the range of 20-30 seconds rather than the fixed hour.
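As a rough illustration of the per-test idea, here is a minimal sketch in Python. It is not the LTM's actual code: the names `TEST_TIMEOUTS_S`, `timeout_for`, and `appears_wedged` are hypothetical, and the 30 second value for generic/001 is just taken from the range suggested above. Any test without a known timeout falls back to the existing one-hour default.

```python
from typing import Optional

# The existing fixed timeout, kept as the fallback.
DEFAULT_TIMEOUT_S = 3600

# Hypothetical per-test timeouts in seconds; the values are illustrative only.
TEST_TIMEOUTS_S = {
    "generic/001": 30,
}


def timeout_for(current_test: Optional[str]) -> int:
    """Return the timeout for the test the appliance last reported.

    Falls back to the one-hour default when the appliance has not reported
    a test name, or when no custom timeout is known for that test.
    """
    if current_test is None:
        return DEFAULT_TIMEOUT_S
    return TEST_TIMEOUTS_S.get(current_test, DEFAULT_TIMEOUT_S)


def appears_wedged(last_update_s: float, now_s: float,
                   current_test: Optional[str]) -> bool:
    """True if the time since the last status update exceeds the timeout."""
    return now_s - last_update_s > timeout_for(current_test)
```

With this, 45 seconds of silence during generic/001 would already be flagged, while the same silence during generic/027 (no custom entry here) would still be tolerated under the one-hour default.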
Alternatively, tests could be grouped into several size categories, e.g. "xsmall", "small", "medium", "large", and "xlarge".
To be even more sophisticated, the timeouts could be adjusted based on the number of CPUs and the size of the scratch disk of that particular test appliance, and on whether the test is more CPU-bound or IOPS-bound.
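A sketch of how the size categories might map to timeouts, again in Python and again hypothetical: the `SizeClass` enum, the two tables, and every number in them are assumptions for illustration, not measured values. The only constraints taken from this issue are that generic/001 is fast and that generic/027 can exceed 3000 seconds, so it has to sit in the largest bucket.

```python
from enum import Enum


class SizeClass(Enum):
    XSMALL = "xsmall"
    SMALL = "small"
    MEDIUM = "medium"
    LARGE = "large"
    XLARGE = "xlarge"


# Hypothetical timeout per category, in seconds. XLARGE keeps the current
# one-hour limit so that slow tests such as generic/027 are not cut off.
SIZE_TIMEOUT_S = {
    SizeClass.XSMALL: 30,
    SizeClass.SMALL: 120,
    SizeClass.MEDIUM: 600,
    SizeClass.LARGE: 1800,
    SizeClass.XLARGE: 3600,
}

# Hypothetical classification of individual tests.
TEST_SIZE = {
    "generic/001": SizeClass.XSMALL,
    "generic/027": SizeClass.XLARGE,
}


def timeout_for_size(test_name: str) -> int:
    """Look up a test's category; unclassified tests are treated as XLARGE."""
    size = TEST_SIZE.get(test_name, SizeClass.XLARGE)
    return SIZE_TIMEOUT_S[size]
```

Treating unclassified tests as "xlarge" means a missing entry can only delay crash detection, never cause a false crash report.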
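One possible shape for that scaling, as a sketch only. Everything here is an assumption: the reference configuration (2 CPUs, 5 GB scratch disk), the simple linear scaling, the 20 second floor, and in particular the guess that a larger scratch disk means more IOPS (true for some cloud disk types, but not in general). The result is clamped so it never exceeds the current one-hour limit.

```python
DEFAULT_TIMEOUT_S = 3600   # current fixed limit, used as the upper clamp
MIN_TIMEOUT_S = 20         # hypothetical floor to avoid spurious timeouts

# Hypothetical reference appliance on which the base timeouts were chosen.
REF_CPUS = 2
REF_SCRATCH_GB = 5


def scaled_timeout(base_s: int, bound: str, cpus: int, scratch_gb: int) -> int:
    """Scale a base timeout for a specific appliance.

    bound is "cpu" or "iops". CPU-bound tests scale with the CPU count;
    IOPS-bound tests scale with scratch disk size, ASSUMING that a larger
    disk delivers proportionally more IOPS.
    """
    if cpus <= 0 or scratch_gb <= 0:
        raise ValueError("cpus and scratch_gb must be positive")
    if bound == "cpu":
        factor = REF_CPUS / cpus
    elif bound == "iops":
        factor = REF_SCRATCH_GB / scratch_gb
    else:
        raise ValueError(f"unknown bound type: {bound!r}")
    scaled = round(base_s * factor)
    return max(MIN_TIMEOUT_S, min(DEFAULT_TIMEOUT_S, scaled))
```

For example, a CPU-bound test with a 600 second base timeout would get 300 seconds on a 4-CPU appliance under these assumptions. Real scaling factors would need to be derived from observed run times rather than a linear guess.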