Runtime Failure Tolerance
Author: Stefan Eilemann
State: Design
[TOC]
The goal of this feature is to extend Equalizer to handle the failure of resources at runtime. The most common cause is the failure of a node process, due either to hardware problems or to a programming error.
All blocking operations in Equalizer will throw an eq::base::timeoutError exception. The client and server library will catch the appropriate exceptions and handle them as needed. The timeout is a global parameter.
The following subsections list the blocking operations in various parts of the Equalizer code and the necessary changes to handle timeout situations.
The RequestHandler::waitRequest and Monitor::wait methods throw a timeout exception. This is a prerequisite for many of the higher-level operations listed below. Locks will keep blocking indefinitely; a TimedLock should be used for operations which may time out, as opposed to the traditional lock usage for mutual exclusion.
The request handler already uses a timed lock for blocking operations. A new value EQ_TIMEOUT_DEFAULT will be the default for timeout parameters; it resolves to the default timeout, which is a global parameter. The request is deregistered upon timeout.
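A minimal sketch of this behaviour, assuming a hypothetical TimedLock whose set() returns false when the timeout expires, and a TimeoutError type standing in for eq::base::timeoutError (all names illustrative, not the actual eq::base API):

```cpp
#include <stdint.h>
#include <map>
#include <stdexcept>

// Hypothetical helper types; illustrative only, not the real eq::base API.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

struct TimedLock
{
    bool set( uint32_t timeoutMS ); // assumed: returns false on timeout
};

static const uint32_t EQ_TIMEOUT_DEFAULT = 0xfffffffeu;

struct Request
{
    TimedLock lock;
    void*     result;
};

class RequestHandler
{
public:
    void* waitRequest( const uint32_t requestID,
                       const uint32_t timeout = EQ_TIMEOUT_DEFAULT )
    {
        Request* request = _requests[ requestID ];

        // EQ_TIMEOUT_DEFAULT resolves to the global timeout parameter.
        const uint32_t time = ( timeout == EQ_TIMEOUT_DEFAULT ) ?
                                  _globalTimeout : timeout;

        if( !request->lock.set( time ))     // blocks; false on timeout
        {
            unregisterRequest( requestID ); // deregister upon timeout
            throw TimeoutError();
        }

        void* const result = request->result;
        unregisterRequest( requestID );
        return result;
    }

    void unregisterRequest( const uint32_t requestID )
        { _requests.erase( requestID ); }

private:
    std::map< uint32_t, Request* > _requests;
    uint32_t                       _globalTimeout;
};
```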
The monitor wait operations shall use pthread_cond_timedwait with the given timeout. Attention: the timeout is an absolute time (cf. TimedLock).
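A sketch of how the relative timeout can be converted to the absolute time expected by pthread_cond_timedwait, using a hypothetical TimeoutError type; the deadline is computed once before the wait loop, so spurious wakeups do not extend the total wait:

```cpp
#include <pthread.h>
#include <time.h>
#include <cerrno>
#include <stdexcept>

// Hypothetical exception type; the real eq::base::timeoutError may differ.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "operation timed out" ) {}
};

class Monitor
{
public:
    Monitor() : _value( 0 )
    {
        pthread_mutex_init( &_mutex, 0 );
        pthread_cond_init( &_cond, 0 );
    }

    void set( const long value )
    {
        pthread_mutex_lock( &_mutex );
        _value = value;
        pthread_cond_broadcast( &_cond );
        pthread_mutex_unlock( &_mutex );
    }

    // Wait until the monitor reaches 'value', or throw on timeout.
    void waitEQ( const long value, const unsigned timeoutMS )
    {
        // pthread_cond_timedwait expects an absolute time, so the
        // relative timeout is converted to a deadline up front.
        timespec deadline;
        clock_gettime( CLOCK_REALTIME, &deadline );
        deadline.tv_sec  += timeoutMS / 1000;
        deadline.tv_nsec += ( timeoutMS % 1000 ) * 1000000L;
        if( deadline.tv_nsec >= 1000000000L )
        {
            ++deadline.tv_sec;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock( &_mutex );
        while( _value != value )
        {
            const int result = pthread_cond_timedwait( &_cond, &_mutex,
                                                       &deadline );
            if( result == ETIMEDOUT )
            {
                pthread_mutex_unlock( &_mutex );
                throw TimeoutError();
            }
        }
        pthread_mutex_unlock( &_mutex );
    }

private:
    pthread_mutex_t _mutex;
    pthread_cond_t  _cond;
    long            _value;
};
```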
On most timeout errors, the corresponding Node has to be disconnected. The individual operations mentioned below qualify how this disconnect is effected.
All connections performing blocking reads (readSync) use a timeout and throw a timeout exception. The timeout applies to an atomic read, that is, a full packet read may take longer than the timeout, as long as reading each individual chunk happens within the timeout.
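The distinction matters for large packets. A sketch of a receive loop under this rule, using a hypothetical Connection interface; readSync's signature and the TimeoutError type are assumptions:

```cpp
#include <cstddef>
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "read timed out" ) {}
};

// Minimal hypothetical interface; not the actual co::Connection API.
struct Connection
{
    // Reads up to 'bytes' bytes. Throws TimeoutError if no data arrives
    // within timeoutMS; returns 0 if the connection was closed.
    virtual size_t readSync( void* buffer, size_t bytes,
                             unsigned timeoutMS ) = 0;
    virtual ~Connection() {}
};

// Reads a full packet. The timeout is per chunk: the whole packet may
// legitimately take longer, as long as each individual read completes
// within timeoutMS.
bool readPacket( Connection& connection, void* buffer, size_t size,
                 const unsigned timeoutMS )
{
    char* ptr = static_cast< char* >( buffer );
    while( size > 0 )
    {
        const size_t read = connection.readSync( ptr, size, timeoutMS );
        if( read == 0 )     // closed connection: incomplete receive
            return false;
        ptr  += read;
        size -= read;
    }
    return true;
}
```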
When Node::handleData detects an incomplete receive or catches a timeout exception, it drops the packet and disconnects the node. The timeout is not rethrown, since the ReceiverThread is an internal thread. The same applies to disconnect events from the ConnectionSet.
All Connection::write operations use a timeout and throw a timeout exception or return false on a closed connection (to be checked which is more feasible). All sends have to pass through the appropriate node so that the timeout exception can be handled in the proper place, i.e., no direct call to Connection::send is made. Node::send catches the timeout exception and disconnects the node.
The timeout is rethrown and caught by all internal threads, i.e., the CommandThread. Note that the ReceiverThread should never send data. Methods writing data should catch the exception, clean up any pending state and rethrow it.
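A sketch of such a send path, using hypothetical Node and Connection types; the point is that every write funnels through the node, which implements the disconnect-and-rethrow policy in one place:

```cpp
#include <cstddef>
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "write timed out" ) {}
};

struct Connection // hypothetical; write() may throw TimeoutError
{
    void write( const void* data, size_t bytes, unsigned timeoutMS );
};

class Node
{
public:
    // All sends pass through here; no direct Connection::send calls.
    void send( const void* packet, const size_t size )
    {
        try
        {
            _connection.write( packet, size, _timeoutMS );
        }
        catch( const TimeoutError& )
        {
            disconnect(); // the unresponsive peer is dropped...
            throw;        // ...and internal threads (e.g. the
                          // CommandThread) catch the rethrown timeout
        }
    }

    void disconnect(); // closes the connection, updates the node state

private:
    Connection _connection;
    unsigned   _timeoutMS;
};
```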
The enter function catches the timeout exception from its wait operation, disconnects the master node and rethrows the exception. The barrier code has to handle late enter requests.
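A sketch of the barrier entry under this policy, using hypothetical Barrier, Monitor and Node types (the wait is assumed to throw TimeoutError, cf. the monitor sketch above):

```cpp
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "barrier timed out" ) {}
};

struct Node    { void disconnect(); };             // hypothetical peer handle
struct Monitor { void waitGE( unsigned value ); }; // throws TimeoutError

class Barrier
{
public:
    void enter()
    {
        _sendEnterRequest();              // announce this node's entry
        try
        {
            _leftNodes.waitGE( _height ); // blocks; may throw TimeoutError
        }
        catch( const TimeoutError& )
        {
            _master->disconnect(); // the master did not release the barrier
            throw;                 // let the caller handle the failure
        }
        // Note: the master must still answer late enter requests arriving
        // after the barrier was already released.
    }

private:
    void _sendEnterRequest();

    Node*    _master;
    Monitor  _leftNodes;
    unsigned _height;
};
```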
Currently the RSP implementation closes the multicast group when a send operation times out. The new implementation shall simply disconnect the peers from which acknowledgements are missing. The exit of the node will be announced on the multicast group for a faster disconnect on the other nodes.
Nodes which do not receive any packets after a repeated number of NAcks will disconnect themselves.
When using more than 63 connections in a Windows cluster, the ConnectionSet spawns worker threads, each operating on a subset of up to 63 connections. When signalling data to the main thread, they wait for the main thread to process the event. Since this wait does not block the application, they will retry the operation indefinitely upon timeout and consume the exception.
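A sketch of this retry loop, with illustrative names; the timed-out signal is simply retried because the application itself is never blocked by it:

```cpp
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "wait timed out" ) {}
};

struct Monitor // assumed monitor as sketched above; throws on timeout
{
    void waitEQ( bool value, unsigned timeoutMS );
};

class ConnectionSetThread // hypothetical worker for up to 63 connections
{
public:
    void signalMainThread()
    {
        while( true )
        {
            try
            {
                _eventProcessed.waitEQ( true, _timeoutMS );
                return; // the main thread has handled the event
            }
            catch( const TimeoutError& )
            {
                // The main thread may just be busy; this wait does not
                // block the application, so consume the exception and
                // retry indefinitely.
            }
        }
    }

private:
    Monitor  _eventProcessed;
    unsigned _timeoutMS;
};
```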
The session waits during mapping for the initial data. The exception is passed through and handled by mapObject.
Roll back the connection in progress. Rethrow the exception.
Local operation, can't fail. The remote node will detect the closed connection.
Catch the exception, disconnect the node and rethrow it.
A number of operations are blocking during ```mapObject```, i.e., the object master query and connect as well as the map request. Write a timeout-aware master query when refactoring. Catch the timeout exception from the waitRequest on map, handle it by detaching the object if needed, and adapt the command handlers to handle commands for timed-out or cancelled map requests.
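A sketch of the intended mapObject error path, with hypothetical Session and Object helpers; the names are illustrative, not the actual eq::net API:

```cpp
#include <stdint.h>
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "map timed out" ) {}
};

struct Object { bool isAttached() const; }; // hypothetical

struct RequestHandler { void* waitRequest( uint32_t id ); }; // may throw

class Session // hypothetical sketch, not the actual session code
{
public:
    void mapObject( Object* object, const uint32_t id )
    {
        const uint32_t requestID = _sendMapRequest( object, id );
        try
        {
            _requestHandler.waitRequest( requestID ); // blocks on the reply
        }
        catch( const TimeoutError& )
        {
            // The master did not answer in time: undo the partial map so
            // the command handlers can drop late replies for this request.
            if( object->isAttached( ))
                _detachObject( object );
            throw;
        }
    }

private:
    uint32_t _sendMapRequest( Object* object, uint32_t id );
    void     _detachObject( Object* object );

    RequestHandler _requestHandler;
};
```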
Commit is a local operation and can't fail.
The thread main loops will catch all otherwise uncaught exceptions and output a warning for each. This allows the application to catch and act on the exception by overriding the appropriate task methods, while providing a sensible default behaviour.
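A sketch of this default behaviour in a thread main loop, with illustrative names:

```cpp
#include <exception>
#include <iostream>

// Hypothetical thread main loop; illustrative only.
class PipeThread
{
public:
    void run()
    {
        while( _running )
        {
            try
            {
                _processCommand(); // task methods may throw, e.g. on timeout
            }
            catch( const std::exception& e )
            {
                // Default behaviour: warn and continue. Applications can
                // act on the exception earlier by overriding the task
                // methods and catching it there.
                std::cerr << "Warning: exception in pipe thread: "
                          << e.what() << std::endl;
            }
        }
    }

private:
    void _processCommand();
    bool _running;
};
```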
Equalizer should not need to find and disconnect dead nodes, since Collage shall perform all necessary actions.
One main issue is timeout accumulation. Multiple blocking operations served by a single node each have a separate timeout, which causes the application to block multiple times before the rendering becomes interactive again.
The FrameData ready Lock will be replaced by a TimedLock. Both the monitor used for waiting on a group of frames and the lock used to wait on a single frame will let the timeout exception through to the caller. The compositor will catch these exceptions, ignore the failed images and rethrow the exception.
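A sketch of the assembly loop under this policy, with hypothetical Frame and TimeoutError types: failed frames are skipped, and a timeout is rethrown at the end so the caller still learns of the failure:

```cpp
#include <vector>
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "frame timed out" ) {}
};

struct Frame // hypothetical input frame
{
    void waitReady( unsigned timeoutMS ); // TimedLock-based; may throw
};

void assembleFrames( const std::vector< Frame* >& frames,
                     const unsigned timeoutMS )
{
    bool hasTimeout = false;
    for( size_t i = 0; i < frames.size(); ++i )
    {
        try
        {
            frames[i]->waitReady( timeoutMS ); // may throw TimeoutError
        }
        catch( const TimeoutError& )
        {
            hasTimeout = true; // ignore the failed images...
            continue;
        }
        // ...assemble the ready frame's images here...
    }
    if( hasTimeout )
        throw TimeoutError(); // ...but still report the failure upstream
}
```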
Do not handle the exception; let it be caught by the thread main loop.
These are local operations which do not fail.
Server::chooseConfig, releaseConfig
Config::init, exit, update, finishFrame
These functions catch the timeout exception from their waitRequest, clean up pending data and rethrow the exception.
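A sketch of the common pattern in these functions, using Config::finishFrame with hypothetical helpers; the cleanup happens before the exception is rethrown to the application:

```cpp
#include <stdint.h>
#include <stdexcept>

struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "request timed out" ) {}
};

struct RequestHandler { void* waitRequest( uint32_t id ); }; // may throw

class Config // hypothetical sketch, not the actual eq::Config
{
public:
    uint32_t finishFrame()
    {
        const uint32_t requestID = _sendFinishFrameRequest();
        try
        {
            _requestHandler.waitRequest( requestID );
        }
        catch( const TimeoutError& )
        {
            _flushPendingFrames(); // clean up pending data...
            throw;                 // ...and rethrow to the application
        }
        return _finishedFrame;
    }

private:
    uint32_t _sendFinishFrameRequest();
    void     _flushPendingFrames();

    RequestHandler _requestHandler;
    uint32_t       _finishedFrame;
};
```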
The server::Node needs to check its co::Node state at the beginning of each operation. If the network node has failed, it is set to the failed state. All operations should already perform the appropriate actions based on the init reliability feature.
Failure of the application node has to cause a Config::exit and a release of the configuration for further application runs.
TBD
Multiple operations served by the same failed node will have to time out independently. In the worst typical use case this multiplication is ```2 * numGPUS * latency``` (one input frame and one swap barrier per GPU, latency render frames queued). For example, with four GPUs and a latency of two frames, up to 16 operations may time out one after another.
Neither the swap barrier nor the input frames know which node will provide the necessary data to finish the operation.
Hardware swap barrier timeout support is not part of this feature. Node failures in a hardware sync group may cause deadlocks (to be tested).
Example stack traces of blocking operations, as observed in eqPly:

```
...
4 in eq::Pipe::waitFrameFinished (this=0x2805c00, frameNumber=520) at client/pipe.cpp:453
5 in eq::Node::_finishFrame (this=0x36023a0, frameNumber=520) at client/node.cpp:232
6 in eq::Node::_cmdFrameFinish (this=0x36023a0, command=@0x3624800) at client/node.cpp:559
..
13 in eq::fabric::Client::processCommand (this=0x2801000) at fabric/client.cpp:100
14 in eq::Config::finishFrame (this=0x2006000) at client/config.cpp:286
..
16 in main (argc=5, argv=0xbffff11c) at /Users/eile/Software/eq-git/src/examples/eqPly/main.cpp:90
```

```
..
3 in eq::base::Monitor<unsigned int>::waitGE (this=0xb0490934, value=@0xb0490950) at monitor.h:348
4 in eq::Compositor::assembleFramesUnsorted (frames=@0x186ab1c, channel=0x186a800, accum=0x3206b80) at client/compositor.cpp:423
5 in eq::Compositor::assembleFrames (frames=@0x186ab1c, channel=0x186a800, accum=0x3206b80) at client/compositor.cpp:217
6 in eqPly::Channel::frameAssemble (this=0x186a800, frameID=521) at /Users/eile/Software/eq-git/src/examples/eqPly/channel.cpp:228
..
13 in eq::Pipe::PipeThread::run (this=0x3203df0) at pipe.h:428
```