eile edited this page Oct 7, 2011 · 12 revisions

Keep-Alive

Author: Lucas Peetz Dulley

State: Design

Overview

The goal of this feature is to make Collage LocalNodes aware of the status (lastAliveTime) of their connected remote nodes.

The current timeout value is now accompanied by a keep-alive setting. Every keep-alive ms, blocking operations check whether the peer node is still alive; if it is not, they throw the timeout exception early. The timeout value determines the maximum wait time, even if the remote peer is still communicating (this catches deadlocks).

Requirements

CO

  • Keep-alive feature is always-on.
  • Whenever the receiver thread receives data (EVENT_DATA) from a remote node, the lastAliveTime for that node is updated.
  • eile: Every keepalive ms, a node sends a dummy keepalive packet to each peer to which it has not sent data.
    • eile: TBD: internal or external keepalive default value?
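
The two Collage requirements above can be sketched as simple per-peer bookkeeping. This is only an illustration under stated assumptions: `PeerState`, `onDataReceived`, `onDataSent` and `needsKeepAlive` are hypothetical names, not part of the Collage API, and times are plain millisecond counters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-peer bookkeeping; names are illustrative, not the Collage API.
struct PeerState
{
    int64_t lastReceive = 0; // last time data arrived from the peer (ms)
    int64_t lastSend = 0;    // last time anything was sent to the peer (ms)
};

// Called from the receiver thread whenever data arrives from the peer.
inline void onDataReceived(PeerState& peer, int64_t now)
{
    peer.lastReceive = now;
}

// Called whenever any packet (payload or keepalive) is sent to the peer.
inline void onDataSent(PeerState& peer, int64_t now)
{
    peer.lastSend = now;
}

// True if nothing was sent to the peer for at least keepAlive ms,
// i.e. a dummy keepalive packet is due.
inline bool needsKeepAlive(const PeerState& peer, int64_t now, int64_t keepAlive)
{
    assert(keepAlive > 0);
    return now - peer.lastSend >= keepAlive;
}
```

In this sketch, receiving data does not reset the send timer: a node that only listens still owes its peers a keepalive packet.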

EQ

  • eile: Equalizer times out blocking operations after keepalive ms (or the remaining timeout, if smaller), sends a ping, blocks again for up to keepalive ms, and then checks the peer node status.
  • E.g.: EqPly can open heavy .ply files (e.g., lucy.ply) without failing due to frame rendering timeouts while the model is being distributed.
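
The wait/ping/wait/check cycle described above could look roughly like the following. It is a minimal sketch, not the Equalizer implementation: `WaitHooks`, `WaitResult` and `waitWithKeepAlive` are hypothetical, and the blocking call, clock, peer status and ping are injected as callbacks so the loop can be exercised without a network.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical outcome of a blocking wait; not an Equalizer type.
enum class WaitResult { Done, PeerDead, TimedOut };

// Hypothetical hooks standing in for the real blocking call, clock and peer.
struct WaitHooks
{
    std::function<bool(int64_t)> waitFor;  // block up to N ms; true if the operation finished
    std::function<int64_t()> now;          // current time in ms
    std::function<int64_t()> lastReceive;  // last time data arrived from the peer, in ms
    std::function<void()> ping;            // request a keepalive from the peer
};

WaitResult waitWithKeepAlive(const WaitHooks& hooks, int64_t timeout, int64_t keepAlive)
{
    assert(keepAlive > 0 && timeout >= 0);
    const int64_t start = hooks.now();
    for (;;)
    {
        int64_t remaining = timeout - (hooks.now() - start);
        if (remaining <= 0)
            return WaitResult::TimedOut;
        // First slice: keepalive ms, or the remaining timeout if smaller.
        if (hooks.waitFor(std::min(keepAlive, remaining)))
            return WaitResult::Done;

        remaining = timeout - (hooks.now() - start);
        if (remaining <= 0)
            return WaitResult::TimedOut;
        // Ask the peer for a sign of life, then block again.
        const int64_t pingTime = hooks.now();
        hooks.ping();
        if (hooks.waitFor(std::min(keepAlive, remaining)))
            return WaitResult::Done;
        // Nothing received since the ping: give up early.
        if (hooks.lastReceive() < pingTime)
            return WaitResult::PeerDead;
    }
}
```

In this sketch both `PeerDead` and `TimedOut` would map to the timeout exception; they are kept apart only to show which of the two conditions ended the wait. A peer that keeps sending data (as during model distribution) keeps the loop alive until the full timeout has elapsed.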

API

co::Node:

/** The node is responding (is alive) */
int64_t _lastReceive; //!< last time packets were received
int64_t co::Node::getLastReceiveTime() const; // access to the node's last alive time

co::LocalNode:

bool co::LocalNode::_ping( NodePtr remoteNode ); // requests keepalive from remote node
// it just sends the packets when we need them to be sent.

/** process ping request. called from receiver thread (not queued in command queue) */
// updates lastAliveTime in receiver thread for the node which sent the packet
// sends a ping reply packet to "local" node
bool co::LocalNode::_cmdPing( Command& command );

/** process ping reply response. called from receiver thread (not queued in command queue) */
// eile: not needed, use _cmdDiscard and no queue. _lastReceiveTime was already updated by _handleData
// updates lastAliveTime for the node which replied to the packet
// remoteNode->_lastAliveTime = getTime64();
bool co::LocalNode::_cmdPingReply( Command& command ); 

Ping Packets:

/** NEW: node ping packet */
co::NodePingPacket
    // eile: no need: uint64_t transmitTime;
/** NEW: node ping reply packet */
co::NodePingReplyPacket( const NodePingPacket* request): transmitTime( request->transmitTime );
    // eile: ditto: uint64_t transmitTime;

eq::

Config::setTimeout( int64_t timeout = EQ_TIMEOUT_DEFAULT ) // where should it be set? eile: same place as today?

Usage

  • From the application's point of view, only the timeout and, optionally, the keepalive time need to be set (so the application sets the timeout to a large value). This defines the number of retries in the critical EQ timeouts (e.g., finishFrame) as timeout / keepalive / 2.
  • A retry is only attempted if the node is considered alive. The number of retries ranges from zero (Timeout < EQ_TIMEOUT_KEEPALIVE) to infinite (Timeout == EQ_TIMEOUT_INDEFINITE). If the node is dead (considered so after X retries), the exception is thrown in "Timeout/EQ_TIMEOUT_KEEPALIVE" ms.
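
The retry arithmetic above can be illustrated as follows. This sketch assumes that "timeout / keepalive / 2" is read left to right with integer division, which is one possible interpretation. `maxRetries`, `kTimeoutIndefinite` and `kUnlimitedRetries` are hypothetical names; in particular, `kTimeoutIndefinite` is only a placeholder for EQ_TIMEOUT_INDEFINITE, whose actual value is not given here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Placeholder standing in for EQ_TIMEOUT_INDEFINITE (actual value not given in this document).
constexpr int64_t kTimeoutIndefinite = -1;
// Returned when the number of retries is unbounded.
constexpr int64_t kUnlimitedRetries = std::numeric_limits<int64_t>::max();

// Number of retries for a given timeout and keepalive interval, both in ms.
// Assumption: "timeout / keepalive / 2" evaluated left to right with integer division.
int64_t maxRetries(int64_t timeout, int64_t keepAlive)
{
    assert(keepAlive > 0);
    if (timeout == kTimeoutIndefinite)
        return kUnlimitedRetries;
    assert(timeout >= 0);
    return timeout / keepAlive / 2;
}
```

Under this reading, a timeout of 60000 ms with a keepalive of 2000 ms allows 15 retries, and any timeout smaller than the keepalive interval allows none.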

File Format

No changes.

Restrictions

The Collage-based keep-alive signal takes no action based on whether a remote node is responsive. It only gathers and provides information about the alive state of remote nodes from the local node's perspective.

  • EQ acts on the keep-alive information by agreeing to retry a timed-out operation if the node seems to be alive.

Issues

Q: How should the timeouts be named to make them less confusing?

eile: see proposal above.