CellTunnel unrecoverably terminates #7707

Open
kofemann opened this issue Dec 2, 2024 · 0 comments
Labels: bug, messaging (Concerning dCache-internal messaging)

kofemann commented Dec 2, 2024

Connecting to the pool cell from the admin shell is not possible:

admin > \c dcache-xfel357-06
(1) Cell does not exist.

The affected pool can still send heartbeats, so it is shown as enabled in the pool manager:

admin > \sp psu ls pool dcache-xfel357-06 -l
dcache-xfel357-06  (enabled=true;active=23;rdOnly=false;links=0;pgroups=1;hsm=[cta, osm, osm2];mode=enabled)
 linkList   :

The System cell is accessible:

admin > \sp psu ls pool dcache-xfel357-06 -l
dcache-xfel357-06  (enabled=true;active=23;rdOnly=false;links=0;pgroups=1;hsm=[cta, osm, osm2];mode=enabled)
 linkList   :

In the System cell of the affected domain, only two core connections are active. The third one is in the Terminated state:

admin > ps -f
Name                                                       State Queue Q-time/ms Threads Class                    Additional info
c-dcache-head-xfel03_messageDomain-AAYn4k7pJ7g               A       0         0       1 LocationManagerConnector Connected                                                
lm                                                           A       0         0       0 LocationManager                                                                   
dcache-xfel357-06                                            A       0         0     104 UniversalSpringCell                                                               
c-dcache-head-xfel03_messageDomain-AAYn4k7pJ7g-AAYn4k7pqKA   A       0         0       2 LocationMgrTunnel        Connected to dcache-head-xfel03_messageDomain            
c-dcache-head-xfel02_messageDomain-AAYn4k7oxhA               A       0         0       0 LocationManagerConnector Terminated                                               
c-dcache-head-xfel01_messageDomain-AAYn4k7oXJg-AAYn4k7pSuA   A       0         0       2 LocationMgrTunnel        Connected to dcache-head-xfel01_messageDomain            
RoutingMgr                                                   A       0         0       1 CoreRoutingManager                                                                
c-dcache-head-xfel01_messageDomain-AAYn4k7oXJg               A       0         0       1 LocationManagerConnector Connected                                                
System                                                       A       0         0       3 SystemCell               dcache-xfel357-06Domain:IOrec=5286;IOexc=0;MEM=1341588128

The issue is related to the tunnel reconnect logic:

    @Override
    public void run() {
        /* Thread for creating the tunnel. There is a grace period of
         * 4 to 20 seconds between connection attempts. The thread is
         * terminated by interrupting it.
         */
        Args args = getArgs();
        String name = getCellName() + '*';
        Random random = new Random();
        NDC.push(_address.toString());
        try {
            while (_isRunning) {
                try {
                    _retries++;

                    LocationMgrTunnel tunnel = new LocationMgrTunnel(name, connect(), args);
                    try {
                        tunnel.start().get();
                        _retries = 0;
                        _status = "Connected";
                        getNucleus().join(tunnel.getCellName());
                    } finally {
                        getNucleus().kill(tunnel.getCellName());
                    }
                } catch (InterruptedIOException | ClosedByInterruptException e) {
                    throw e;
                } catch (ExecutionException | IOException e) {
                    String error = Exceptions.meaningfulMessage(Throwables.getRootCause(e));
                    _log.warn(AlarmMarkerFactory.getMarker(PredefinedAlarm.LOCATION_MANAGER_FAILURE,
                                name,
                                _domain,
                                    error),
                          "Failed to connect to " + _domain + ": " + e);
                }

                _status = "Sleeping";
                long sleep = random.nextInt(16000) + 4000;
                _log.warn("Sleeping {} seconds", sleep / 1000);
                Thread.sleep(sleep);
            }
        } catch (InterruptedIOException | InterruptedException | ClosedByInterruptException ignored) {
        } finally {
            NDC.pop();
            _status = "Terminated";
        }
    }

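InterruptedIOException, InterruptedException and ClosedByInterruptException are caught only outside the while loop and then ignored. If one of them is thrown while _isRunning is still true, the loop exits, the status becomes Terminated, and no further reconnect is attempted. This can be the source of the situation above.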

This is probably the main cause of #5326.
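
The failure mode can be reproduced in isolation. The following is a minimal sketch, not dCache code: `StrayInterruptDemo`, `Connector` and `statusAfterStrayInterrupt` are hypothetical names, and a short `Thread.sleep` stands in for the connect attempt and the back-off. It only assumes the same shape as the quoted `run()` method, with the interrupt caught outside the loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for the connector thread; not dCache code.
class StrayInterruptDemo {

    static class Connector implements Runnable {
        volatile boolean isRunning = true;
        volatile String status = "Starting";
        final AtomicInteger attempts = new AtomicInteger();

        @Override
        public void run() {
            try {
                while (isRunning) {
                    attempts.incrementAndGet();
                    status = "Sleeping";
                    Thread.sleep(50); // stands in for connect() plus the back-off sleep
                }
            } catch (InterruptedException ignored) {
                // Same shape as the quoted code: the interrupt is swallowed
                // outside the loop, so the loop cannot resume.
            } finally {
                status = "Terminated";
            }
        }
    }

    /** Interrupts a running connector once, without clearing isRunning. */
    static String statusAfterStrayInterrupt() {
        Connector connector = new Connector();
        Thread worker = new Thread(connector);
        worker.start();
        // Wait until the loop has been entered at least once.
        while (connector.attempts.get() == 0) {
            Thread.yield();
        }
        worker.interrupt(); // stray interrupt: nobody set isRunning to false
        try {
            worker.join(2000);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return connector.status + ",isRunning=" + connector.isRunning;
    }

    public static void main(String[] args) {
        // prints "Terminated,isRunning=true"
        System.out.println(statusAfterStrayInterrupt());
    }
}
```

The connector ends up Terminated although `isRunning` is still true, which matches the `ps -f` output above where one LocationManagerConnector reports Terminated while the domain keeps running.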

kofemann added the bug and messaging labels Dec 2, 2024
kofemann added a commit that referenced this issue Dec 6, 2024
Motivation:
As long as the cell tunnel is not explicitly stopped by calling
dmg.cells.network.LocationManagerConnector#stopped, other interrupts
should be ignored.

Modification:
Update the retry logic to never give up unless the _isRunning flag is
set to false.

Result:
More robust tunnels in case of network issues.

Issue: #7707, #5326
Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
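
The retry policy described in this commit message could look roughly like the sketch below. It is not the actual patch: `ResilientRetryLoop`, `stop` and `demo` are hypothetical names, and `Thread.sleep` again stands in for the connect attempt and back-off. The assumption is that interrupts are handled inside the loop, so only the `isRunning` flag decides whether another attempt is made.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the described retry policy; not the actual dCache patch.
class ResilientRetryLoop implements Runnable {
    private volatile boolean isRunning = true;
    private volatile String status = "Starting";
    private final AtomicInteger attempts = new AtomicInteger();

    @Override
    public void run() {
        try {
            while (isRunning) {
                try {
                    attempts.incrementAndGet();
                    status = "Sleeping";
                    Thread.sleep(20); // stands in for connect() plus back-off
                } catch (InterruptedException e) {
                    // Handled inside the loop: only the isRunning check above
                    // decides whether to retry. Thread.interrupted() clears a
                    // flag that exceptions such as ClosedByInterruptException
                    // would leave set; after InterruptedException it is a no-op.
                    Thread.interrupted();
                }
            }
        } finally {
            status = "Terminated";
        }
    }

    /** Explicit stop: clear the flag first, then wake the worker. */
    void stop(Thread worker) {
        isRunning = false;
        worker.interrupt();
    }

    /** Sends a stray interrupt, checks that the loop keeps retrying, then stops it. */
    static String demo() {
        ResilientRetryLoop loop = new ResilientRetryLoop();
        Thread worker = new Thread(loop);
        worker.start();
        try {
            worker.interrupt(); // stray interrupt while isRunning is still true
            long deadline = System.currentTimeMillis() + 2000;
            while (loop.attempts.get() < 3 && System.currentTimeMillis() < deadline) {
                Thread.sleep(5);
            }
            boolean survived = worker.isAlive() && loop.attempts.get() >= 3;
            loop.stop(worker);
            worker.join(2000);
            return "survived=" + survived + ",final=" + loop.status;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "survived=true,final=Terminated"
        System.out.println(demo());
    }
}
```

Setting the flag before interrupting matters in `stop`: the worker may wake up and re-check `isRunning` immediately, and it must already see `false` at that point.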
kofemann added a commit that referenced this issue Dec 6, 2024
Motivation:
As long as the cell tunnel is not explicitly stopped by calling
dmg.cells.network.LocationManagerConnector#stopped, other interrupts
should be ignored.

Modification:
Update the retry logic to never give up unless the _isRunning flag is
set to false.

Result:
More robust tunnels in case of network issues.

Issue: #7707, #5326
Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit 600ed1f)
Signed-off-by: Tigran Mkrtchyan <[email protected]>
khys95 pushed a commit to khys95/dcache that referenced this issue Dec 6, 2024
Motivation:
As long as the cell tunnel is not explicitly stopped by calling
dmg.cells.network.LocationManagerConnector#stopped, other interrupts
should be ignored.

Modification:
Update the retry logic to never give up unless the _isRunning flag is
set to false.

Result:
More robust tunnels in case of network issues.

Issue: dCache#7707, dCache#5326
Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
khys95 pushed a commit that referenced this issue Dec 12, 2024
Motivation:
As long as the cell tunnel is not explicitly stopped by calling
dmg.cells.network.LocationManagerConnector#stopped, other interrupts
should be ignored.

Modification:
Update the retry logic to never give up unless the _isRunning flag is
set to false.

Result:
More robust tunnels in case of network issues.

Issue: #7707, #5326
Acked-by: Dmitry Litvintsev
Target: master, 10.2, 10.1, 10.0, 9.2
Require-book: no
Require-notes: yes
(cherry picked from commit 600ed1f)
Signed-off-by: Tigran Mkrtchyan <[email protected]>