GlusterFS Recovery Issue After Power Outage #4385

Open
techtronix-868 opened this issue Jun 26, 2024 · 5 comments

@techtronix-868

Description of problem:
I have configured a GlusterFS setup with three storage nodes in a replica configuration. Recently, I observed unexpected behavior when two of the nodes were power cycled. After the power cycle, I noticed that the .glusterfs directory and other files under the volume mount point were missing. Additionally, the GlusterFS brick did not come up as expected, which was evident from the logs in bricks/datastore3.log.

[2024-06-26 14:17:39.816328 +0000] W [MSGID: 106061] [glusterd-handler.c:3488:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
[2024-06-26 14:17:39.819094 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2024-06-26 14:17:42.052066 +0000] I [MSGID: 106493] [glusterd-rpc-ops.c:474:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 17c50526-c5bb-47bf-a547-d844a445eac6, host: storage01.g01.internal.net, port: 0
[2024-06-26 14:17:42.052726 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}, {errno=2}, {error=No such file or directory}]
[2024-06-26 14:17:42.052822 +0000] E [MSGID: 106061] [glusterd-server-quorum.c:260:glusterd_is_volume_in_server_quorum] 0-management: Dict get failed [{Key=cluster.server-quorum-type}]
[2024-06-26 14:17:42.052894 +0000] I [glusterd-utils.c:6970:glusterd_brick_start] 0-management: starting a fresh brick process for brick /datastore3
[2024-06-26 14:17:42.054690 +0000] I [rpc-clnt.c:1010:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2024-06-26 14:17:42.057757 +0000] I [rpc-clnt.c:1010:rpc_clnt_connection_init] 0-quotad: setting frame-timeout to 600
[2024-06-26 14:17:42.057892 +0000] I [MSGID: 106131] [glusterd-proc-mgmt.c:86:glusterd_proc_stop] 0-management: quotad already stopped
[2024-06-26 14:17:42.057910 +0000] I [MSGID: 106568] [glusterd-svc-mgmt.c:272:glusterd_svc_stop] 0-management: quotad service is stopped

The exact command to reproduce the issue:

  1. Set up a GlusterFS volume with three storage nodes in a replica configuration.
  2. Power cycle two of the storage nodes.
  3. Check the volume mount point for the presence of the .glusterfs directory and other files.
  4. Check the logs at bricks/datastore3.log for any errors or failure messages (see the sketch below).
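
A minimal sketch of the checks in steps 3 and 4 (brick path and log location taken from this report; run on each node):

# Step 3: the .glusterfs metadata directory should exist under the brick path
ls -a /datastore3
# Step 4: look for translator init failures in the brick log
sudo tail -n 100 /var/log/glusterfs/bricks/datastore3.log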

The full output of the command that failed:

Expected results:
It is expected that on the power-cycled nodes the bricks come back up and the mountpoint is accessible again.

Mandatory info:
- The output of the gluster volume info command:

Volume Name: internaldatastore3
Type: Replicate
Volume ID: f98b1e4a-5e6f-4075-9339-c81dcab84868
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage01.g01.internal.net:/datastore3
Brick2: storage02.g01.internal.net:/datastore3
Brick3: storage03.g01.internal.net:/datastore3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet6
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on

- The output of the gluster volume status command:

Gluster process                                  TCP Port  RDMA Port  Online  Pid
----------------------------------------------------------------------------------
Brick storage01.g01.internal.net:/datastore3     49152     0          Y       76073
Brick storage02.g01.internal.net:/datastore3     N/A       N/A        N       N/A
Brick storage03.g01.internal.net:/datastore3     N/A       N/A        N       N/A
Self-heal Daemon on localhost                    N/A       N/A        Y       3992
Self-heal Daemon on storage01.g01.internal.net   N/A       N/A        Y       76090
Self-heal Daemon on storage02.g01.internal.net   N/A       N/A        Y       2489

Task Status of Volume internaldatastore3

There are no active volume tasks

- The output of the gluster volume heal command:

Launching heal operation to perform index self heal on volume internaldatastore3 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
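
Per the error message, a hedged next step is to inspect the self-heal daemon log (standard log location assumed):

# glustershd log referenced by the failure above
sudo tail -n 100 /var/log/glusterfs/glustershd.log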

- Provide logs present on following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:
bricks/datastore3.log

[glusterfsd.c:1429:cleanup_and_exit] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x823) [0x55dd8c8d5423] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x243) [0x55dd8c8ce223] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x58) [0x55dd8c8c9a48] ) 0-: received signum (-1), shutting down
[2024-06-26 14:17:42.057661 +0000] I [MSGID: 100030] [glusterfsd.c:2683:main] 0-/usr/sbin/glusterfsd: Started running version [{arg=/usr/sbin/glusterfsd}, {version=9.4}, {cmdlinestr=/usr/sbin/glusterfsd -s storage03.g01.internal.net --volfile-id internaldatastore3.storage03.g01.internal.net.datastore3 -p /var/run/gluster/vols/internaldatastore3/storage03.g01.internal.net-datastore3.pid -S /var/run/gluster/360c1523341b2a4f.socket --brick-name /datastore3 -l /var/log/glusterfs/bricks/datastore3.log --xlator-option *-posix.glusterd-uuid=2cc95c8f-f83f-4827-a3d0-84891cba2dc7 --process-name brick --brick-port 49152 --xlator-option internaldatastore3datastore3-server.listen-port=49152 --xlator-option transport.address-family=inet6}]
[2024-06-26 14:17:42.058206 +0000] I [glusterfsd.c:2418:daemonize] 0-glusterfs: Pid of current running process is 3981
[2024-06-26 14:17:42.061046 +0000] I [socket.c:929:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2024-06-26 14:17:42.063863 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2024-06-26 14:17:42.063982 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2024-06-26 14:17:43.065553 +0000] I [glusterfsd-mgmt.c:2171:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: storage01.g01.internal.net:24007 storage02.g01.internal.net:24007
[2024-06-26 14:17:43.081985 +0000] I [rpcsvc.c:2701:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2024-06-26 14:17:43.082524 +0000] I [io-stats.c:3708:ios_sample_buf_size_configure] 0-/datastore3: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-06-26 14:17:43.082605 +0000] E [MSGID: 138001] [index.c:2429:init] 0-internaldatastore3-index: Failed to find parent dir (/datastore3/.glusterfs) of index basepath /datastore3/.glusterfs/indices. [No such file or directory]
[2024-06-26 14:17:43.082624 +0000] E [MSGID: 101019] [xlator.c:643:xlator_init] 0-internaldatastore3-index: Initialization of volume failed. review your volfile again. [{name=internaldatastore3-index}]
[2024-06-26 14:17:43.082631 +0000] E [MSGID: 101066] [graph.c:425:glusterfs_graph_init] 0-internaldatastore3-index: initializing translator failed
[2024-06-26 14:17:43.082637 +0000] E [MSGID: 101176] [graph.c:777:glusterfs_graph_activate] 0-graph: init failed
[2024-06-26 14:17:43.082678 +0000] I [io-stats.c:4038:fini] 0-/datastore3: io-stats translator unloaded
[2024-06-26 14:17:43.083206 +0000] W [glusterfsd.c:1429:cleanup_and_exit] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x823) [0x5561546e0423] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x243) [0x5561546d9223] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x58) [0x5561546d4a48] ) 0-: received signum (-1), shutting down
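
A minimal check matching the index translator error above (brick path from this report):

# init failed because /datastore3/.glusterfs was missing; verify whether it exists
ls -ld /datastore3/.glusterfs /datastore3/.glusterfs/indices
# and confirm the brick filesystem is actually mounted at that path
findmnt /datastore3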

Additional info:

- The operating system / glusterfs version: glusterfs 9.4
OS Release: AlmaLinux 8.6


@babipanghang

Did you check the filesystem hosting your brick is actually in good condition and mounted?
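
For example (a hedged sketch; the device name is an assumption based on the later comment mentioning /dev/sda):

# Is the brick filesystem mounted, and on which device?
mount | grep datastore3
# Read-only XFS consistency check (run only while the filesystem is unmounted)
sudo xfs_repair -n /dev/sda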

@techtronix-868
Author

The filesystem is in a good state and mounted; based on that, I figured the data under the mountpoint is lost.

@aravindavk
Member

Looks like the bricks were not mounted when the volume started (after reboot). If the backend brick paths are mounted, please try gluster volume start internaldatastore3 force. GlusterFS will not delete the .glusterfs directory even after a volume delete, so this is most likely a brick mount issue. Please check df /datastore3 or mount | grep datastore3 on each node.
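
Collected into runnable form, the checks suggested above (run on each node; volume and brick names from this report):

# Is the brick path backed by a mounted filesystem?
df /datastore3
mount | grep datastore3
# If the bricks are mounted but their processes are down, force-start the volume
gluster volume start internaldatastore3 force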

@techtronix-868
Author

/dev/sda is mounted on the same mount point. I performed 12 iterations on my rented Dell bare-metal server, and this only happens when Gluster is not able to exit gracefully. Does Gluster have a write cache that gets written to the mounted points? Are there transactions that can be used to ensure the data has been written to disk?

@anon314159

anon314159 commented Jul 9, 2024

GlusterFS has several performance translators (e.g. performance.write-behind) that could cause files not yet written to the underlying brick to be lost during a power-loss event. I would be more concerned with the underlying storage subsystem's I/O mode: do your systems use RAID, and does the controller/HBA have a battery backup? If so, are the arrays configured as write-back or write-through? Another thing to watch for is that journaled file systems like XFS may not mount properly, or in a timely manner, after a sudden or unexpected shutdown/reboot; this could prevent glusterfsd from attaching to the affected storage device. Is this problem isolated to a single host's bricks, or is it sporadic (i.e. random bricks in the volume fail to start after an unexpected shutdown/reboot)? Everything points to the underlying storage configuration as the culprit, with Gluster's inability to start properly merely a consequence.
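
For reference, a hedged sketch of inspecting and disabling the write-behind translator mentioned above (volume name from this report; disabling trades write throughput for safety):

# Show the current write-behind setting for this volume
gluster volume get internaldatastore3 performance.write-behind
# Turn it off so writes are not cached client-side before reaching the bricks
gluster volume set internaldatastore3 performance.write-behind off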
