GlusterFS Recovery Issue After Power Outage #4385
Comments
Did you check that the filesystem hosting your brick is actually in good condition and mounted?
The filesystem is in a good state and mounted; based on that, I figured the data under the mount point is lost.
Looks like the bricks were not mounted while starting the volume (after the reboot). If the backend brick paths are mounted, please try
/dev/sda is mounted at the same point. I performed 12 iterations on my rented Dell bare-metal server. This only happens when Gluster is not able to exit gracefully. Does Gluster have a write cache that gets written to the mount points? Are there transactions that can be used to ensure that the data has been written to disk?
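(Not part of the original exchange; a minimal sketch of how the mount could be verified, assuming the brick path /datastore3 and the /dev/sda device mentioned in this thread:)

findmnt /datastore3    # confirm something is actually mounted at the brick path
lsblk -f /dev/sda      # show the filesystem type, UUID and mount point on the device
ls -a /datastore3      # a previously initialized brick root should contain a .glusterfs directory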
GlusterFS has several performance translators (e.g. performance.write-behind) that could cause files not yet written to the underlying brick to be lost during a power-loss event. I would be more concerned with the underlying storage subsystem's I/O mode: do your systems use RAID, and does the controller/HBA have a battery backup? If yes, is it configured for write-back or write-through? Another thing to look out for is that journaled file systems like XFS may not mount properly, or in a timely manner, after a sudden or unexpected shutdown/reboot. This could prevent GlusterFSD from attaching to the affected storage device. Is this problem isolated to a single host's bricks, or is it sporadic (i.e. random bricks in the volume fail to start after an unexpected shutdown/reboot)? Everything points to the underlying storage configuration as the culprit, and Gluster's inability to start properly is merely a consequence.
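(As an illustration of the caching layers mentioned above, a hedged sketch, assuming the volume name internaldatastore3 and the sda device from this thread; the sysfs attribute requires a reasonably recent kernel:)

gluster volume get internaldatastore3 performance.write-behind       # show the current write-behind setting
gluster volume set internaldatastore3 performance.write-behind off   # trade throughput for stricter write-out to the brick
gluster volume set internaldatastore3 performance.flush-behind off
cat /sys/block/sda/queue/write_cache                                  # reports "write back" or "write through" for the block device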
Description of problem:
I have configured a GlusterFS setup with three storage nodes in a replica configuration. Recently, I observed unexpected behavior when two of the nodes were power cycled. After the power cycle, I noticed that the .glusterfs directory and other files under the volume mount point were missing. Additionally, the GlusterFS brick did not come up as expected, which was evident from the logs in bricks/datastore3.log.
The exact command to reproduce the issue:
The full output of the command that failed:
Expected results:
It is expected that after the power-cycled nodes come back up, the bricks start and the mount point is accessible again.
Mandatory info:
- The output of the gluster volume info command:
Volume Name: internaldatastore3
Type: Replicate
Volume ID: f98b1e4a-5e6f-4075-9339-c81dcab84868
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: storage01.g01.internal.net:/datastore3
Brick2: storage02.g01.internal.net:/datastore3
Brick3: storage03.g01.internal.net:/datastore3
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet6
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: on
storage.owner-uid: 36
storage.owner-gid: 36
server.allow-insecure: on
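(A side note for the mount discussion above, not from the original report: the bricks are configured directly at /datastore3, i.e. at the root of the mount point, so a quick way to confirm that path is backed by the intended filesystem rather than the root filesystem could be, as a minimal sketch:)

grep datastore3 /etc/fstab   # is the brick filesystem set to mount at boot?
df -h /datastore3            # does the size/usage match the backing device rather than the root filesystem?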
- The output of the gluster volume status command:
Gluster process TCP Port RDMA Port Online Pid
Brick storage01.g01.internal.net:/datastore3 49152 0 Y 76073
Brick storage02.g01.internal.net:/datastore3 N/A N/A N N/A
Brick storage03.g01.internal.net:/datastore3 N/A N/A N N/A
Self-heal Daemon on localhost N/A N/A Y 3992
Self-heal Daemon on storage01.g01.internal.net N/A N/A Y 76090
Self-heal Daemon on storage02.g01.internal.net N/A N/A Y 2489
Task Status of Volume internaldatastore3
There are no active volume tasks
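(Not part of the original output; once the backend filesystems are confirmed mounted, a common next step is to force-start the volume so the offline bricks are retried, then re-check status. A minimal sketch using the volume name above:)

gluster volume start internaldatastore3 force   # respawn brick processes that failed to start
gluster volume status internaldatastore3        # verify Brick2/Brick3 now report Online = Y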
- The output of the gluster volume heal command:
Launching heal operation to perform index self heal on volume internaldatastore3 has been unsuccessful:
Glusterd Syncop Mgmt brick op 'Heal' failed. Please check glustershd log file for details.
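(A follow-up sketch, not from the original report: once all three bricks are online again, the index heal can be retried and its progress inspected:)

gluster volume heal internaldatastore3        # launch index self-heal again
gluster volume heal internaldatastore3 info   # list entries still pending heal on each brick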
- Provide logs present on following locations of client and server nodes -
/var/log/glusterfs/
- Is there any crash? Provide the backtrace and coredump
bricks/datastore3.log
[glusterfsd.c:1429:cleanup_and_exit] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x823) [0x55dd8c8d5423] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x243) [0x55dd8c8ce223] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x58) [0x55dd8c8c9a48] ) 0-: received signum (-1), shutting down
[2024-06-26 14:17:42.057661 +0000] I [MSGID: 100030] [glusterfsd.c:2683:main] 0-/usr/sbin/glusterfsd: Started running version [{arg=/usr/sbin/glusterfsd}, {version=9.4}, {cmdlinestr=/usr/sbin/glusterfsd -s storage03.g01.internal.net --volfile-id internaldatastore3.storage03.g01.internal.net.datastore3 -p /var/run/gluster/vols/internaldatastore3/storage03.g01.internal.net-datastore3.pid -S /var/run/gluster/360c1523341b2a4f.socket --brick-name /datastore3 -l /var/log/glusterfs/bricks/datastore3.log --xlator-option *-posix.glusterd-uuid=2cc95c8f-f83f-4827-a3d0-84891cba2dc7 --process-name brick --brick-port 49152 --xlator-option internaldatastore3datastore3-server.listen-port=49152 --xlator-option transport.address-family=inet6}]
[2024-06-26 14:17:42.058206 +0000] I [glusterfsd.c:2418:daemonize] 0-glusterfs: Pid of current running process is 3981
[2024-06-26 14:17:42.061046 +0000] I [socket.c:929:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2024-06-26 14:17:42.063863 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2024-06-26 14:17:42.063982 +0000] I [MSGID: 101190] [event-epoll.c:670:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2024-06-26 14:17:43.065553 +0000] I [glusterfsd-mgmt.c:2171:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: storage01.g01.internal.net:24007 storage02.g01.internal.net:24007
[2024-06-26 14:17:43.081985 +0000] I [rpcsvc.c:2701:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2024-06-26 14:17:43.082524 +0000] I [io-stats.c:3708:ios_sample_buf_size_configure] 0-/datastore3: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2024-06-26 14:17:43.082605 +0000] E [MSGID: 138001] [index.c:2429:init] 0-internaldatastore3-index: Failed to find parent dir (/datastore3/.glusterfs) of index basepath /datastore3/.glusterfs/indices. [No such file or directory]
[2024-06-26 14:17:43.082624 +0000] E [MSGID: 101019] [xlator.c:643:xlator_init] 0-internaldatastore3-index: Initialization of volume failed. review your volfile again. [{name=internaldatastore3-index}]
[2024-06-26 14:17:43.082631 +0000] E [MSGID: 101066] [graph.c:425:glusterfs_graph_init] 0-internaldatastore3-index: initializing translator failed
[2024-06-26 14:17:43.082637 +0000] E [MSGID: 101176] [graph.c:777:glusterfs_graph_activate] 0-graph: init failed
[2024-06-26 14:17:43.082678 +0000] I [io-stats.c:4038:fini] 0-/datastore3: io-stats translator unloaded
[2024-06-26 14:17:43.083206 +0000] W [glusterfsd.c:1429:cleanup_and_exit] (-->/usr/sbin/glusterfsd(mgmt_getspec_cbk+0x823) [0x5561546e0423] -->/usr/sbin/glusterfsd(glusterfs_process_volfp+0x243) [0x5561546d9223] -->/usr/sbin/glusterfsd(cleanup_and_exit+0x58) [0x5561546d4a48] ) 0-: received signum (-1), shutting down
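(The index translator fails above because /datastore3/.glusterfs is missing. A hedged way to tell whether the brick data is really gone or the backing filesystem simply is not mounted, as a minimal sketch assuming the brick root /datastore3; run as root, since trusted.* xattrs are hidden from regular users:)

getfattr -n trusted.glusterfs.volume-id -e hex /datastore3   # an initialized brick root carries this xattr
ls -a /datastore3/.glusterfs                                  # and the .glusterfs metadata directory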
Additional info:
- The operating system / glusterfs version: glusterfs 9.4
OS Release: ALMA 8.6
Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration