Re-introduce vmcore creation notification to kdump
Motivation
==========

People may forget to recheck that kdump works after changing the system.
As a result, no vmcore may be generated after a real system crash, which
defeats the purpose of kdump.

It is highly recommended that people test kdump after any system
modification, such as:

a. after kernel patching or a full yum update, as it might break something
   on which kdump depends, for example by introducing a new bug.
b. after any change at the hardware level, such as storage, networking or
   firmware upgrades.
c. after deploying any new application, such as one that involves 3rd party
   modules.

Though these checks exceed the scope of kdump, a simple vmcore creation
status notification is good to have for now.

Design
======

When (re)started, kdump currently checks whether any related
files/fs/drivers have been modified to determine whether the initrd
should be rebuilt. A rebuild is an indicator of such a modification, and
means kdump needs to be tested again. The rebuild clears the vmcore
creation status recorded in $VMCORE_CREATION_STATUS, and as a result a
notification that no vmcore creation test has been performed is output.

To test kdump, use the new "kdumpctl test" entry point. It generates a
timestamp string as the ID of the current test and records it with a
"pending" status in $VMCORE_CREATION_STATUS, then triggers a real crash &
dump process.

After the system reboots back to normal, a vmcore creation check runs at
"kdumpctl (re)start/status" and reports the result to users as a
success/fail/manual status.

To achieve that, the program first checks the status in
$VMCORE_CREATION_STATUS. If a "pending" status is found, the test result
is undetermined and needs to be retrieved from the remote/local dump
folder. If the test id is found in the dump folder and the vmcore is
complete, "pending" is overwritten by "success", which indicates a
successful kdump test. If the test id is found in the dump folder but the
vmcore is incomplete, it is a "fail" kdump test. If no test id is found,
it is a "manual" status, which indicates users should check the test
results manually.

If $VMCORE_CREATION_STATUS already holds a success/fail/manual status, the
test result has already been determined, so the program will not access
the remote/local dump folder again. This avoids unnecessary access to the
dump target and shortens the time taken.
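
The status transitions described above can be summarised in a small sketch.
This is a simplified illustration, not the code added by this commit:
next_status is a hypothetical helper, and its second argument stands for the
result of looking up the test id in the dump folder (0 = found with a
complete vmcore, 1 = found with an incomplete vmcore, anything else = not
found).

```shell
# Hypothetical helper, not part of kdumpctl.
# $1: status currently stored (pending/success/fail/manual)
# $2: lookup result for the test id in the dump folder
next_status() {
    _stored=$1
    _lookup=$2

    # A result that is already determined is kept as-is, so the dump
    # target does not need to be accessed again.
    if [ "$_stored" != "pending" ]; then
        echo "$_stored"
        return 0
    fi

    case "$_lookup" in
        0) echo "success" ;;   # test id found, vmcore complete
        1) echo "fail" ;;      # test id found, vmcore incomplete
        *) echo "manual" ;;    # test id not found, check by hand
    esac
}

next_status pending 0
next_status manual 0
```

Only a "pending" entry ever triggers a lookup; every other stored status is
reported unchanged.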

Users should look for the root cause of a fail/manual status when it is
reported.

$VMCORE_CREATION_STATUS records the vmcore creation status of the current
environment. The format is:

   <status> kdump_test_id=<timestamp sec>-<timestamp nanosec>
e.g.:
   success kdump_test_id=1729823462-938751820

This means there has been a successful kdump test at the
$(date -d "@1729823462") timestamp for the current environment. The
nanosecond part only serves to make the id string unique.
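
As a rough illustration of the format above, a status line can be split
into its status word and its seconds field with plain parameter expansion.
parse_status_line is a hypothetical helper introduced only for this
example; it is not part of kdumpctl.

```shell
# Hypothetical helper, not part of kdumpctl: print "<status> <seconds>"
# for one line of the status file.
parse_status_line() {
    _line=$1
    _status=${_line%% *}            # first word: pending/success/fail/manual
    _id=${_line#*kdump_test_id=}    # <timestamp sec>-<timestamp nanosec>
    _sec=${_id%%-*}                 # seconds since the epoch
    printf '%s %s\n' "$_status" "$_sec"
}

parse_status_line "success kdump_test_id=1729823462-938751820"
# prints: success 1729823462
```

The seconds field can then be passed to GNU date, as in
date -d "@1729823462", to obtain a human-readable time.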

Difference
==========
Previously, commit 88525eb ("Introduce vmcore creation notification to
kdump") was merged to address the same issue, but it was implemented
differently:

The previous one:
Saves $VMCORE_CREATION_STATUS to the local drive during dumping in the
2nd kernel. If the vmcore dump target is different from the drive holding
$VMCORE_CREATION_STATUS, the latter needs to be mounted in the 2nd kernel.

This one:
Saves $VMCORE_CREATION_STATUS to the local drive only in the 1st kernel;
that is, the test result is retrieved after the 2nd kernel has finished
dumping. So it doesn't load or mount any other drive in the 2nd kernel.

The advantage:
Extra mounting in the 2nd kernel introduces a higher risk of failure and
so lowers the success rate of vmcore dumping, which is unacceptable.
Keeping the code for the 2nd kernel as simple as possible is preferred.

Usage
=====
[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: No vmcore creation test performed!

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Notice: No vmcore creation test performed!

[root@localhost ~]# kdumpctl test

[root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
pending kdump_test_id=1729823462-938751820

[root@localhost ~]# kdumpctl status
kdump: Kdump is operational
kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

[root@localhost ~]# cat /var/lib/kdump/vmcore-creation.status
success kdump_test_id=1729823462-938751820

[root@localhost ~]# kdumpctl restart
kdump: kexec: unloaded kdump kernel
kdump: Stopping kdump: [OK]
kdump: kexec: loaded kdump kernel
kdump: Starting kdump: [OK]
kdump: Notice: Last successful vmcore creation on Fri Oct 25 02:31:02 AM UTC 2024

Note: the notification for kdumpctl (re)start/status can be disabled by
setting VMCORE_CREATION_NOTIFICATION="" in /etc/sysconfig/kdump. fadump is
NOT supported by this feature.
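
For reference, a minimal config fragment that turns the notification off
might look like the following. It assumes the check shown in
check_vmcore_creation_status in this diff, which only reports when the
value equals "yes" (compared case-insensitively).

```shell
# /etc/sysconfig/kdump
# Any value other than "yes" disables the vmcore creation notice.
VMCORE_CREATION_NOTIFICATION=""
```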

Signed-off-by: Tao Liu <[email protected]>
liutgnu committed Nov 22, 2024
1 parent fb2c426 commit 4958972
Showing 4 changed files with 216 additions and 1 deletion.
19 changes: 19 additions & 0 deletions dracut/99kdumpbase/kdump.sh
@@ -134,6 +134,7 @@ save_log() {
# $1: dump path, must be a mount point
dump_fs() {
ddebug "dump_fs _mp=$1"
_test_id=$(getarg kdump_test_id)

if ! is_mounted "$1"; then
dinfo "dump path '$1' is not mounted, trying to mount..."
@@ -168,6 +169,11 @@ dump_fs() {

mkdir -p "$_dump_fs_path" || return 1

if [ -n "$_test_id" ]; then
echo "fail kdump_test_id=$_test_id" > "$_dump_fs_path/vmcore-creation.status"
sync -f "$_dump_fs_path/vmcore-creation.status"
fi

save_vmcore_dmesg_fs ${DMESG_COLLECTOR} "$_dump_fs_path"
save_opalcore_fs "$_dump_fs_path"

@@ -183,6 +189,10 @@ dump_fs() {
if [ $_sync_exitcode -eq 0 ]; then
mv "$_dump_fs_path/vmcore-incomplete" "$_dump_fs_path/vmcore"
dinfo "saving vmcore complete"
if [ -n "$_test_id" ]; then
echo "success kdump_test_id=$_test_id" > "$_dump_fs_path/vmcore-creation.status"
sync -f "$_dump_fs_path/vmcore-creation.status"
fi
else
derror "sync vmcore failed, exitcode:$_sync_exitcode"
return 1
@@ -405,6 +415,7 @@ dump_ssh() {
fi

dinfo "saving to $2:$_ssh_dir"
_test_id=$(getarg kdump_test_id)

cat /var/lib/random-seed > /dev/urandom
# shellcheck disable=SC2086 # ssh_opts needs to be split
@@ -443,11 +454,19 @@ dump_ssh() {
derror "moving vmcore failed, exitcode:$_ret"
else
dinfo "saving vmcore complete"
if [ -n "$_test_id" ]; then
ssh $_ssh_opts "$2" "echo \"success kdump_test_id=$_test_id\" > '$_ssh_dir/vmcore-creation.status'"
fi
return $_ret
fi
else
derror "saving vmcore failed, exitcode:$_ret"
fi

if [ -n "$_test_id" ]; then
ssh $_ssh_opts "$2" "echo \"fail kdump_test_id=$_test_id\" > '$_ssh_dir/vmcore-creation.status'"
fi

return $_ret
}

4 changes: 4 additions & 0 deletions gen-kdump-sysconfig.sh
@@ -51,6 +51,10 @@ KDUMP_IMG="vmlinuz"
#What is the images extension. Relocatable kernels don't have one
KDUMP_IMG_EXT=""
# Enable vmcore creation notification by default, disable by setting
# VMCORE_CREATION_NOTIFICATION=""
VMCORE_CREATION_NOTIFICATION="yes"
# Logging is controlled by following variables in the first kernel:
# - @var KDUMP_STDLOGLVL - logging level to standard error (console output)
# - @var KDUMP_SYSLOGLVL - logging level to syslog (by logger command)
184 changes: 183 additions & 1 deletion kdumpctl
@@ -14,6 +14,7 @@ KDUMP_INITRD=""
TARGET_INITRD=""
#kdump shall be the default dump mode
DEFAULT_DUMP_MODE="kdump"
VMCORE_CREATION_STATUS="/var/lib/kdump/vmcore-creation.status"

# Some default values in case /etc/sysconfig/kdump doesn't include
KDUMP_COMMANDLINE_REMOVE="hugepages hugepagesz slub_debug"
@@ -45,8 +46,10 @@ if ! dlog_init; then
fi

KDUMP_TMPDIR=$(mktemp --tmpdir -d kdump.XXXX)
TMPMNT="$KDUMP_TMPDIR/target"
trap '
ret=$?;
is_mounted $TMPMNT && umount -f $TMPMNT;
rm -rf "$KDUMP_TMPDIR"
exit $ret;
' EXIT
@@ -173,6 +176,7 @@ rebuild_kdump_initrd()

rebuild_initrd()
{
local _ret
dinfo "Rebuilding $TARGET_INITRD"

if [[ ! -w $(dirname "$TARGET_INITRD") ]]; then
@@ -185,6 +189,10 @@ rebuild_initrd()
else
rebuild_kdump_initrd
fi
_ret=$?

set_vmcore_creation_status 'clear'
return $_ret
}

#$1: the files to be checked with IFS=' '
@@ -1043,6 +1051,7 @@ start()
start_dump || return

dinfo "Starting kdump: [OK]"
check_vmcore_creation_status
return 0
}

@@ -1666,6 +1675,174 @@ _should_reset_crashkernel() {
[[ $(kdump_get_conf_val auto_reset_crashkernel) != no ]] && systemctl is-enabled kdump &> /dev/null
}

set_kdump_test_id()
{
local _id=$1

KDUMP_COMMANDLINE_APPEND+=" $_id "
reload >& /dev/null

if [[ "$?" -ne 0 ]]; then
derror "Set kdump test id fail."
exit 1
fi
}

# $1: success/fail/pending/manual/clear
# $2: test id
set_vmcore_creation_status()
{
local _status=$1
local _kdump_test_id
_dir=$(dirname "$VMCORE_CREATION_STATUS")

[[ -d "$_dir" ]] || mkdir -p "$_dir"
[[ -w "$_dir" ]] || chmod +w "$_dir"

case "$_status" in
pending)
_kdump_test_id="kdump_test_id=$(date +%s-%N)"
set_kdump_test_id "$_kdump_test_id"
echo "$_status $_kdump_test_id" > "$VMCORE_CREATION_STATUS"
;;
success | fail | manual)
sed -E -i "s/^\w+/$_status/" "$VMCORE_CREATION_STATUS"
;;
clear)
rm -f "$VMCORE_CREATION_STATUS"
;;
*)
return
esac
sync -f "$_dir"
}

# $1: vmcore path
# $2: test id
fetch_target()
{
target_status_files=$(find "$1" -name "vmcore-creation.status" 2> /dev/null)
[ -z "$target_status_files" ] && return 2
target_status_files=$(ls -t $target_status_files | head -20)
target_status_file=$(grep -l "$2" $target_status_files | uniq)

case $(echo -n "$target_status_file" | wc -w) in
1)
grep -q "success" "$target_status_file"
case "$?" in
0)
# Success
return 0
;;
*)
# Fail
return 1
;;
esac
;;
*)
# None or 2 more files found containing the test id
return 2
;;
esac
}

fetch_test_status()
{
local _test_id="$1" _mnt

if is_nfs_dump_target || is_local_target; then
_mnt=$(get_mntpoint_from_target "${OPT[_target]}")
if [[ -z "$_mnt" ]] || ! is_mounted "$_mnt"; then
mkdir -p $TMPMNT
mount "${OPT[_target]}" "$TMPMNT" -t "${OPT[_fstype]}" -o defaults || \
{ dwarn "Failed to mount ${OPT[_target]}" && return 2; }
_mnt="$TMPMNT"
fi
fetch_target "$_mnt/${OPT[path]}" "$_test_id"
elif is_ssh_dump_target; then
ssh -i "${OPT[sshkey]}" -o BatchMode=yes "${OPT[_target]}" \
"$(typeset -f fetch_target); fetch_target ${OPT[path]} $_test_id"
elif is_raw_dump_target; then
return 2
else
fetch_target "${OPT[path]}" "$_test_id"
fi
}

check_vmcore_creation_status()
{
local _status _test_id _timestamp _status_date

[[ ${VMCORE_CREATION_NOTIFICATION,,} == "yes" ]] || return

[[ "$DEFAULT_DUMP_MODE" == "kdump" ]] || return

if [[ ! -s "$VMCORE_CREATION_STATUS" ]]; then
dwarn "Notice: No vmcore creation test performed!"
return
fi

[[ "${#OPT[@]}" -eq 0 ]] && { parse_config || return; }

read -r _status _test_id < "$VMCORE_CREATION_STATUS"
_timestamp="$(echo "$_test_id" | awk -F '[-=]' '{print $2}')"
_status_date=$(date -d "@$_timestamp")

if [[ "$_status" == "pending" ]]; then
fetch_test_status "$_test_id"
case "$?" in
0)
set_vmcore_creation_status "success"
dinfo "Notice: Last successful vmcore creation on $_status_date"
;;
1)
set_vmcore_creation_status "fail"
dwarn "Notice: Last NOT successful vmcore creation on $_status_date"
;;
*)
set_vmcore_creation_status "manual"
dwarn "Notice: Require manual check for kdump test of $_status_date"
;;
esac
elif [[ "$_status" == "success" ]]; then
dinfo "Notice: Last successful vmcore creation on $_status_date"
elif [[ "$_status" == "fail" ]]; then
dwarn "Notice: Last NOT successful vmcore creation on $_status_date"
elif [[ "$_status" == "manual" ]]; then
dwarn "Notice: Require manual check for kdump test of $_status_date"
fi
}

kdump_test()
{
if ! is_kernel_loaded "$DEFAULT_DUMP_MODE"; then
derror "Kdump needs be operational before test."
exit 1
fi

if [[ ! "$DEFAULT_DUMP_MODE" == "kdump" ]]; then
derror "Only kdump is supported for test."
exit 1
fi

if [[ ! "$1" == "--force" ]]; then
read -p "DANGER!!! Will perform a kdump test by crashing the system, proceed? (y/N): " input
case $input in
[Yy] )
dinfo "Start kdump test..."
;;
* )
dinfo "Operation cancelled."
exit 0
;;
esac
fi

set_vmcore_creation_status 'pending'
echo c > /proc/sysrq-trigger
}

main()
{
# Determine if the dump mode is kdump or fadump
@@ -1697,6 +1874,7 @@ main()
EXIT_CODE=3
;;
esac
check_vmcore_creation_status
exit $EXIT_CODE
;;
reload)
@@ -1741,8 +1919,12 @@ main()
reset_crashkernel_for_installed_kernel "$2"
fi
;;
test)
shift
kdump_test "$@"
;;
*)
dinfo $"Usage: $0 {estimate|start|stop|status|restart|reload|rebuild|reset-crashkernel|propagate|showmem}"
dinfo $"Usage: $0 {estimate|start|stop|status|restart|reload|rebuild|reset-crashkernel|propagate|showmem|test}"
exit 1
;;
esac
10 changes: 10 additions & 0 deletions kdumpctl.8
@@ -66,6 +66,16 @@ Note: The memory requirements for kdump varies heavily depending on the
used hardware and system configuration. Thus the recommended
crashkernel might not work for your specific setup. Please test if
kdump works after resetting the crashkernel value.
.TP
.I test [--force]
Test the kdump by actually trigger the system crash & dump, and check if a
vmcore can really be generated successfully based on current config and
environment. After system reboot back to normal, check the test result
by "kdumpctl status".

If the optional parameter [--force] is provided, there will be no interact
before triggering the system crash. Dangerous though, this option is meant
for automation testing.

.SH "SEE ALSO"
.BR kdump.conf (5),