Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nohang with post_kill_exe #132

Open
bearbear781 opened this issue Aug 10, 2023 · 13 comments
Open

Nohang with post_kill_exe #132

bearbear781 opened this issue Aug 10, 2023 · 13 comments

Comments

@bearbear781
Copy link

I have configured post_kill_exe in nohang.conf to execute certain actions after killing a process. However, I've noticed that post_kill_exe is not triggered consistently for every kill event.
It seems that it is triggered when the message "The victim became a zombie in 0.01s" is displayed, but not when "The victim doesn't respond on corrective action in 0.031s" is displayed.

@bearbear781
Copy link
Author

below is my nohang.conf

nohang -c /etc/nohang/nohang.conf --check
Starting nohang with config /etc/nohang/nohang.conf

  1. Check kernel messages for OOM events
    @check_kmsg:
    @debug_kmsg:

  2. Common zram settings
    zram_checking_enabled: False

  3. Common PSI settings
    psi_checking_enabled: False
    psi_path: /proc/pressure/memory
    psi_metrics: full_avg10
    psi_excess_duration: 30.0 sec
    psi_post_action_delay: 15.0 sec

  4. Poll rate
    fill_rate_mem: 6000.0
    fill_rate_swap: 2000.0
    fill_rate_zram: 4000.0
    max_sleep: 3.0 sec
    min_sleep: 0.1 sec

  5. Warnings and notifications
    post_action_gui_notifications: False
    hide_corrective_action_type: False
    low_memory_warnings_enabled: True
    warning_exe:
    warning_threshold_min_mem: 1536 MiB, 20.0 %
    warning_threshold_min_swap: 25 %
    warning_threshold_max_zram: 3456 MiB, 45.0 %
    warning_threshold_max_psi: 10.0
    min_post_warning_delay: 60.0 sec
    env_cache_time: 300.0

  6. Soft threshold
    soft_threshold_min_mem: 384 MiB, 5.0 %
    soft_threshold_min_swap: 10 %
    soft_threshold_max_zram: 4224 MiB, 55.0 %
    soft_threshold_max_psi: 40.0

  7. Hard threshold
    hard_threshold_min_mem: 154 MiB, 2.0 %
    hard_threshold_min_swap: 4 %
    hard_threshold_max_zram: 4608 MiB, 60.0 %
    hard_threshold_max_psi: 90.0

  8. Customize victim selection: adjusting badness of processes

7.1. Ignore positive oom_score_adj
ignore_positive_oom_score_adj: False

7.2. Adjusting badness of processes by matching with regular expressions
7.2.1. Matching process names with RE patterns
(not set)
7.2.2. Matching CGroup_v1-line with RE patterns
(not set)
7.2.3. Matching CGroup_v2-line with RE patterns
(not set)
7.2.4. Matching eUIDs with RE patterns
(not set)
7.2.5. Matching realpath with RE patterns
(not set)
7.2.6. Matching /proc/[pid]/cwd realpath with RE patterns
(not set)
7.2.7. Matching cmdlines with RE patterns
(not set)
7.2.8. Matching environ with RE patterns
(not set)

  1. Customize soft corrective actions
    (not set)

  2. Misc
    max_soft_exit_time: 10.0 sec
    post_kill_exe: (my_action)
    min_badness: 1
    post_soft_action_delay: 3.0 sec
    post_zombie_delay: 0.1 sec
    victim_cache_time: 10.0 sec
    exe_timeout: 20.0 sec

  3. Verbosity
    print_config_at_startup: False
    print_mem_check_results: True
    min_mem_report_interval: 60.0 sec
    print_proc_table: False
    extra_table_info: None
    print_victim_status: True
    print_victim_cmdline: False
    max_victim_ancestry_depth: 3
    print_statistics: True
    debug_gui_notifications: False
    debug_psi: False
    debug_sleep: False
    debug_threading: False
    separate_log: True

@hakavlad
Copy link
Owner

hakavlad commented Aug 10, 2023

The post_kill_exe command starts only after the hard corrective action.

Sorry it's not mentioned in the documentation.

You can configure part 8: Customize soft corrective actions
to CHANGE soft corrective action: a command will be executed instead of sending the SIGTERM signal.

For example:

@SOFT_ACTION_RE_NAME  /// your_command...

@bearbear781
Copy link
Author

I don't need to change the soft correction action, my post_kill_exe is to execute curl to send a notification to telegram after hard correction

@bearbear781
Copy link
Author

Here is the log

2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>>
2023-08-10 16:53:26,460: Memory status that requires corrective actions:
MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %]
SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %]
2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms
2023-08-10 16:53:26,463: TOP-9 tasks by badness:
2023-08-10 16:53:26,463: Name PID badness
2023-08-10 16:53:26,463: --------------- ------- -------
2023-08-10 16:53:26,463: mysqld 903309 849
2023-08-10 16:53:26,463: monit 7469 17
2023-08-10 16:53:26,463: vmagent 329736 5
2023-08-10 16:53:26,463: metricbeat 7217 4
2023-08-10 16:53:26,463: pmm-agent 329710 3
2023-08-10 16:53:26,463: ssm-agent-worke 3352 1
2023-08-10 16:53:26,463: wazuh-modulesd 51276 1
2023-08-10 16:53:26,463: mysqld_exporter 329735 1
2023-08-10 16:53:26,463: node_exporter 329737 1
2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness:
PID: 903309, name: mysqld, badness: 849
2023-08-10 16:53:26,463: Recheck memory levels...
2023-08-10 16:53:26,464: Memory status that requires corrective actions:
MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %]
SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %]
2023-08-10 16:53:26,464: Victim status (found in 0ms):
PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s
badness: 849, oom_score: 849, oom_score_adj: 0
Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736
cgroup_v1: /system.slice/mysqld.service
cgroup_v2:
ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd)
exe realpath: /usr/sbin/mysqld
cwd realpath: /var/lib/mysql
2023-08-10 16:53:26,464: Memory info, MiB:
2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534,
2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swap_free=104
2023-08-10 16:53:26,464: Implementing a corrective action:
Sending SIGTERM to the victim
2023-08-10 16:53:26,464: OK; total response time: 4ms
2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s:
[ OK ] Sending SIGTERM to mysqld: 7
[ OK ] Sending SIGKILL to mysqld: 1
2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s
2023-08-10 16:53:26,495: Memory status after implementing a corrective action:
MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB
2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<

@bearbear781
Copy link
Author

Here is the log

2023-08-10 16:53:26,460: >>=== STARTING implement_corrective_action() ====>> 2023-08-10 16:53:26,460: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,462: Found 9 tasks with non-zero oom_score (except init and self) in 2ms 2023-08-10 16:53:26,463: TOP-9 tasks by badness: 2023-08-10 16:53:26,463: Name PID badness 2023-08-10 16:53:26,463: --------------- ------- ------- 2023-08-10 16:53:26,463: mysqld 903309 849 2023-08-10 16:53:26,463: monit 7469 17 2023-08-10 16:53:26,463: vmagent 329736 5 2023-08-10 16:53:26,463: metricbeat 7217 4 2023-08-10 16:53:26,463: pmm-agent 329710 3 2023-08-10 16:53:26,463: ssm-agent-worke 3352 1 2023-08-10 16:53:26,463: wazuh-modulesd 51276 1 2023-08-10 16:53:26,463: mysqld_exporter 329735 1 2023-08-10 16:53:26,463: node_exporter 329737 1 2023-08-10 16:53:26,463: TOP printed in 1ms; process with highest badness: PID: 903309, name: mysqld, badness: 849 2023-08-10 16:53:26,463: Recheck memory levels... 2023-08-10 16:53:26,464: Memory status that requires corrective actions: MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %] SwapFree [104 MiB, 5.1 %] <= soft_threshold_min_swap [205 MiB, 10.0 %] 2023-08-10 16:53:26,464: Victim status (found in 0ms): PID: 903309, name: mysqld, state: S (sleeping), EUID: 27, SID: 902753 (), lifetime: 4d 3h 32min 2s badness: 849, oom_score: 849, oom_score_adj: 0 Vm, MiB: Size: 10762, RSS: 6507 (Anon: 6507, File: 0, Shmem: 0), Swap: 1736 cgroup_v1: /system.slice/mysqld.service cgroup_v2: ancestry: PID 902781 (mysqld_safe) <= PID 1 (systemd) exe realpath: /usr/sbin/mysqld cwd realpath: /var/lib/mysql 2023-08-10 16:53:26,464: Memory info, MiB: 2023-08-10 16:53:26,464: total=7680, used=7029, free=118, available=383, shared=0, buffers=0, cache=534, 2023-08-10 16:53:26,464: swap_total=2048, swap_used=1944, swap_free=104 2023-08-10 16:53:26,464: Implementing a corrective action: Sending SIGTERM to the victim 2023-08-10 16:53:26,464: OK; total response time: 4ms 2023-08-10 16:53:26,464: What happened in the last 14d 10h 11min 19s: [ OK ] Sending SIGTERM to mysqld: 7 [ OK ] Sending SIGKILL to mysqld: 1 2023-08-10 16:53:26,495: The victim doesn't respond on corrective action in 0.031s 2023-08-10 16:53:26,495: Memory status after implementing a corrective action: MemAvailable: 382.8 MiB, SwapFree: 104.5 MiB 2023-08-10 16:53:26,595: <<=== FINISHING implement_corrective_action() ===<<

I compared and found that when post_kill_exe is executed normally, there will be this line in the log, but the log I provided does not have this line, so post_kill_exe is not executed
max_soft_exit_time (value=10.0s) is exceeded for the victim, hard corrective action will be applied

@hakavlad
Copy link
Owner

hakavlad commented Aug 10, 2023

does not have this line, because it is soft corrective action:

MemAvailable [383 MiB, 5.0 %] <= soft_threshold_min_mem [384 MiB, 5.0 %]
Sending SIGTERM to the victim

@hakavlad
Copy link
Owner

hakavlad commented Aug 10, 2023

The victim doesn't respond on corrective action in 0.031s

The process takes some time to complete. The hard threshold was not exceeded (in this log).

@hakavlad
Copy link
Owner

Could you please provide logs with HARD corrective action and post_kill_exe was not executed?

@bearbear781
Copy link
Author

Is there a trigger like post_kill_exe that can be used in soft corrective action?

@hakavlad
Copy link
Owner

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."

@bearbear781
Copy link
Author

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."

I will try for it, any problem will update again, thank you

@bearbear781
Copy link
Author

trigger like post_kill_exe that can be used in soft corrective action

@SOFT_ACTION_RE_NAME  /// sh -c "kill -TERM $PID && your_command..."

Can this be placed at the end of the config file? Or does it have to be in part 8?

@hakavlad
Copy link
Owner

Can be placed anywhere in the configuration file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants