Falco 0.36.1 and 0.36.0 segfault sometimes with syscall (evt null) #2878
Comments
Hi! Thanks for the detailed research!
Could you check whether the suggested change fixes the issue?
@FedeDP I've replaced https://github.com/falcosecurity/libs/blob/5b4e28d44f9c87aa23f24c53c6bfb696b2b47635/userspace/libsinsp/sinsp.cpp#L1436-L1447 with the suggested lines and compiled the debug build. I've got it running now; I can just wait and see whether it crashes or not.
@max-frank you are the best tester out there! Thank you very much! Fingers 🤞
@max-frank any new crash after applying the patch? :)
@FedeDP OK, it's officially been around 24h now and I can report no crashes yet.
/milestone 0.36.2
We are arranging a 0.36.2 patch release for Falco to fix this :)
Very nice, much appreciated!
Hi! I just released Falco 0.36.2-rc1 that should solve this issue; care to try it?
Just deployed it in our local test cluster. No crashes for syscall Falco yet with the release candidate. We are still experiencing issues with gVisor crashing on startup in our GKE 1.26 configuration, but that is a pre-existing problem I have not had time to debug yet, so I don't know whether it's a configuration issue on our end or an issue in Falco. (I will open a separate issue for it once I have time to figure it out.)
Thanks for your feedback, it is really valuable! And thanks for helping us spot this issue!
This is fixed in Falco 0.36.2, just released!
@FedeDP: Closing this issue.
Describe the bug
After the upgrade to 0.36.0 (and subsequently 0.36.1), we noticed that our Falco daemonsets were restarting multiple times overnight. Note that this is a low-traffic cluster used only for testing the Falco deployment, so this should be unrelated to the K8s metadata performance issues.
Pods seem to exit with code 139 (SIGSEGV) seemingly at random. To investigate, we built Falco as a debug release in the original pod environment and let it run until it crashed. See the stack trace and logs from one such crash below.
As you can see in the stack trace, the cause of the segfault appears to be that the evt pointer is null during the get_source_idx call in the main inspect loop. Looking at the code, the only way this can happen is if the sinsp::next call at the beginning of the loop sets the pointer to null. If that is not expected behaviour, then this might be a bug in the libs rather than a bug in Falco caused by a missing null check.
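For illustration only, here is a minimal C++ sketch of the kind of defensive guard being discussed. The loop shape and the names inspect_loop and handle_event are assumptions made for the example, not Falco's actual implementation, and the suggested libs patch itself is not quoted in this thread:

```cpp
// Hypothetical, simplified sketch of a defensively-guarded event loop.
// Names (inspect_loop, handle_event) are illustrative, not Falco's code.
#include <libsinsp/sinsp.h> // libsinsp inspector API (include path may vary by version)

void inspect_loop(sinsp& inspector)
{
    while (true)
    {
        sinsp_evt* evt = nullptr;
        int32_t res = inspector.next(&evt);

        // On SCAP_TIMEOUT, evt may legitimately come back null: skip the
        // iteration instead of dereferencing it (the suspected missing
        // guard before the get_source_idx lookup).
        if (res == SCAP_TIMEOUT || evt == nullptr)
        {
            continue;
        }

        if (res != SCAP_SUCCESS)
        {
            break; // a real loop would log and report the error
        }

        // ... resolve the event's source index and evaluate rules ...
        // handle_event(evt); // hypothetical handler
    }
}
```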
How to reproduce it
Start Falco with the syscall source and wait. The pod will eventually crash.
Expected behaviour
No crashes happen.
Environment
GKE 1.26.6-gke.1700 (COS)
Falco version (debug build):
Installation method:
Custom K8S manifest using standard container image
For debugging, built Falco from source as a DEBUG release (with assertions stripped)
Additional context
Debug run output
Stack trace
Too long to include inline; see the pastebin here.