Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Falco 0.36.1 and 0.36.0 segfault sometimes with syscall (evt null) #2878

Closed
max-frank opened this issue Oct 18, 2023 · 12 comments
Closed

Falco 0.36.1 and 0.36.0 segfault sometimes with syscall (evt null) #2878

max-frank opened this issue Oct 18, 2023 · 12 comments
Assignees
Labels
Milestone

Comments

@max-frank
Copy link

max-frank commented Oct 18, 2023

Describe the bug

After the upgrade to 0.36.0 (and consecutively 0.36.1) we noticed that our falco daemonsets were restarting multiple times overnight. Note that this is in a low traffic cluster only used for testing the falco deployment so should be unrelated to K8S metadata performance issues.

Pods seem to exit with CODE: 139 seemingly randomly. To investigate we built falco with debug release in the orignal pod environment and let it run until it crashed. See the stack trace and logs from one such below.

As you can see in the stack trace the cause of the segfault seems to be that the evt pointer seems to be null during the get_source_idx call in the main inspect loop. Looking at the code the only reason this could happen is if the sinsp::next at the beginning of the loop set the pointer to null. If this is not expected behaviour then this might be a bug in the libs rather then a bug in falco caused by a missing null check.

How to reproduce it

Start falco with syscall source and wait. The pod will eventually crash.

Expected behaviour

No crashes happen.

Environment

GKE 1.26.6-gke.1700 (COS)

  • Falco version:
{"default_driver_version":"6.0.1+driver","driver_api_version":"5.0.0","driver_schema_version":"2.0.0","engine_version":"26","falco_version":"0.36.1","libs_version":"0.13.2","plugin_api_version":"3.1.0"}

Falco version (debug build):

{"default_driver_version":"6.0.1+driver","driver_api_version":"5.0.0","driver_schema_version":"2.0.0","engine_version":"26","falco_version":"0.36.1","libs_version":"0.13.2","plugin_api_version":"3.1.0"}
  • System info:
{
  "machine": "x86_64",
  "nodename": "some node",
  "release": "5.15.107+",
  "sysname": "Linux",
  "version": "#1 SMP Sat May 20 09:38:40 UTC 2023"
}
  • Cloud provider or hardware configuration:
  • OS:
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel:
Linux somenode 5.15.107+ #1 SMP Sat May 20 09:38:40 UTC 2023 x86_64 GNU/Linux
  • Installation method:

  • Custom K8S manifest using standard container image

  • For debugging built falco from source as DEBUG release (with ASSERTIONS stripped)

Additional context

Debug run output

Wed Oct 18 01:15:43 2023: Falco version: 0.36.1 (x86_64)
Wed Oct 18 01:15:43 2023: Falco initialized with configuration file: /tmp/make/falco/falco.yaml
Wed Oct 18 01:15:43 2023: Loading rules from file /etc/falco/falco_rules.yaml
Wed Oct 18 01:15:43 2023: Loading rules from file /etc/falco/falco_rules.local.yaml
Wed Oct 18 01:15:43 2023: Loading rules from file /etc/falco/rules.d/custom.yaml
Wed Oct 18 01:15:43 2023: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Wed Oct 18 01:15:43 2023: Starting health webserver with threadiness 2, listening on port 8765
Wed Oct 18 01:15:43 2023: Loaded event sources: syscall
Wed Oct 18 01:15:43 2023: Enabled event sources: syscall
Wed Oct 18 01:15:43 2023: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-bpf.o

Stack trace
Is too long so pastebin here

@FedeDP
Copy link
Contributor

FedeDP commented Oct 18, 2023

Hi! Thanks for the detailed research!
Indeed, there is a single point in sinsp::next where we can return SCAP_SUCCESS without setting *puevt pointer, here: https://github.com/falcosecurity/libs/blame/master/userspace/libsinsp/sinsp.cpp#L1439
As you can see we return res (that is SCAP_SUCCESS over there since this piece of code is after the if res == ... statements) but puevt was not touched at all.
Are you willing to test if changing that if statement to:

if(ptinfo)
{
	for(uint32_t j = 0; j < nfdr; j++)
	{
		ptinfo->remove_fd(m_fds_to_remove->at(j));
	}
	m_fds_to_remove->clear();
}

fixes the issue?

@max-frank
Copy link
Author

@FedeDP I've replaced https://github.com/falcosecurity/libs/blob/5b4e28d44f9c87aa23f24c53c6bfb696b2b47635/userspace/libsinsp/sinsp.cpp#L1436-L1447 with the suggested lines and compiled the debug build.

I've got it running now now I can just wait and see if it will crash or not.

@FedeDP
Copy link
Contributor

FedeDP commented Oct 18, 2023

@max-frank you are the best tester out there! Thank you very much! Fingers 🤞

@FedeDP
Copy link
Contributor

FedeDP commented Oct 19, 2023

@max-frank any new crash after applying the patch? :)

@max-frank
Copy link
Author

@FedeDP Ok so its been officially around 24h and can report not crash yet
image

@FedeDP
Copy link
Contributor

FedeDP commented Oct 24, 2023

/milestone 0.36.2

We are arranging a 0.36.2 patch release for Falco to fix this :)
See #2885

@poiana poiana modified the milestones: 0.37.0, 0.36.2 Oct 24, 2023
@max-frank
Copy link
Author

Very nice much appreciated

@FedeDP
Copy link
Contributor

FedeDP commented Oct 25, 2023

Hi! I just released Falco 0.36.2-rc1 that should solve this issue; care to try it?
You can use docker images or packages just like you would do for a normal Falco release!

@max-frank
Copy link
Author

max-frank commented Oct 26, 2023

Just deployed it in our local deployment test cluster. No crashes for syscall falco yet with the release candidate.

We are still experiencing issues with gvisor crashing on startup on our GKE 1.26 configuration, but that is a pre-existing issue I did not have time to debug yet. So I don't know if its a configuration issue on our end or an issue in falco. (I will open an issue though for it if I have time to figure it out.)

@FedeDP
Copy link
Contributor

FedeDP commented Oct 26, 2023

Thanks for your feedback, it is really valuable! And thanks for helping us spot this issue!
About gvisor, yes please let's track it in a new issue eventually!
Thank you very much!

@FedeDP
Copy link
Contributor

FedeDP commented Oct 27, 2023

This is fixed in Falco 0.36.2, just released!
Thank you very much!
/close

@poiana poiana closed this as completed Oct 27, 2023
@poiana
Copy link
Contributor

poiana commented Oct 27, 2023

@FedeDP: Closing this issue.

In response to this:

This is fixed in Falco 0.36.2, just released!
Thank you very much!
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants