Falco crash after few minutes on GKE 1.27 on COS #3278

Closed
judikag03 opened this issue Jul 10, 2024 · 26 comments

@judikag03

judikag03 commented Jul 10, 2024

Describe the bug
Hi

We are installing Falco on one of our clusters running 1.27.11-gke.1062004 with Container-Optimized OS for GKE (kernel 5.15.146+), and each falco-no-driver:0.38.1 pod regularly goes into CrashLoopBackOff when installing the eBPF driver.

--

How to reproduce it

It is deployed as a daemonset using the latest Helm chart (Falco 0.38.1) on a GKE cluster running Kubernetes 1.27.
The Falco config is:

        initContainers:
        - name: falco-driver-loader
          image: docker.io/falcosecurity/falco-driver-loader:0.38.1
          imagePullPolicy: IfNotPresent
          args:
            - ebpf
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /root/.falco
              name: root-falco-fs
            - mountPath: /host/proc
              name: proc-fs
              readOnly: true
            - mountPath: /host/boot
              name: boot-fs
              readOnly: true
            - mountPath: /host/lib/modules
              name: lib-modules
            - mountPath: /host/usr
              name: usr-fs
              readOnly: true
            - mountPath: /host/etc
              name: etc-fs
              readOnly: true
          env:
            - name: HOST_ROOT
              value: /host
            - name: FALCOCTL_DRIVER_CONFIG_UPDATE_FALCO
              value: "false"

After a few minutes, the container crashes (exitCode: 1). Here is the container log:
2024-07-10 02:23:08 ERROR no supported driver found for distro: cos, kernelrelease , kernelversion #1 SMP Sat Feb 17 13:12:02 UTC 2024, arch x86_64
2024-07-10 02:23:08 INFO Running falcoctl driver install
├ driver version: 7.2.0+driver
├ driver type: modern_ebpf
├ driver name: falco
├ compile: true
├ download: true
├ target: cos
├ arch: x86_64
├ kernel release:
└ kernel version: #1 SMP Sat Feb 17 13:12:02 UTC 2024

Expected behaviour

No crash :)

Screenshots
Screenshot 2024-07-10 at 09 43 43

Environment

Latest Helm chart (Falco 0.38.1) as a daemonset, on a GKE cluster running Kubernetes 1.27.

Wed Jul 10 09:33:17 2024: Falco version: 0.38.1 (x86_64)
Wed Jul 10 09:33:17 2024: Falco initialized with configuration files:
Wed Jul 10 09:33:17 2024: /etc/falco/falco.yaml
Wed Jul 10 09:33:17 2024: System info: Linux version 5.15.0-113-generic (buildd@lcy02-amd64-017) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #123~20.04.1-Ubuntu SMP Wed Jun 12 17:33:13 UTC 2024
Falco version: 0.38.1
Libs version: 0.17.2
Plugin API: 3.6.0
Engine: 0.40.0
Driver:
API version: 8.0.0
Schema version: 2.0.0
Default driver: 7.2.0+driver

Cloud provider or hardware configuration: GKE cluster running Kubernetes 1.27.
OS: cos_containerd
Kernel: Linux 5.15.146+

@FedeDP
Contributor

FedeDP commented Jul 15, 2024

Hi! Thanks for opening this issue! Can you share the logs from the falco container, if any?

@judika03

judika03 commented Jul 16, 2024

Tue Jul 16 06:40:20 2024: Using deprecated config key 'rules_file' (singular form). Please use new 'rules_files' config key (plural form).
Tue Jul 16 06:40:20 2024: Falco version: 0.38.1 (x86_64)
Tue Jul 16 06:40:20 2024: Falco initialized with configuration files:
Tue Jul 16 06:40:20 2024:    /etc/falco/falco.yaml
Tue Jul 16 06:40:20 2024: System info: Linux version 5.15.154+ (builder@98a8fd0ef88f) (Chromium OS 15.0_pre458507_p20220602-r18 clang version 15.0.0 (/var/tmp/portage/sys-devel/llvm-15.0_pre458507_p20220602-r18/work/llvm-15.0_pre458507_p20220602/clang a58d0af058038595c93de961b725f86997cf8d4a), LLD 15.0.0) #1 SMP Thu Jun 27 20:43:36 UTC 2024
Tue Jul 16 06:40:20 2024: Loading rules from file /etc/falco/falco_rules.yaml
Tue Jul 16 06:40:21 2024: Loading rules from file /etc/falco/falco_rules.local.yaml
Tue Jul 16 06:40:21 2024: Hostname value has been overridden via environment variable to: gke-falco-cluster-default-pool-ca977e8c-qrcf
Tue Jul 16 06:40:21 2024: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Tue Jul 16 06:40:21 2024: Starting health webserver with threadiness 2, listening on 0.0.0.0:8765
Tue Jul 16 06:40:21 2024: Loaded event sources: syscall
Tue Jul 16 06:40:21 2024: Enabled event sources: syscall
Tue Jul 16 06:40:21 2024: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-probe-bpf.o
Tue Jul 16 06:40:21 2024: An error occurred in an event source, forcing termination...
Error: can't open BPF probe '/root/.falco/falco-probe-bpf.o'
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:

This error occurs because the falco-driver-loader container isn't building the driver. Here is the falco-driver-loader log:

2024-07-16 06:39:54 ERROR no supported driver found for distro: cos, kernelrelease , kernelversion #1 SMP Thu Jun 27 20:43:36 UTC 2024, arch x86_64
2024-07-16 06:39:55 INFO  Running falcoctl driver install
                      ├ driver version: 7.2.0+driver
                      ├ driver type: modern_ebpf
                      ├ driver name: falco
                      ├ compile: true
                      ├ download: true
                      ├ target: cos
                      ├ arch: x86_64
                      ├ kernel release:
                      └ kernel version: #1 SMP Thu Jun 27 20:43:36 UTC 2024
2024-07-16 06:39:55 INFO  No artifacts needed for the selected driver.

@FedeDP please help me find a solution.

@FedeDP
Contributor

FedeDP commented Jul 16, 2024

Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-probe-bpf.o

The weird thing is that the falco container is using the bpf driver, while falco-driver-loader is (correctly) using the modern_ebpf driver.
You only need to deploy Falco with either the default (automatic driver selection) or:

initContainers:
        - name: falco-driver-loader
          image: docker.io/falcosecurity/falco-driver-loader:0.38.1
          imagePullPolicy: IfNotPresent
          args:
            - modern_ebpf

See https://github.com/falcosecurity/charts/tree/master/charts/falco#deploying-falco-in-kubernetes (the modern_ebpf part):

helm install falco falcosecurity/falco \
    --create-namespace \
    --namespace falco \
    --set driver.kind=modern_ebpf

@judika03

judika03 commented Jul 16, 2024

I already set the falco-driver-loader initContainer to ebpf, but I don't know why this container reports modern_ebpf. This is the initContainers section of the manifest:

      initContainers:
        - name: falco-driver-loader
          image: docker.io/falcosecurity/falco-driver-loader:0.38.1
          imagePullPolicy: IfNotPresent
          args:
            - ebpf
          securityContext:
            privileged: true

I already installed it with Helm like this:

helm install falco falcosecurity/falco \
    --create-namespace \
    --namespace falco \
    --set driver.kind=ebpf

2024-07-16 07:59:35 ERROR no supported driver found for distro: cos, kernelrelease , kernelversion #1 SMP Thu Jun 27 20:43:36 UTC 2024, arch x86_64

It's the same problem.

@Anthares101

Anthares101 commented Jul 17, 2024

Having a similar issue with a Raspberry Pi cluster after upgrading both the Kernel and Falco. After downgrading the Helm chart to v3.8.4 (Falco 0.36.2) the pods are running and everything looks right again.

@judika03

judika03 commented Jul 17, 2024

I tried downgrading Falco to v0.36.2. The falco-driver-loader:0.36.2 container runs and builds the driver, but the falco container fails with an error (not running). The pods are running on GKE 1.27.13-gke.1070002.

Wed Jul 17 04:20:48 2024: Opening 'syscall' source with BPF probe. BPF probe path: /root/.falco/falco-bpf.o
-- BEGIN PROG LOAD LOG --
processed 43798 insns (limit 1000000) max_states_per_insn 1 total_states 4061 peak_states 4061 mark_read 1921

-- END PROG LOAD LOG --
Wed Jul 17 04:20:51 2024: An error occurred in an event source, forcing termination...
Error: libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

@Anthares101

@Anthares101

Looks like you may have a permissions issue? Some kind of admission control in place/restrictive seccomp config to limit what syscalls the pods are allowed to use perhaps?

At least it looks like the driver compilation failure is fixed. I don't know what changed in the updated Falco chart to cause it.

@alacuku
Member

alacuku commented Jul 17, 2024

@judikag03 @judika03 could you share your helm values file? or at least the variables that you customized, if any.

@judika03

@alacuku this is the falco container manifest:

        - name: falco
          image: docker.io/falcosecurity/falco-no-driver:0.37.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: 1000m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 512Mi
          securityContext:
            privileged: true
          args:
            - /usr/bin/falco
            - --cri
            - /var/run/docker.sock
            - --cri
            - /run/containerd/containerd.sock
            - --cri
            - /run/crio/crio.sock
            - -pk
          env:
            - name: HOST_ROOT
              value: /host
            - name: FALCO_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: FALCO_K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: FALCO_BPF_PROBE
              value: "/root/.falco/falco-probe-bpf.o"
            - name: SYSDIG_BPF_PROBE
              value: "/root/.falco/falco-bpf.o"
          tty: false
          livenessProbe:
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 15
            httpGet:
              path: /healthz
              port: 8765
          readinessProbe:
            initialDelaySeconds: 30
            timeoutSeconds: 5
            periodSeconds: 15
            httpGet:
              path: /healthz
              port: 8765

@alacuku
Member

alacuku commented Jul 17, 2024

No, I mean the values.yaml file or the helm command you are using to install falco.

@judika03

judika03 commented Jul 17, 2024

I use this command to install Falco:

helm install falco falcosecurity/falco \
    --create-namespace \
    --namespace falco \
    --set driver.kind=ebpf

To verify again, I checked the manifest:

        - name: falco
          image: docker.io/falcosecurity/falco-no-driver:0.37.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              cpu: 1000m
              memory: 1024Mi
            requests:
              cpu: 100m
              memory: 512Mi
          securityContext:
            privileged: true
          args:
            - /usr/bin/falco
            - --cri
            - /var/run/docker.sock
            - --cri
            - /run/containerd/containerd.sock
            - --cri
            - /run/crio/crio.sock
            - -pk
          env:
            - name: HOST_ROOT
              value: /host
            - name: FALCO_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: FALCO_K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: FALCO_BPF_PROBE
              value: "/root/.falco/falco-probe-bpf.o"
          tty: false
          livenessProbe:
            initialDelaySeconds: 60
            timeoutSeconds: 5
            periodSeconds: 15
            httpGet:
              path: /healthz
              port: 8765
          readinessProbe:
            initialDelaySeconds: 30
            timeoutSeconds: 5
            periodSeconds: 15
            httpGet:
              path: /healthz
              port: 8765

This manifest has been tested on a GKE cluster running Kubernetes 1.27.13-gke.1070002 with Container-Optimized OS.

@Anthares101

I have been testing a bit in my cluster, I'm able to install the Falco chart up to the 4.3.1 version. Using any chart version higher than that results in the Falco pods crashing with the initial error reported in this issue while building the driver:

* Setting up /usr/src links from host
2024-07-17 18:19:43 ERROR no supported driver found for distro: debian, kernelrelease , kernelversion #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023, arch aarch64
2024-07-17 18:19:43 ERROR no supported driver found for distro: debian, kernelrelease , kernelversion #1642 SMP PREEMPT Mon Apr  3 17:24:16 BST 2023, arch aarch64

Are you able to replicate this @judika03?

@judika03

judika03 commented Jul 18, 2024

Yes, it's a similar issue when using version 0.38.1 (latest):

2024-07-16 06:39:54 ERROR no supported driver found for distro: cos, kernelrelease , kernelversion #1 SMP Thu Jun 27 20:43:36 UTC 2024, arch x86_64

I tested falco-driver-loader:0.36.2 and it builds the driver successfully, without that issue. But the falco-no-driver:0.36.1 container fails with an error (not running). @Anthares101

@FedeDP
Contributor

FedeDP commented Sep 9, 2024

The problem is that, as can be seen from your outputs, kernelrelease is being detected as empty.
falcoctl uses the standard uname syscall to fetch the kernelrelease (https://github.com/falcosecurity/falcoctl/blob/main/pkg/driver/kernel/kernel_linux.go#L30), unless someone is enforcing an empty kernelrelease (https://github.com/falcosecurity/falcoctl/blob/main/pkg/driver/kernel/kernel_linux.go#L44).

Can you share the kernelrelease from the nodes? uname -r will do the trick!
Thank you, perhaps we got a bug somewhere in that helper function!

@FedeDP
Contributor

FedeDP commented Sep 9, 2024

I think I found the bug: is your kernelrelease similar to 6.1.85+? In that case, our aforementioned helper function is not able to decode it properly, so kernelrelease ends up empty. Will fix it asap :)
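
To illustrate the failure mode, here is a minimal Go sketch of a release parser that tolerates a bare trailing `+`, as in 5.15.146+ or 6.1.85+. The type `KernelRelease`, the function `ParseKernelRelease`, and the regex are assumptions made for this example; they are not the actual helper from falcoctl or driverkit.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// KernelRelease is a hypothetical type for this sketch only.
type KernelRelease struct {
	Full  string // the unmodified input, e.g. "5.15.146+"
	Major int
	Minor int
	Patch int    // 0 when the release has no third numeric component
	Extra string // whatever follows the numbers, e.g. "+" or "-113-generic"
}

// releaseRE captures major.minor[.patch] and keeps any remaining suffix,
// so a lone "+" (as on COS kernels) does not make the match fail.
var releaseRE = regexp.MustCompile(`^(\d+)\.(\d+)(?:\.(\d+))?(.*)$`)

// ParseKernelRelease returns an error instead of silently yielding an empty release.
func ParseKernelRelease(s string) (KernelRelease, error) {
	m := releaseRE.FindStringSubmatch(s)
	if m == nil {
		return KernelRelease{}, fmt.Errorf("unrecognized kernel release %q", s)
	}
	kr := KernelRelease{Full: s, Extra: m[4]}
	var err error
	if kr.Major, err = strconv.Atoi(m[1]); err != nil {
		return KernelRelease{}, err
	}
	if kr.Minor, err = strconv.Atoi(m[2]); err != nil {
		return KernelRelease{}, err
	}
	if m[3] != "" {
		if kr.Patch, err = strconv.Atoi(m[3]); err != nil {
			return KernelRelease{}, err
		}
	}
	return kr, nil
}

func main() {
	// Kernel releases mentioned in this thread, plus a generic Ubuntu-style one.
	for _, s := range []string{"5.15.146+", "6.1.85+", "5.15.0-113-generic"} {
		kr, err := ParseKernelRelease(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d.%d.%d extra=%q\n", kr.Full, kr.Major, kr.Minor, kr.Patch, kr.Extra)
	}
}
```

The design choice here is to fail loudly on an unparseable string rather than return an empty release, since an empty kernelrelease is what produced the misleading "no supported driver found" error.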

@judika03

judika03 commented Sep 9, 2024

I hope this bug can be fixed, thank you

@FedeDP
Contributor

FedeDP commented Sep 9, 2024

falcosecurity/driverkit#355 fixes our kernelrelease matching regex to support COS kernels ;) I also added a test to avoid future failures.
Once that is merged, I will port it to falcoctl, and the next Falco version will be released with a fixed falcoctl (expected by the end of the month!)

@FedeDP
Contributor

FedeDP commented Sep 9, 2024

/milestone 0.39.0

@FedeDP
Contributor

FedeDP commented Sep 9, 2024

Falcoctl PR with the driverkit update: falcosecurity/falcoctl#632

@FedeDP
Contributor

FedeDP commented Sep 10, 2024

@alacuku and I just tested on GKE 1.30.3-gke.1639000 (COS) with kernel 6.1.90+, and the new falcoctl worked fine: we were able to build the ebpf probe.
Therefore this issue will be fixed by Falco 0.39.0!

@judika03

I have tested it too, and have successfully built the driver. Thank you @FedeDP

@FedeDP
Contributor

FedeDP commented Sep 11, 2024

You are welcome, thanks for spotting the bug in the first place :D

@FedeDP
Contributor

FedeDP commented Sep 17, 2024

So, Falco 0.39.0-rc1 is out; this is the first Release Candidate for the new release that is expected to be released in a couple of weeks. Are you willing to test it? @judika03
You can just use the 0.39.0-rc1 image tag ;)

@judika03

I already tested it. It's running well. Thanks!
Screenshot 2024-09-22 at 21 12 24

@FedeDP
Contributor

FedeDP commented Sep 22, 2024

Super Happy to hear it! Thanks for spotting the issue and patiently helping us debug it :)
I think we can safely close this one now!
/close

@poiana poiana closed this as completed Sep 22, 2024
@poiana
Contributor

poiana commented Sep 22, 2024

@FedeDP: Closing this issue.

In response to this:

Super Happy to hear it! Thanks for spotting the issue and patiently helping us debug it :)
I think we can safely close this one now!
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
