Drivers build not triggered for x86_64 #1158

Open
maxgio92 opened this issue May 17, 2023 · 21 comments

Describe the bug

Not all driver-building postsubmit jobs were triggered after new driverkit configurations had been introduced, as part of the usual weekly crawling result from kernel-crawler.

How to reproduce it

Please check:

Expected behaviour

All the jobs for x86_64 should have been triggered accordingly.


FedeDP commented May 19, 2023

Note: all the jobs have always been triggered correctly during the normal weekly kernel-crawler CI.
This time, however, since we added a new driver version (therefore copy/pasting an existing driver version's configs to a new folder) and removed the oldest one (2.0.0+driver), not all jobs were triggered.

At the same time, the PR CI correctly triggered all presubmit jobs.


FedeDP commented Jul 4, 2023

I might have found a good candidate; note the difference between the presubmit and postsubmit Prow documentation about run_if_changed:

Pre:

Presubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified in any of the commits in the pull request.

Post:

Postsubmit jobs apply run_if_changed and skip_if_only_changed filters based on which files were modified by the commits included in the specific push event from github.

So, postsubmits are based on the specific push event sent by GitHub.
What if, given the large diff (high number of changed files), the push event is somehow truncated, i.e. not all changed files are listed?

Payloads are capped at 25 MB: https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push

Note: Payloads are capped at 25 MB. If your event generates a larger payload, a webhook will not be fired. This may happen, for example, on a create event if many branches or tags are pushed at once. We suggest monitoring your payload size to ensure delivery.

But this is not our case IMHO, because Prow receives the event with a full payload (otherwise it wouldn't be able to decode it); there might, however, be some other hard limit on the number of changed files listed in the event.

Prow's code confirms this behavior: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/push.go#L28: as you can see, it uses the PushEvent data to compute the list of changed files.
Note how presubmits instead use the normal git client to fetch the list of changed files: https://github.com/kubernetes/test-infra/blob/b7c54c4d7991d27ee926d184d3a09c7b7738c65b/prow/plugins/trigger/pull-request.go#L358C37-L358C37 (https://github.com/kubernetes/test-infra/blob/master/prow/config/jobs.go#L537)
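
For reference, the postsubmit path boils down to something like the following (a simplified, self-contained sketch of the linked push.go, not the verbatim Prow source; the structs here stand in for Prow's own webhook event types):

```go
package main

import "fmt"

// Stand-ins for Prow's github webhook types: a push event embeds, per
// commit, the lists of files it added, removed and modified.
type Commit struct {
	Added, Removed, Modified []string
}

type PushEvent struct {
	Commits []Commit
}

// listPushEventChanges mirrors what prow/plugins/trigger/push.go does: the
// changed-file list is rebuilt purely from the commits embedded in the
// webhook payload, never from git. Anything the payload truncates is
// therefore invisible to the run_if_changed / skip_if_only_changed matchers.
func listPushEventChanges(pe PushEvent) []string {
	seen := map[string]bool{}
	var changed []string
	for _, c := range pe.Commits {
		for _, files := range [][]string{c.Added, c.Removed, c.Modified} {
			for _, f := range files {
				if !seen[f] {
					seen[f] = true
					changed = append(changed, f)
				}
			}
		}
	}
	return changed
}

func main() {
	pe := PushEvent{Commits: []Commit{
		{Added: []string{"driverkit/config/5.0.1+driver/x86_64/almalinux_4.18.0.yaml"}},
	}}
	fmt.Println(listPushEventChanges(pe))
}
```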


FedeDP commented Jul 10, 2023

So, the GitHub HTTP API does limit the list of changed files to 3000: https://docs.github.com/en/rest/pulls/pulls?apiVersion=2022-11-28#list-pull-requests-files

This is the method used to fetch pull request changes, therefore I'd expect them to be cut at 3000 files too. Instead, we are experiencing the issue only on pushes to master. Push events have the same 3000-changes hard limit.

I don't get why this works on pull requests but not on pushes to master :/ I'd expect the issue to be present on pull requests too!


FedeDP commented Jul 10, 2023

I now get why it works on presubmits!
Our presubmit jobs are not arch-dependent; the weekly diff mostly touches aarch64 configs, BUT we trigger presubmit jobs for both archs anyway (since they are not arch-dependent).
On postsubmit, instead, since we have separate aarch64 and x86_64 build jobs, we mostly trigger only the aarch64 builds, plus some x86_64 ones, until we hit the 3000-changed-files limit.


FedeDP commented Jul 10, 2023

I see 2 main paths forward:

  • open a PR on Kubernetes Prow to properly fetch the changed files through the git client, using the before and after refs from the GitHub push event (see the sketch below): https://docs.github.com/en/webhooks-and-events/webhooks/webhook-events-and-payloads#push
  • split our workflow to open a PR on test-infra for each crawled distro; this does not require any change to Prow (so it's a much quicker fix), but at the same time it means we would open (and have to check/approve) ~15 PRs each week...
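
For the first option, the Prow-side fix could look roughly like this (a sketch only: it shells out to git and assumes a local clone that already has both SHAs fetched; the real patch would go through Prow's git client instead):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// changedFilesFromPush computes the full diff between the push event's
// `before` and `after` SHAs with git itself, instead of trusting the
// (possibly truncated) commit list embedded in the webhook payload.
func changedFilesFromPush(repoDir, before, after string) ([]string, error) {
	cmd := exec.Command("git", "diff", "--name-only", before+".."+after)
	cmd.Dir = repoDir
	out, err := cmd.Output()
	if err != nil {
		return nil, fmt.Errorf("git diff %s..%s: %w", before, after, err)
	}
	var files []string
	for _, line := range strings.Split(string(out), "\n") {
		if line = strings.TrimSpace(line); line != "" {
			files = append(files, line)
		}
	}
	return files, nil
}

func main() {
	// e.g. diff the last commit of the current checkout
	files, err := changedFilesFromPush(".", "HEAD~1", "HEAD")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```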


FedeDP commented Jul 17, 2023

Adding 2 more possible "fixes":

  • run the update-kernels workflow daily instead of weekly: running it more frequently will generate smaller diffs, so we would be more confident that all jobs get triggered; it also has the benefit that we would be much quicker to support new kernels
  • have a single postsubmit job that triggers whenever the driverkit/config path changes; its entrypoint would diff HEAD~1 on master to manually compute the changed files and then trigger the correct job for each changed <distro,arch> (see the sketch below).
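
For the second option, the entrypoint's dispatch logic could stay small (a sketch assuming the existing driverkit/config/<driverversion>/<arch>/<distro>_... path layout; the actual job-triggering part is left out):

```go
package main

import (
	"fmt"
	"regexp"
)

// target identifies one build job to trigger.
type target struct {
	distro, arch string
}

// configPath captures <arch> and <distro> from paths shaped like
// driverkit/config/<driverversion>/<arch>/<distro>_<kernel...>.
var configPath = regexp.MustCompile(`^driverkit/config/[a-z0-9.+-]{5,}/(x86_64|aarch64)/([a-z0-9]+)_.+`)

// targetsFromDiff reduces a `git diff --name-only HEAD~1` file list to the
// unique set of <distro,arch> pairs whose build jobs must be triggered.
func targetsFromDiff(changedFiles []string) []target {
	seen := map[target]bool{}
	var targets []target
	for _, f := range changedFiles {
		m := configPath.FindStringSubmatch(f)
		if m == nil {
			continue
		}
		t := target{distro: m[2], arch: m[1]}
		if !seen[t] {
			seen[t] = true
			targets = append(targets, t)
		}
	}
	return targets
}

func main() {
	for _, t := range targetsFromDiff([]string{
		"driverkit/config/5.0.1+driver/x86_64/almalinux_4.18.0-425.el8.yaml",
		"driverkit/config/5.0.1+driver/aarch64/fedora_6.2.9-300.yaml",
	}) {
		fmt.Printf("would trigger the %s/%s build job\n", t.arch, t.distro)
	}
}
```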


FedeDP commented Sep 5, 2023

split our workflow to open a PR on test-infra for each crawled distro; this does not require any change to Prow (so it's a much quicker fix), but at the same time it means we would open (and have to check/approve) ~15 PRs each week...

run the update-kernels workflow daily instead of weekly: running it more frequently will generate smaller diffs, so we would be more confident that all jobs get triggered; it also has the benefit that we would be much quicker to support new kernels

Note: these two would not fix the "new driver version added" issue.


FedeDP commented Sep 7, 2023

A short-term solution would be to change all postsubmit triggers to fire on any change to the driverkit/config/... folder, i.e. from this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/almalinux_.+'

to this:

run_if_changed: '^driverkit/config/[a-z0-9.+-]{5,}/x86_64/.+'

In this case, we would trigger ALL jobs every time, even for unmodified configs; each job would then build its own configs.

PROs:

  • super easy solution

CONs:

  • we would spawn useless pods
  • we would spawn more nodes than needed

Assuming that each week we build ~90% of the supported distros, it would not be that impactful (i.e. we would just spawn ~10% useless pods).
The "big" issue would arise when we manually trigger kernel-crawler on a single distro; in that case, instead of just building the new drivers for that distro, we would spawn all pods to build all drivers (whose builds should be skipped anyway, because they are already present on download.falco.org).
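
Side note on the wasted pods: each spawned job could cheaply probe download.falco.org and exit early when its driver already exists (a sketch; the URL layout below is my assumption for illustration, not a confirmed scheme):

```go
package main

import (
	"fmt"
	"net/http"
)

// driverAlreadyPublished issues a HEAD request against the public driver
// bucket so a job can skip a build that is already available.
// NOTE: the URL layout is an assumption for illustration only.
func driverAlreadyPublished(driverVersion, arch, fileName string) (bool, error) {
	url := fmt.Sprintf("https://download.falco.org/driver/%s/%s/%s",
		driverVersion, arch, fileName)
	resp, err := http.Head(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := driverAlreadyPublished("5.0.1+driver", "x86_64",
		"falco_almalinux_4.18.0-425.el8.x86_64_1.ko")
	if err != nil {
		panic(err)
	}
	fmt.Println("already published:", ok)
}
```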

@maxgio92 wdyt?


FedeDP commented Sep 7, 2023

In this case, we would trigger ALL jobs every time, even for unmodified configs; each job would then build its own configs.

We could also let the entrypoint of each job do the HEAD~1 diff and decide whether it should do anything, even if I'd avoid adding too much logic to those scripts.

FedeDP added a commit that referenced this issue Sep 12, 2023
@maxgio92 (Member Author)

A short-term solution would be to change all postsubmit triggers to fire on any change to the driverkit/config/... folder [...] In this case, we would trigger ALL jobs every time, even for unmodified configs; each job would then build its own configs. [...]

@maxgio92 wdyt?

Thank you for this detailed investigation, and sorry for my late response.

I'd try as hard as possible to avoid triggering the jobs for all distros.


maxgio92 commented Sep 14, 2023

What about, instead of reducing the granularity of the grid, increasing the job frequency as you suggested, e.g. from weekly to once every two days, @FedeDP?

PS: in any case, IMHO both are short-term solutions.


FedeDP commented Sep 14, 2023

What about, instead of reducing the granularity of the grid, increasing the job frequency as you suggested, e.g. from weekly to once every two days

That would unfortunately not solve the issue when we add a completely new driver version from scratch.


FedeDP commented Sep 14, 2023

PS: in any case, IMHO both are short-term solutions.

Agree btw :/

FedeDP added a commit that referenced this issue Sep 19, 2023

poiana commented Dec 13, 2023

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale


FedeDP commented Dec 13, 2023

/remove-lifecycle stale


poiana commented Mar 12, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale


FedeDP commented Mar 12, 2024

/remove-lifecycle stale


poiana commented Jun 10, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale


FedeDP commented Jun 11, 2024

/remove-lifecycle stale


poiana commented Sep 9, 2024

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale


FedeDP commented Sep 9, 2024

/remove-lifecycle stale
