
Set exact time of first audit #2587

Closed
ikusomr opened this issue Feb 16, 2023 · 16 comments
Labels: enhancement (New feature or request), stale

Comments

ikusomr commented Feb 16, 2023

Describe the solution you'd like
The ability to specify a flag for the audit process that sets the exact time at which it must start the first audit cycle.

Anything else you would like to add:
We mostly rely on the audit part of Gatekeeper's functionality, providing teams across many clusters with audit results. The audit interval is 86400 (once per 24 h). The interval is the same across the clusters, but the time we install or update the audit manager on each cluster is not. We consume audit results by parsing and aggregating logs. Since Gatekeeper is installed at different times on different clusters, the time when audit cycles occur also differs (as far as I understand the code, the first audit cycle takes place only after the audit interval passes - this is how Golang's "time.NewTicker" works). The latter is also inconvenient, because with an interval of 86400 we must wait 24 hours for the first results, or install with a shorter interval and reinstall after getting the results.
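For context, that delayed first cycle is just how the standard Go ticker pattern behaves; a minimal sketch of the pattern (not Gatekeeper's actual audit loop, the interval value is only illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 24 * time.Hour // e.g. an interval of 86400 seconds

	// time.NewTicker delivers its first tick only after one full
	// interval has elapsed, so with a 24h interval the first audit
	// below runs ~24h after the process starts, and its wall-clock
	// time depends entirely on when the pod came up.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for t := range ticker.C {
		fmt.Println("audit cycle at", t)
	}
}
```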
Having the flag with exact audit time will give us:

  • consistency of logging time across all the clusters;
  • ability to have almost instant results in case of a long audit interval;
  • when emitting audit results via K8s events, the ability to schedule audits at night and avoid a spike of events that may hurt the Kube API.

Environment:

  • Gatekeeper version: 3.11
ikusomr added the enhancement (New feature or request) label Feb 16, 2023
maxsmythe (Contributor) commented:

To be clear, you want to set a specific start time for audit, correct?

Do you still want audit to run at regular intervals as well? What happens if one cluster takes longer for audit to finish and misses its next start time? Should it run ASAP? Not run at all?

Are users expected to keep the flag up-to-date, or is it "start audit the next time the pod is alive and the clock crosses the START_TIME + N * INTERVAL threshold"? If the latter, what happens if the pod was temporarily down for the most recent tick? Should it wait for the next one?

Is it important for how you collate the results that the start times line up exactly? "Scheduled start time" we can probably get consistent, but "real start time", probably not.

TBH because of the time skew involved with running a process across multiple systems (there is no guarantee all audits will finish at the same time, even if they start at the same time, for example), I'm a bit leery of leaning too heavily into a synchronization approach. That being said, maybe it's only rough correlation you're looking for?

The idea of blackout periods seems like a good one, and probably something we'd use separately from logs.

For consistency, having audit start immediately or having the intervals be defined by something outside of the running process seems like a good idea (e.g. write out a "next audit at" resource to the API server). That would also allow audit to restart if the audit pod was killed mid-previous-run.
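If the schedule were anchored to something outside the running process, the next run could be derived from a fixed start time instead of from pod startup. A rough sketch of that computation, assuming a hypothetical persisted anchor time (nextRun and the 09:00 anchor are illustrative, not an existing Gatekeeper API):

```go
package main

import (
	"fmt"
	"time"
)

// nextRun returns the first point on the start + N*INTERVAL grid
// that is not before now. Persisting this value (the "next audit at"
// idea) would let a restarted audit pod resume the schedule instead
// of waiting a full interval.
func nextRun(start, now time.Time, interval time.Duration) time.Time {
	if !now.After(start) {
		return start
	}
	elapsed := now.Sub(start)
	n := int64(elapsed / interval)
	if elapsed%interval != 0 {
		n++
	}
	return start.Add(time.Duration(n) * interval)
}

func main() {
	start := time.Date(2023, 2, 16, 9, 0, 0, 0, time.UTC) // desired 09:00 anchor
	next := nextRun(start, time.Now().UTC(), 24*time.Hour)
	fmt.Println("next audit at", next)
}
```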

ikusomr (Author) commented Feb 17, 2023

Hi Max,

To be clear, you want to set a specific start time for audit, correct?

I want to set a specific start time for the first cycle. Then it runs periodically using the specified "--auditInterval" flag.

Do you still want audit to run at regular intervals as well? What happens if one cluster takes longer for audit to finish and misses its next start time? Should it run ASAP? Not run at all?

Yes, I expect audit to run at regular intervals specified by the "--auditInterval" flag as usual after the first cycle. Of course there will be a drift across clusters, but it should be no longer than several minutes.

Are users expected to keep the flag up-to-date, or is it "start audit the next time the pod is alive and the clock crosses the START_TIME + N * INTERVAL threshold"? If the latter, what happens if the pod was temporarily down for the most recent tick? Should it wait for the next one?

Users are expected to provide the flag once while installing/updating, understanding that the first cycle will be executed at the specified time and subsequent cycles occur using the specified interval flag. If the pod is temporarily down, the next tick will launch it, as it works now.

Is it important for how you collate the results that the start times line up exactly? "Scheduled start time" we can probably get consistent, but "real start time", probably not.
TBH because of the time skew involved with running a process across multiple systems (there is no guarantee all audits will finish at the same time, even if they start at the same time, for example), I'm a bit leery of leaning too heavily into a synchronization approach. That being said, maybe it's only rough correlation you're looking for?

I do not expect millisecond-precise start/stop times; a drift of several minutes is completely fine. If the first cycle fails, we expect the next one to be executed the next day at the same time (with 86400 as the audit interval). In the long run we expect to have results daily at the same time across the clusters, with a drift of several minutes. E.g. scheduling audit at 9:00 and running a daily log query over the timeframe "from 9:00 to 10:00". To achieve this now, we must install/update Gatekeeper at the exact same time (~9:00) and then wait 24h for the first results, which is much harder than providing a flag.

I agree that our case with daily audits may be pretty unique, but I also believe that providing an exact time for the first start would bring more predictability for longer intervals than the current behavior. It should work pretty well with hourly intervals too.

e.g. write out a "next audit at" resource to the API server...

Great idea, that works for my case and provides even more flexibility than specifying a start flag once.

maxsmythe (Contributor) commented Feb 17, 2023

Yes, I expect audit to run at regular intervals specified by the "--auditInterval" flag as usual after the first cycle. Of course there will be a drift across clusters, but it should be no longer than several minutes.

This depends on how homogeneous the clusters are. With a large size difference, the spread could approach O(hours). This could also be true if a long-running audit gets restarted due to something like a pod eviction.

To achieve this now, we must install/update Gatekeeper at the exact same time (~9:00) and then wait 24h for the first results, which is much harder than providing a flag.

+1, though I think there may be easier approaches than a flag (this sounds similar to how cronjobs work).

I agree that our case with daily audits may be pretty unique

I don't think the scheduling requirement is that out there. I could see users having differing assumptions WRT error recovery, though.

One short-term workaround would be to have audit run as a cron job (of course this would require audit to start immediately).

CronJob may even be a fair permanent solution, though that does mean each audit run would have a bootstrapping delay while the new audit pod loads state (constraints, templates, cached data).

ikusomr (Author) commented Feb 20, 2023

Good point on pod eviction, agree with you.

One short-term workaround would be to have audit run as a cron job (of course this would require audit to start immediately).

Does that mean it is completely fine to have the audit pod exist as a CronJob (only short-term appearances)? I mean, aren't there any heartbeat-style interactions between the audit and controller-manager pods that could break? Scraping metrics from audit looks like it would break, at first glance.

maxsmythe (Contributor) commented Feb 21, 2023

Does that mean it is completely fine to have the audit pod exist as a CronJob

Yes, though there is the cost of repeatedly incurring bootstrapping. Also, the infrastructure would need to be re-architected slightly, since there are other operations that are meant to be long-running singletons running on the pod as-configured, namely status/mutation-status (see this doc for more info).

The "start immediately" functionality doesn't exist, however. Also, I don't think audit has a "run once and exit" mode, so we'd need to add that too, come to think of it.

aren't there any heartbeat-style interactions between the audit and controller-manager pods that could break?

Gatekeeper pods do not currently talk directly to each other, as such a design wouldn't scale as cleanly. There is some indirect coordination that happens via the *PodStatus resources, which are consumed by the status-type operations (see above link).

Scraping metrics from audit looks like it would break, at first glance.

Correct, trying to scrape metrics from a transient pod would be a poor experience. I'd rely on stdout in that case, or scraping the status of constraints if you're okay with potentially partial results.

stek29 (Contributor) commented Mar 22, 2023

I think having audit start immediately after Gatekeeper is ready, instead of starting it after the audit interval passes, would be beneficial for different use cases, and might help with this issue too.

How about adding "start immediately" for now? Should there be a flag for it? If yes, what should the default value be -- start immediately or start after first interval as before?

maxsmythe (Contributor) commented:

IMO we should just switch to starting immediately (immediately being defined as "as soon as all policy is loaded")

IIRC there was some race condition that arose when we tried doing this before. @ritazh Do you remember?
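For illustration, the run-immediately-then-tick behavior would look roughly like this (a sketch only, not the existing audit code, and it does not address the race condition mentioned above):

```go
package main

import (
	"fmt"
	"time"
)

func runAudit() { fmt.Println("audit cycle at", time.Now()) }

func main() {
	interval := time.Hour // stands in for the configured audit interval

	// Run once as soon as all policy is loaded, then on every tick.
	// A bare time.NewTicker loop would instead wait one full interval
	// before the first run.
	runAudit()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		runAudit()
	}
}
```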

stale bot commented May 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 23, 2023
stek29 (Contributor) commented May 23, 2023

this is still an issue

stale bot removed the stale label May 23, 2023
stale bot commented Jul 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 22, 2023
stek29 (Contributor) commented Jul 24, 2023

this is still an issue

stale bot removed the stale label Jul 24, 2023
sozercan (Member) commented Jul 24, 2023

@ikusomr @stek29 we are introducing pubsub for audit in GK 3.13 (WIP docs PR: https://github.com/open-policy-agent/gatekeeper/pull/2808/files), would this address this issue since you don't have to parse and aggregate logs?

stek29 (Contributor) commented Jul 27, 2023

@sozercan thanks for bringing it up, I’ve totally missed this feature!

I think this issue is still relevant even with PubSub — there are more reasons to start audit without waiting.

One more use case: currently I have to wait for the interval before I can see audit results and metrics after updating Gatekeeper, which makes updates more difficult.
Another use case I can think of is being able to turn audit into a CronJob instead of a Deployment, so it doesn't "waste" resources for someone who wants infrequent audits.

However, I believe it would solve #1037.

stale bot commented Sep 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Sep 25, 2023
stale bot closed this as completed Oct 10, 2023
stek29 (Contributor) commented Oct 15, 2023

this is still an issue.

ritazh reopened this Oct 16, 2023
stale bot removed the stale label Oct 16, 2023
stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 15, 2023
stale bot closed this as completed Dec 30, 2023