
Set exact time of first audit #2587

Closed
ikusomr opened this issue Feb 16, 2023 · 16 comments
Labels: enhancement (New feature or request), stale

Comments

ikusomr commented Feb 16, 2023

Describe the solution you'd like
The ability to specify a flag for the audit process that sets the exact time at which it must start the first audit cycle.

Anything else you would like to add:
We mostly rely on the audit part of Gatekeeper's functionality, providing teams across many clusters with audit results. The audit interval is 86400 (once per 24 h). The interval is the same across the clusters, but the time we install or update the audit manager on each cluster is not. We consume audit results by parsing and aggregating logs. Since Gatekeeper is installed at different times on different clusters, the time when audit cycles occur also differs (as far as I understand the code, the first audit cycle takes place only after the audit interval passes - this is how Golang's "time.NewTicker" works). The latter is also inconvenient, because with an interval of 86400 we must wait 24 hours for the first results, or install with a shorter interval and reinstall after getting the results.
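For context, that delayed first cycle is just how the standard Go ticker pattern behaves; a minimal sketch of the pattern (not Gatekeeper's actual audit loop, the interval value is only illustrative):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	interval := 24 * time.Hour // e.g. an interval of 86400 seconds

	// time.NewTicker delivers its first tick only after one full
	// interval has elapsed, so with a 24h interval the first audit
	// below runs ~24h after the process starts, and its wall-clock
	// time depends entirely on when the pod came up.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for t := range ticker.C {
		fmt.Println("audit cycle at", t)
	}
}
```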
Having the flag with exact audit time will give us:

  • consistency of logging time across all the clusters;
  • ability to have almost instant results in case of a long audit interval;
  • when emitting audit results via K8s events, the ability to schedule audits at night and avoid a spike of events that may hurt the Kube API.

Environment:

  • Gatekeeper version: 3.11
ikusomr added the enhancement (New feature or request) label Feb 16, 2023
maxsmythe (Contributor) commented:

To be clear, you want to set a specific start time for audit, correct?

Do you still want audit to run at regular intervals as well? What happens if one cluster takes longer for audit to finish and misses its next start time? Should it run ASAP? Not run at all?

Are users expected to keep the flag up-to-date, or is it "start audit the next time the pod is alive and the clock crosses the START_TIME + N * INTERVAL threshold"? If the latter, what happens if the pod was temporarily down for the most recent tick? Should it wait for the next one?

Is it important for how you collate the results that the start times line up exactly? "Scheduled start time" we can probably get consistent, but "real start time", probably not.

TBH because of the time skew involved with running a process across multiple systems (there is no guarantee all audits will finish at the same time, even if they start at the same time, for example), I'm a bit leery of leaning too heavily into a synchronization approach. That being said, maybe it's only rough correlation you're looking for?

The idea of blackout periods seems like a good one, and probably something we'd use separately from logs.

For consistency, having audit start immediately or having the intervals be defined by something outside of the running process seems like a good idea (e.g. write out a "next audit at" resource to the API server). That would also allow audit to restart if the audit pod was killed mid-previous-run.
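If the schedule were anchored to something outside the running process, the next run could be derived from a fixed start time instead of from pod startup. A rough sketch of that computation, assuming a hypothetical persisted anchor time (nextRun and the 09:00 anchor are illustrative, not an existing Gatekeeper API):

```go
package main

import (
	"fmt"
	"time"
)

// nextRun returns the first point on the start + N*INTERVAL grid
// that is not before now. Persisting this value (the "next audit at"
// idea) would let a restarted audit pod resume the schedule instead
// of waiting a full interval.
func nextRun(start, now time.Time, interval time.Duration) time.Time {
	if !now.After(start) {
		return start
	}
	elapsed := now.Sub(start)
	n := int64(elapsed / interval)
	if elapsed%interval != 0 {
		n++
	}
	return start.Add(time.Duration(n) * interval)
}

func main() {
	start := time.Date(2023, 2, 16, 9, 0, 0, 0, time.UTC) // desired 09:00 anchor
	next := nextRun(start, time.Now().UTC(), 24*time.Hour)
	fmt.Println("next audit at", next)
}
```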

ikusomr (Author) commented Feb 17, 2023

Hi Max,

To be clear, you want to set a specific start time for audit, correct?

I want to set a specific start time for the first cycle. Then it runs periodically using the specified "--auditInterval" flag.

Do you still want audit to run at regular intervals as well? What happens if one cluster takes longer for audit to finish and misses its next start time? Should it run ASAP? Not run at all?

Yes, I expect audit to run at regular intervals specified by the "--auditInterval" flag as usual after the first cycle. Of course there will be a drift across clusters, but it should be no longer than several minutes.

Are users expected to keep the flag up-to-date, or is it "start audit the next time the pod is alive and the clock crosses the START_TIME + N * INTERVAL threshold"? If the latter, what happens if the pod was temporarily down for the most recent tick? Should it wait for the next one?

Users are expected to provide the flag once while installing/updating, understanding that the first cycle will be executed at the specified time and subsequent cycles occur using the specified interval flag. If the pod is temporarily down, the next tick will launch it, as it works now.

Is it important for how you collate the results that the start times line up exactly? "Scheduled start time" we can probably get consistent, but "real start time", probably not.
TBH because of the time skew involved with running a process across multiple systems (there is no guarantee all audits will finish at the same time, even if they start at the same time, for example), I'm a bit leery of leaning too heavily into a synchronization approach. That being said, maybe it's only rough correlation you're looking for?

I do not expect millisecond-precise start/stop times; a drift of several minutes is completely fine. If the first cycle fails, we expect the next one to be executed the next day at the same time (with 86400 as the audit interval). In the long run we expect to have results daily at the same time across the clusters, with a drift of several minutes. E.g. scheduling audit at 9:00 and running a daily log query over the timeframe "from 9:00 to 10:00". To achieve this now, we must install/update Gatekeeper at the exact same time (~9:00) and then wait 24h for the first results, which is much harder than providing a flag.

I agree that our case with daily audits may be pretty unique, but I also believe that providing an exact time for the first start would bring more predictability for longer intervals than the current behavior. It should work pretty well with hourly intervals too.

e.g. write out a "next audit at" resource to the API server...

Great idea, that works for my case and provides even more flexibility than specifying a start flag once.

maxsmythe (Contributor) commented Feb 17, 2023

Yes, I expect audit to run at regular intervals specified by the "--auditInterval" flag as usual after the first cycle. Of course there will be a drift across clusters, but it should be no longer than several minutes.

This depends on how homogeneous the clusters are. With a large size difference, the spread could approach O(hours). This could also be true if a long-running audit gets restarted due to something like a pod eviction.

To achieve this now, we must install/update Gatekeeper at the exact same time (~9:00) and then wait 24h for the first results, which is much harder than providing a flag.

+1, though I think there may be easier approaches than a flag (this sounds similar to how cronjobs work).

I agree that our case with daily audits may be pretty unique

I don't think the scheduling requirement is that out there. I could see users having differing assumptions WRT error recovery, though.

One short-term workaround would be to have audit run as a cron job (of course this would require audit to start immediately).

CronJob may even be a fair permanent solution, though that does mean each audit run would have a bootstrapping delay while the new audit pod loads state (constraints, templates, cached data).

ikusomr (Author) commented Feb 20, 2023

Good point on pod eviction, agree with you.

One short-term workaround would be to have audit run as a cron job (of course this would require audit to start immediately).

Does that mean it is completely fine to have the audit pod exist as a CronJob (only short-term appearances)? I mean, aren't there any heartbeat-style interactions between the audit and controller-manager pods that could break? Scraping metrics from audit looks like it would break, at first glance.

maxsmythe (Contributor) commented Feb 21, 2023

Does that mean it is completely fine to have the audit pod exist as a CronJob

Yes, though there is the cost of repeatedly incurring bootstrapping. Also, the infrastructure would need to be re-architected slightly, since there are other operations that are meant to be long-running singletons running on the pod as-configured, namely status/mutation-status (see this doc for more info).

The "start immediately" functionality doesn't exist, however. Also, I don't think audit has a "run once and exit" mode, so we'd need to add that too, come to think of it.

aren't there any heartbeat-style interactions between the audit and controller-manager pods that could break?

Gatekeeper pods do not currently talk directly to each other, as such a design wouldn't scale as cleanly. There is some indirect coordination that happens via the *PodStatus resources, which are consumed by the status-type operations (see above link).

Scraping metrics from audit looks like it would break, at first glance.

Correct, trying to scrape metrics from a transient pod would be a poor experience. I'd rely on stdout in that case, or scraping the status of constraints if you're okay with potentially partial results.

stek29 (Contributor) commented Mar 22, 2023

I think having audit start immediately after Gatekeeper is ready, instead of starting it after the audit interval passes, would be beneficial for different use cases, and might help with this issue too.

How about adding "start immediately" for now? Should there be a flag for it? If yes, what should the default value be -- start immediately or start after first interval as before?

maxsmythe (Contributor) commented:

IMO we should just switch to starting immediately (immediately being defined as "as soon as all policy is loaded")

IIRC there was some race condition that arose when we tried doing this before. @ritazh Do you remember?
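For illustration, the run-immediately-then-tick behavior would look roughly like this (a sketch only, not the existing audit code, and it does not address the race condition mentioned above):

```go
package main

import (
	"fmt"
	"time"
)

func runAudit() { fmt.Println("audit cycle at", time.Now()) }

func main() {
	interval := time.Hour // stands in for the configured audit interval

	// Run once as soon as all policy is loaded, then on every tick.
	// A bare time.NewTicker loop would instead wait one full interval
	// before the first run.
	runAudit()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		runAudit()
	}
}
```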

stale bot commented May 23, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label May 23, 2023
stek29 (Contributor) commented May 23, 2023

this is still an issue

stale bot removed the stale label May 23, 2023
stale bot commented Jul 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 22, 2023
stek29 (Contributor) commented Jul 24, 2023

this is still an issue

stale bot removed the stale label Jul 24, 2023
sozercan (Member) commented Jul 24, 2023

@ikusomr @stek29 we are introducing pubsub for audit in GK 3.13 (WIP docs PR: https://github.com/open-policy-agent/gatekeeper/pull/2808/files), would this address this issue since you don't have to parse and aggregate logs?

stek29 (Contributor) commented Jul 27, 2023

@sozercan thanks for bringing it up, I’ve totally missed this feature!

I think this issue is still relevant even with PubSub — there are more reasons to start audit without waiting.

One more use case: currently I have to wait for the interval before I can see audit results and metrics after updating Gatekeeper, which makes updates more difficult.
Another use case I can think of is being able to turn audit into a CronJob instead of a Deployment, so it doesn't "waste" resources for someone who wants infrequent audits.

However, I believe it would solve #1037.

stale bot commented Sep 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Sep 25, 2023
stale bot closed this as completed Oct 10, 2023
stek29 (Contributor) commented Oct 15, 2023

this is still an issue.

ritazh reopened this Oct 16, 2023
stale bot removed the stale label Oct 16, 2023
stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 15, 2023
stale bot closed this as completed Dec 30, 2023