Ability to add/enable collections of optional monitors #53
Comments
Adding a single verbose checks option sounds good, but I think they should just be part of logs/warnings instead of being considered for setting the go/no-go signal IMHO. For example, checking whether the master nodes are marked as unschedulable should just be logged as info, as the user might intentionally mark them as schedulable depending on the need. Any check which is taken into account for setting the go/no-go signal should be exposed as an option to the user, so that it can be disabled in case there's a known problem and the user is fine with ignoring it. As we add more checks, the monitor time is going to increase, especially on a large-scale cluster like @mffiedler mentioned. We might want to take a look at making Cerberus checks concurrent - #23. Thoughts?
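To make the concurrency suggestion concrete, here is a minimal Python sketch of running independent checks in parallel. It assumes each check is a zero-argument callable that returns True or False; the function name `run_checks_concurrently` and the check names are hypothetical and not part of Cerberus.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks_concurrently(
    checks: Dict[str, Callable[[], bool]], max_workers: int = 4
) -> Dict[str, bool]:
    """Run each check in a worker thread and return pass/fail per check name.

    Hypothetical helper for illustration only. A check that raises is
    recorded as failed instead of aborting the whole iteration.
    """
    if not checks:
        return {}
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the checks overlap in time.
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = bool(future.result())
            except Exception:
                results[name] = False
    return results
```

Threads fit here on the assumption that most checks spend their time waiting on API calls, so one monitor iteration would take roughly as long as the slowest check rather than the sum of all of them.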
I agree with Naga Ravi. I think the verbose checks should just log information about the current state of the cluster. That should give the user enough information to verify specific checkpoints or to narrow down what went wrong. I definitely think that as we add more checks and options, we are going to need the Cerberus checks to be concurrent.
I would go with the idea of adding one optional collection of "verbose health checks". As there can be a lot of detailed things that could be monitored on a cluster, all the checks which are not taken into account for setting the go/no-go signal can be placed under "verbose health checks" by default. The user can select the checks according to their needs. For example, it is redundant to check in every iteration whether the master nodes are marked as unschedulable. Some things needn't be monitored at all times, and others needn't be monitored in every iteration, since each extra check increases the monitor loop time.
I think we are all in agreement as per the discussion on Slack. The idea is to add a way for Cerberus to run user-provided checks (bring your own checks) and to consider or not consider them when setting the go/no-go signal, based on the requirement of the user. This should accommodate the verbose/optional checks as well, provided the output of the checks is in a format understandable by Cerberus.
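A rough Python sketch of how bring-your-own checks could feed the go/no-go signal is below. Everything in it is an assumption for illustration: the `Check` dataclass, its `affects_signal` flag, and the `evaluate` function are hypothetical names, and the actual output format Cerberus would accept from user checks is not defined in this thread.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Check:
    """Hypothetical wrapper around a user-provided check."""
    name: str
    run: Callable[[], bool]       # returns True when the check passes
    affects_signal: bool = True   # False means log-only (verbose/optional)


def evaluate(checks: Iterable[Check]) -> bool:
    """Return True (go) unless a signal-affecting check fails."""
    go = True
    for check in checks:
        if check.run():
            continue
        if check.affects_signal:
            logging.error("check %s failed, setting no-go", check.name)
            go = False
        else:
            # Verbose checks only report; they never flip the signal.
            logging.info("optional check %s failed (ignored)", check.name)
    return go
```

With this shape, a check such as "are the master nodes schedulable" could be registered with `affects_signal=False`, so that it shows up in the logs without changing the signal.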
This might be stretching the original intent of Cerberus, but I see a trend. As we add additional checks, the patterns we could follow are a) make the new check a default and always run it, b) give the new check an option in the config, or c) introduce the idea of collections of optional checks - or maybe just one optional collection of "verbose health checks" for simplicity.
There are a lot of detailed things that could be monitored on a cluster - whether Cerberus should monitor them is open for discussion (issue #42). As new checks are added, the monitor loop time grows at least linearly with the number of monitored namespaces, and by more when pod checks are included (PR #52).
For discussion: should we identify a core set of critical checks and enable some mechanism for optional/verbose checks without adding a config flag for every one of them?
/cc: @paigerube14 @chaitanyaenr @yashashreesuresh