Ability to add/enable collections of optional monitors #53

mffiedler · 2020-05-14T19:23:02Z

This might be stretching the original intent of Cerberus, but I see a trend. As we add additional checks, the patterns we could follow are: a) make the new check a default and always run it, b) give the new check an option in the config, or c) introduce the idea of collections of optional checks - or maybe just one optional collection of "verbose health checks" for simplicity.

There are a lot of detailed things that could be monitored on a cluster; whether Cerberus should monitor them is open for discussion (issue #42). As new checks are added, the monitor loop time grows at least linearly with the number of monitored namespaces, and more steeply when pod checks are included (PR #52).

For discussion: should we identify a core set of critical checks and enable some mechanism for optional/verbose checks, without adding a config flag for every one of them?
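To make option c) concrete, here is a minimal sketch of how checks could be grouped into collections so that one switch enables a whole group. Everything in it is hypothetical: the `Check` class, the `select_checks` helper, the collection names "core" and "verbose", and the stand-in check functions are illustrations only, not existing Cerberus code or config.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """A single health check (hypothetical structure, not a Cerberus type)."""
    name: str
    run: Callable[[], bool]    # returns True when the component is healthy
    collection: str = "core"   # assumed collection names: "core", "verbose"


def select_checks(checks: List[Check], enabled_collections: List[str]) -> List[Check]:
    """Return the checks to run; the "core" collection is always enabled."""
    enabled = set(enabled_collections) | {"core"}
    return [c for c in checks if c.collection in enabled]


# Stand-in checks; real ones would query the cluster.
checks = [
    Check("nodes_ready", lambda: True),
    Check("masters_unschedulable", lambda: False, collection="verbose"),
]

default_run = [c.name for c in select_checks(checks, [])]
verbose_run = [c.name for c in select_checks(checks, ["verbose"])]
```

With this shape, a new detailed check only needs a collection label, and the user opts in to the whole "verbose" group with one setting rather than one flag per check.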

/cc: @paigerube14 @chaitanyaenr @yashashreesuresh

chaitanyaenr · 2020-05-14T20:39:07Z

Adding a single verbose checks option sounds good, but I think they should just be part of logs/warnings instead of being considered for setting the go/no-go signal, IMHO. For example, checking whether the master nodes are marked as unschedulable should just be logged as info, as the user might intentionally mark them as schedulable depending on the need. Any check that is taken into account for setting the go/no-go signal should be exposed as an option to the user, so that it can be disabled in case there's a known problem and the user is fine with ignoring it.

As we add more checks, the monitor time is going to increase, especially on a large-scale cluster, as @mffiedler mentioned. We might want to take a look at making Cerberus checks concurrent - #23.
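As a rough illustration of what running checks concurrently could look like, here is a minimal sketch using Python's standard `concurrent.futures.ThreadPoolExecutor`. The `run_checks_concurrently` helper and the stand-in check functions are assumptions made up for this example; this is not how Cerberus is implemented, and #23 may settle on a different approach.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_checks_concurrently(
    checks: Dict[str, Callable[[], bool]], max_workers: int = 8
) -> Dict[str, bool]:
    """Run every check in a thread pool and collect pass/fail per check.

    A check that raises is recorded as failed instead of aborting the loop.
    """
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the checks overlap in time.
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = bool(future.result())
            except Exception:
                results[name] = False
    return results


def api_unreachable() -> bool:
    # Stand-in for a check whose cluster query fails outright.
    raise ConnectionError("API server unreachable")


results = run_checks_concurrently(
    {
        "nodes_ready": lambda: True,
        "etcd_pods": lambda: False,
        "api_server": api_unreachable,
    }
)
```

Threads fit here because the checks mostly wait on API calls, so the total loop time tends toward the slowest check rather than the sum of all of them.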

Thoughts?

paigerube14 · 2020-05-15T12:19:19Z

I agree with Naga Ravi. I think the verbose checks should just log information about the current state of the cluster. That should be enough for the user to verify specific checkpoints or narrow down what went wrong.
I think it might be nice to have all the options that feed the go/no-go signal set in the config file, so it's clear these are the checks the user really cares about. The user's own extra verbose checks could then be passed in through command-line options. Thoughts?

I definitely think that as we add more checks and options, we are going to need the Cerberus checks to be concurrent.

yashashreesuresh · 2020-05-19T08:47:21Z

I would go with the idea of adding one optional collection of "verbose health checks". As there can be a lot of detailed things that could be monitored on a cluster, all the checks which are not taken into account for setting the go/no-go signal can be placed under “verbose health checks” by default. The user can select the checks according to their needs. For example, it is redundant to check whether the master nodes are marked as unschedulable in every iteration. Some things needn’t be monitored at all times, and others needn't be monitored in every iteration, since doing so increases the monitor loop time.

chaitanyaenr · 2020-05-20T13:18:17Z

Think we are all in agreement as per the discussion on Slack. The idea is to add a way for Cerberus to run user-provided checks (bring your own checks) and to consider or ignore them when setting the go/no-go signal, based on the user's requirements. This should accommodate the verbose/optional checks as well, provided the output of the checks is in a format understandable by Cerberus.
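One way to picture the "bring your own checks" idea is a small contract that every user-provided check follows, plus a flag saying whether it counts toward the go/no-go signal. The sketch below is only an assumption about what that could look like: the `UserCheck` class, the `gating` flag, the `(healthy, message)` return format, and the `evaluate` function are hypothetical names, not an agreed Cerberus interface.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UserCheck:
    """A user-provided check (hypothetical contract, not a Cerberus API)."""
    name: str
    run: Callable[[], Tuple[bool, str]]  # returns (healthy, message)
    gating: bool = False                 # True: failure flips the signal to no-go


def evaluate(checks: List[UserCheck]) -> bool:
    """Return True for go, False for no-go.

    Non-gating failures are only logged as warnings.
    """
    go = True
    for check in checks:
        healthy, message = check.run()
        if healthy:
            continue
        if check.gating:
            logging.error("%s failed: %s", check.name, message)
            go = False
        else:
            logging.warning("%s reported: %s", check.name, message)
    return go


# Stand-in checks; real ones would inspect the cluster.
verbose_only = [UserCheck("masters_schedulable", lambda: (False, "masters are schedulable"))]
with_gating = verbose_only + [
    UserCheck("custom_operator", lambda: (False, "operator degraded"), gating=True)
]

signal_verbose_only = evaluate(verbose_only)
signal_with_gating = evaluate(with_gating)
```

Here the verbose check only produces a warning and the signal stays go, while the gating check turns it to no-go, which matches the consider/not-consider choice described above.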

mffiedler mentioned this issue May 14, 2020

Verify application pods not scheduled on master nodes #42

Closed

mffiedler added the enhancement New feature or request label May 14, 2020

chaitanyaenr mentioned this issue May 27, 2020

Cerberus scalability issues #67

Closed
