-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add monitoring instrumentation around dhcp-disable
state.
#133
base: master
Are you sure you want to change the base?
Add monitoring instrumentation around dhcp-disable
state.
#133
Conversation
PROBLEM ======= We're tracking reported issues in our environment where Kea servers are acting like they are being left disabled by unkonwn causes. Discovering the root cause of that problem has been hampered by poor visibility; * It is normal for servers to go without receiving packets. When the `pkt?-received` counter stops incrementing, we cannot determine reliably if this is due to e.g. DHCP relay agent misconfiguration or a Kea service issue. It frustrates debugging. * Kea does not increment any counter to indicate it is disabled, providing us with no clear monitoring predicate to detect the situation early. * It is consequently hours or days before this is noticed, by which time debug logs have rotated. This results in the situation of there being no outward indication of disabled state. SOLUTION ======== In this change, we resolve our visibility problems; * A new `pkt?-raw-received` counter is incremented at the earliest point of execution we know a packet has been received (in `runOne()`). * A new `pkt?-dhcp-disabled` counter counts packets dropped due to the service being in a disabled state at the time of reception. * A WARN level log is emitted for packets dropped while disabled. To alleviate spam concerns, a 1-in-100 sample is used. * To facilitate this change, the `StatsMgr` now provides a `getInteger()` method. * `NetworkState` internal boolean constructs are exposed as monitoring metrics. This allows us to observe the normal process of disabling service during e.g. HA reconnects and recovery, and separately detect extended unexpected disabled states prior to sufficient packet level impacts (consider e.g. DHCP anycast servers which may not receive packets to count as disabled for extended periods - we want to know it had been disabled for several days before the anycast route changes, not after). This "belt and suspenders" approach should close our monitoring gaps and provide attention to faulty services as they present. DISCUSSION ========== The `pkt?-raw-received` counters are a glaring admission of guilt, and we should use `pkt?-received` in these places. The exsisting `pkt?-received` counters should be renamed to `pkt?-processed`, as they indicate execution of the respective `processPacket()` function - still a useful monitoring indicator that e.g. the thread handoff succeeded. Probably there should also be a counter for the case where the callback addition failed instead of the present debug log. **HOWEVER**, existing Kea users may rely on the present behavior and names of metrics. A monitoring predicate using the current `pkt?-received` metrics may trigger automatic remediation processes for Kea servers unintentionally left disabled for too long; these predicate configurations would have to change or else the remediation would be defeated. This may not be an ideal user experience. Consequently, I've offered the unfortunate solution you see here; `pkt?-raw-received`, but I'm extremely unhappy with it; I'd be happy to modify the situation to whatever ISC prefers to offer their users in this case.
Added gitlab#3228 to track this. |
Ah, I adapted this from a patch on Kea 2.2, and thought I had an easy way to adapt for the new origin. But now I see the number of origins for HA groups is expected to be more numerous ( |
Hub and spoke model for HA is about to go out in 2.5.5 in a week. |
The contains() method was added in C++20, while count() will always return 0 or 1 and goes back to C++11.
I am unhappy with my brute force solution to network_state monitoring; it turns the average O(constant) disable / enable by origin into a O(D) operation for D disabled origins. This may be temporarily satisfactory considering our expectation for small populations of disabled origins during routine operation (at e.g. startup we may experience limit-populations (2,001 currently?)). After thinking about it, to resolve the complexity issue I would need feedback on which approach to take; Add Support for Metrics with Labels and Lambdas.Using the popular Prometheus data model for time series data as our example, we refer to a single metric by name which can have multiple values differentiated by collections of labels; https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels For this solution, which would be a large undertaking bear in mind, we would first extend the isc::stats module:
Then, we repurpose the O(D) This removes all the non-constant complexity from network_state's normal operation and shifts it to once-a-metric-sample, and compared with some other solutions would require the NetworkState mutex less frequently (but for longer). Implement Lambdas and Move the monitoring of HA group disable states into the HA hook.We add a method to network_state to test if a specific Origin is disabled or not ( We borrow again the idea to instrument metrics by Lambda callback functions; in the HA hook, for each HA group, we create a monitoring metric by name ( This check is on average O(constant) and would be executed when the metric is sampled (with H metrics), and each metric sampled in this way would wind up briefly taking the NetworkState mutex independently. Reorganize the
|
PROBLEM
We're tracking reported issues in our environment where Kea servers are acting like they are being left disabled by unkonwn causes. Discovering the root cause of that problem has been hampered by poor visibility;
It is normal for servers to go without receiving packets. When the
pkt?-received
counter stops incrementing, we cannot determine reliably if this is due to e.g. DHCP relay agent misconfiguration or a Kea service issue. It frustrates debugging.Kea does not increment any counter to indicate it is disabled, providing us with no clear monitoring predicate to detect the situation early.
It is consequently hours or days before this is noticed, by which time debug logs have rotated.
This results in the situation of there being no outward indication of disabled state.
SOLUTION
In this change, we resolve our visibility problems;
A new
pkt?-raw-received
counter is incremented at the earliest point of execution we know a packet has been received (inrunOne()
).A new
pkt?-dhcp-disabled
counter counts packets dropped due to the service being in a disabled state at the time of reception.A WARN level log is emitted for packets dropped while disabled. To alleviate spam concerns, a 1-in-100 sample is used.
StatsMgr
now provides agetInteger()
method.NetworkState
internal boolean constructs are exposed as monitoring metrics. This allows us to observe the normal process of disabling service during e.g. HA reconnects and recovery, and separately detect extended unexpected disabled states prior to sufficient packet level impacts (consider e.g. DHCP anycast servers which may not receive packets to count as disabled for extended periods - we want to know it had been disabled for several days before the anycast route changes, not after).This "belt and suspenders" approach should close our monitoring gaps and provide attention to faulty services as they present.
DISCUSSION
The
pkt?-raw-received
counters are a glaring admission of guilt, and we should usepkt?-received
in these places. The exsistingpkt?-received
counters should be renamed topkt?-processed
, as they indicate execution of the respectiveprocessPacket()
function - still a useful monitoring indicator that e.g. the thread handoff succeeded. Probably there should also be a counter for the case where the callback addition failed instead of the present debug log.HOWEVER, existing Kea users may rely on the present behavior and names of metrics. A monitoring predicate using the current
pkt?-received
metrics may trigger automatic remediation processes for Kea servers unintentionally left disabled for too long; these predicate configurations would have to change or else the remediation would be defeated. This may not be an ideal user experience.Consequently, I've offered the unfortunate solution you see here;
pkt?-raw-received
, but I'm extremely unhappy with it; I'd be happy to modify the situation to whatever ISC prefers to offer their users in this case.