Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add monitoring instrumentation around dhcp-disable state. #133

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Commits on Jan 19, 2024

  1. Add monitoring instrumentation around dhcp-disable state.

    PROBLEM
    =======
    
    We're tracking reported issues in our environment where Kea servers are
    acting like they are being left disabled by unkonwn causes. Discovering
    the root cause of that problem has been hampered by poor visibility;
    
    * It is normal for servers to go without receiving packets. When the
      `pkt?-received` counter stops incrementing, we cannot determine
      reliably if this is due to e.g. DHCP relay agent misconfiguration or a
      Kea service issue. It frustrates debugging.
    
    * Kea does not increment any counter to indicate it is disabled,
      providing us with no clear monitoring predicate to detect the
      situation early.
    
    * It is consequently hours or days before this is noticed, by which time
      debug logs have rotated.
    
    This results in the situation of there being no outward indication of
    disabled state.
    
    SOLUTION
    ========
    
    In this change, we resolve our visibility problems;
    
    * A new `pkt?-raw-received` counter is incremented at the earliest point
      of execution we know a packet has been received (in `runOne()`).
    
    * A new `pkt?-dhcp-disabled` counter counts packets dropped due to the
      service being in a disabled state at the time of reception.
    
    * A WARN level log is emitted for packets dropped while disabled. To
      alleviate spam concerns, a 1-in-100 sample is used.
    
      * To facilitate this change, the `StatsMgr` now provides a
        `getInteger()` method.
    
    * `NetworkState` internal boolean constructs are exposed as monitoring
      metrics. This allows us to observe the normal process of disabling
      service during e.g. HA reconnects and recovery, and separately detect
      extended unexpected disabled states prior to sufficient packet level
      impacts (consider e.g. DHCP anycast servers which may not receive
      packets to count as disabled for extended periods - we want to know it
      had been disabled for several days before the anycast route changes,
      not after).
    
    This "belt and suspenders" approach should close our monitoring gaps and
    provide attention to faulty services as they present.
    
    DISCUSSION
    ==========
    
    The `pkt?-raw-received` counters are a glaring admission of guilt,
    and we should use `pkt?-received` in these places. The exsisting
    `pkt?-received` counters should be renamed to `pkt?-processed`, as
    they indicate execution of the respective `processPacket()` function -
    still a useful monitoring indicator that e.g. the thread handoff
    succeeded. Probably there should also be a counter for the case where
    the callback addition failed instead of the present debug log.
    
    **HOWEVER**, existing Kea users may rely on the present behavior and
    names of metrics. A monitoring predicate using the current
    `pkt?-received` metrics may trigger automatic remediation processes for
    Kea servers unintentionally left disabled for too long; these predicate
    configurations would have to change or else the remediation would be
    defeated. This may not be an ideal user experience.
    
    Consequently, I've offered the unfortunate solution you see here;
    `pkt?-raw-received`, but I'm extremely unhappy with it; I'd be happy to
    modify the situation to whatever ISC prefers to offer their users in
    this case.
    davidhankins committed Jan 19, 2024
    Configuration menu
    Copy the full SHA
    93b65a1 View commit details
    Browse the repository at this point in the history

Commits on Jan 22, 2024

  1. Configuration menu
    Copy the full SHA
    e0392c4 View commit details
    Browse the repository at this point in the history

Commits on Jan 25, 2024

  1. Use std::unordered_set<>.count() instead of .contains().

    The contains() method was added in C++20, while count() will always
    return 0 or 1 and goes back to C++11.
    davidhankins committed Jan 25, 2024
    Configuration menu
    Copy the full SHA
    323435d View commit details
    Browse the repository at this point in the history
  2. Fix compilation errors.

    davidhankins committed Jan 25, 2024
    Configuration menu
    Copy the full SHA
    34e3453 View commit details
    Browse the repository at this point in the history

Commits on Jan 30, 2024

  1. Configuration menu
    Copy the full SHA
    e726921 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    6e0ce9d View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    e534bc9 View commit details
    Browse the repository at this point in the history