You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is mostly as a reminder for myself to fix (later...).
As part of the aggregated stats, ePPing keeps some global counters for protocols, which are never reset (although userspace reports difference since last report by calculating the diff). For TCP, UDP and ICMP we use 64-bit counters, and are thereby incredibly unlikely to overflow (even if you're constantly handling 100 Gbps it will take over 46 years for the byte counter to overflow, so will be someone else's problem then). However, for protocols other than TCP/UDP/ICMP we only use 32-bit counters (and only count packets).
One of these "other" protocols is GRE, and based on some recent measurements in a live network we noticed that this counter may end up overflowing if running for multiple months. If the counter overflows, the user space part will start reporting bogus values (calculation of difference from previous report will overflow and result in a very large number). I see 3 ways to fix this:
Change the "other" protocol counters from 32-bit to 64 bit. The entire 256-member array (the possible range for IP protocol numbers) will then grow from 1 kB to 2 kB, which will not be an issue for RAM usage, but may hurt cache performance. Code wise this is the easiest fix as it only needs to change a single line (the definition of the other_ipprotos array in the global_counters struct).
Add GRE to the "named" protocols, giving it 64-bit counters. This will add a tiny amount of extra run-time overhead as an extra if-statement will need to be added to determine which counter to increment. Code wise this is quite simple, but will still require adding a few lines in multiple places to handle the added counter fields, and make the affected parts of the code even more repetitive. This only fixes it for GRE. If we later find some other protocol(s) that risk overflowing, adding them to the "named" fields will also require similar code changes for that protocol as well, so this is not a good solution in the long run if many different protocols turn out to be somewhat common in different deployments.
Add a check for overflow in the user space calculation and gracefully "wrap-around" the diff calculation, as overflow of unsigned integers is well-defined behavior. As the counters should only ever increase (or stay the same) between two consecutive reports, an overflow can easily be detected if the counter was higher in the previous report, and the correct diff can still be calculated. This is based on the assumption that between any two reports the counter may at most overflow once, but that should be the case for the vast majority of use cases (e.g. even if you only report every 10 minutes, you must on average handle over 7 million packets per second of that specific protocol for that to be a risk). This will add a small amount of overhead to the user space component when reporting the aggregated values, but should not impact the performance of eBPF programs at all.
I'm currently in favor of option 3, as it (unlike option 2) will fix this issue for all possible protocols, and (unlike option 1) it cannot harm the performance of the eBPF programs (as both the eBPF program and the datastructure remain unchanged). The overhead of doing a slightly more complicated diff calculation 256 times once per reporting interval in the user space should be negligible.
The text was updated successfully, but these errors were encountered:
This is mostly as a reminder for myself to fix (later...).
As part of the aggregated stats, ePPing keeps some global counters for protocols, which are never reset (although userspace reports difference since last report by calculating the diff). For TCP, UDP and ICMP we use 64-bit counters, and are thereby incredibly unlikely to overflow (even if you're constantly handling 100 Gbps it will take over 46 years for the byte counter to overflow, so will be someone else's problem then). However, for protocols other than TCP/UDP/ICMP we only use 32-bit counters (and only count packets).
One of these "other" protocols is GRE, and based on some recent measurements in a live network we noticed that this counter may end up overflowing if running for multiple months. If the counter overflows, the user space part will start reporting bogus values (calculation of difference from previous report will overflow and result in a very large number). I see 3 ways to fix this:
other_ipprotos
array in theglobal_counters
struct).I'm currently in favor of option 3, as it (unlike option 2) will fix this issue for all possible protocols, and (unlike option 1) it cannot harm the performance of the eBPF programs (as both the eBPF program and the datastructure remain unchanged). The overhead of doing a slightly more complicated diff calculation 256 times once per reporting interval in the user space should be negligible.
The text was updated successfully, but these errors were encountered: