diff --git a/pkg/pillar/docs/watcher.md b/pkg/pillar/docs/watcher.md
index 353fd24a92..50e5229505 100644
--- a/pkg/pillar/docs/watcher.md
+++ b/pkg/pillar/docs/watcher.md
@@ -41,3 +41,41 @@ By adaptively triggering garbage collection based on actual memory pressure and
 allocation patterns, we ensure efficient memory usage and maintain system
 performance. This approach helps prevent potential memory-related issues by
 proactively managing resources.
+
+## Goroutine Leak Detector
+
+We have implemented a system to detect potential goroutine leaks by monitoring
+the number of active goroutines over time. This proactive approach helps us
+identify unusual increases that may indicate a leak.
+
+To achieve this, we collect data on the number of goroutines at regular
+intervals within the `goroutinesMonitor` function. However, raw data can be
+noisy due to normal fluctuations in goroutine usage. To mitigate this, we apply
+a moving average to the collected data using the `movingAverage` function. This
+smoothing process reduces short-term variations and highlights longer-term
+trends, making it easier to detect significant changes in the goroutine count.
+
+After smoothing the data, we calculate the rate of change by determining the
+difference between consecutive smoothed values. This rate of change reflects how
+quickly the number of goroutines is increasing or decreasing over time. To
+analyze this effectively, we compute the mean and standard deviation of the rate
+of change using the `calculateMeanStdDev` function. These statistical measures
+provide insights into the typical behavior and variability within our system.
+
+Using the standard deviation, we set a dynamic threshold that adapts to the
+system's normal operating conditions within the `detectGoroutineLeaks` function.
+If both the mean rate of change and the latest observed rate exceed this
+threshold, it indicates an abnormal increase in goroutine count, signaling a
+potential leak. This method reduces false positives by accounting for natural
+fluctuations and focusing on significant deviations from expected patterns.
+
+When a potential leak is detected, we respond by dumping the stack traces of all
+goroutines using the `handlePotentialGoroutineLeak` function. This action
+provides detailed information that can help diagnose the source of the leak, as
+it reveals where goroutines are being created and potentially not terminated
+properly.
+
+To prevent repeated handling of the same issue within a short time frame, we
+incorporate a cooldown period in the `goroutinesMonitor` function. This ensures
+that resources are not wasted on redundant operations and that the monitoring
+system remains efficient.