Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(eventgenerator): use PromQL API for metric types throughput and responsetime #2890

Merged
merged 16 commits into from
Apr 26, 2024

Conversation

geigerj0
Copy link
Contributor

@geigerj0 geigerj0 commented Apr 24, 2024

Problem

After the log-cache enablement with 3a, we've received a bug stating that responsetime and throughput are no longer working. It turned out, that the envelopes are being read via REST API which has a default of 100 maximum envelopes per response:

Using 100 envelopes to calculate the average throughput and responsetime for the last x seconds breaks the calculations. The following example shows that the limited set of envelopes causes throughput (a.k.a. requests per seconds) to not grow beyond 3:

  • 100 envelopes
  • polling interval 40 seconds
  • 100 / 40 ~= 3 requests per seconds (which is obviously wrong because only 100 envelopes were taken into account and not all)

The reason why this wasn't discovered earlier is because the acceptance test for throughput only checks if the threshold grows above 1:

policy = GenerateDynamicScaleOutPolicy(1, 2, "throughput", 1)

Solution

Make use of the PromQL API of log-cache for throughput and responsetime: https://github.com/cloudfoundry/log-cache-release/tree/main/src#get-apiv1query.

Using the PromQL API shifts the whole metric aggregation logic to log-cache and we can simply consume the results.

@geigerj0 geigerj0 added bug allow-acceptance-tests This label needs to be added to enable the acceptance tests to run. labels Apr 24, 2024
}

var _ EnvelopeProcessor = &Processor{}

type Processor struct {
logger lager.Logger
collectionInterval time.Duration
Copy link
Contributor Author

@geigerj0 geigerj0 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed collectionInterval because it was completely unused in the Processor

@@ -34,7 +34,7 @@ func NewMetricServerClient(logger lager.Logger, url string, httpClient *http.Cli
}
}
func (c *MetricServerClient) GetMetrics(appId string, metricType string, startTime time.Time, endTime time.Time) ([]models.AppInstanceMetric, error) {
c.logger.Debug("GetMetric")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated: fixed debug message to be aligned with actual function name

@@ -65,7 +65,7 @@ func (m *MetricPoller) retrieveMetric(appMonitor *models.AppMonitor) error {
metrics, err := m.metricClient.GetMetrics(appId, metricType, startTime, endTime)
m.logger.Debug("received metrics from metricClient", lager.Data{"retrievedMetrics": metrics})
if err != nil {
return fmt.Errorf("retriveMetric Failed: %w", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated: typo fix

Comment on lines -46 to -50
func (p Processor) GetTimerMetrics(envelopes []*loggregator_v2.Envelope, appID string, currentTimestamp int64) []models.AppInstanceMetric {
p.logger.Debug("GetTimerMetrics")
p.logger.Debug("Compacted envelopes", lager.Data{"Envelopes": envelopes})
return GetHttpStartStopInstanceMetrics(envelopes, appID, currentTimestamp, p.collectionInterval)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need to process the timer metrics ourself since log-cache does it now for us after switching to a PromQL API call 👼

@@ -91,10 +83,10 @@ func GetHttpStartStopInstanceMetrics(envelopes []*loggregator_v2.Envelope, appID
var metrics []models.AppInstanceMetric

numRequestsPerAppIdx := calcNumReqs(envelopes)
sumReponseTimesPerAppIdx := calcSumResponseTimes(envelopes)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated: typo fix


throughputMetrics := getThroughputInstanceMetrics(envelopes, appID, numRequestsPerAppIdx, collectionInterval, currentTimestamp)
responseTimeMetric := getResponsetimeInstanceMetrics(envelopes, appID, numRequestsPerAppIdx, sumReponseTimesPerAppIdx, currentTimestamp)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated: typo fix

TLSConfig *tls.Config
uaaCreds models.UAACreds
url string
collectionInterval time.Duration
Copy link
Contributor Author

@geigerj0 geigerj0 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added collectionInterval to the client after removing it (https://github.com/cloudfoundry/app-autoscaler-release/pull/2890/files#r1579368032) since it's required for the PromQL API calls

@geigerj0 geigerj0 marked this pull request as ready for review April 25, 2024 13:22
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
46.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

Copy link
Member

@silvestre silvestre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you!

@geigerj0 geigerj0 merged commit b8a5d64 into main Apr 26, 2024
35 of 36 checks passed
@geigerj0 geigerj0 deleted the fix-throughput-responsetime branch April 26, 2024 13:23
silvestre added a commit that referenced this pull request May 13, 2024
…loggregator mode

# Issue
In #2890 the acceptance tests for the `throughput` and `responsetime` tests have been made more strict.

Since then the acceptance tests testing the legacy scenario of integrating with `loggregator` have been failing in the CI.

Note: In production environments the tests have not been failing.

# Fix
As using the `loggregator` is no longer recommended, we skip `throughput` and `responsetime` acceptance tests in the loggregator CI tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
allow-acceptance-tests This label needs to be added to enable the acceptance tests to run. bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants