fix(eventgenerator): use PromQL API for metric types `throughput` and `responsetime` #2890

geigerj0 · 2024-04-24T11:54:27Z

Problem

After log-cache was enabled with 3a, we received a bug report stating that `responsetime` and `throughput` no longer work. It turned out that the envelopes are read via the REST API, which returns at most 100 envelopes per response by default:

https://github.com/cloudfoundry/log-cache-release/tree/main/src#get-apiv1readsource-id
app-autoscaler-release/src/autoscaler/eventgenerator/client/log_cache_client.go

Line 98 in d02b7b8

envelopes, err := c.Client.Read(context.Background(), appId, startTime, filters...)

Using only 100 envelopes to calculate the average throughput and response time for the last x seconds breaks the calculation. The following example shows how the limited set of envelopes prevents throughput (a.k.a. requests per second) from growing beyond 3:

100 envelopes
polling interval 40 seconds
100 / 40 = 2.5, i.e. roughly 3 requests per second (which is obviously wrong because only 100 envelopes were taken into account, not all of them)

The reason this wasn't discovered earlier is that the acceptance test for throughput only checks whether the value grows above a threshold of 1:

app-autoscaler-release/src/acceptance/app/dynamic_policy_test.go

Line 180 in d02b7b8

policy = GenerateDynamicScaleOutPolicy(1, 2, "throughput", 1)

Solution

Make use of the PromQL API of log-cache for throughput and responsetime: https://github.com/cloudfoundry/log-cache-release/tree/main/src#get-apiv1query.

Using the PromQL API shifts the whole metric aggregation logic to log-cache and we can simply consume the results.

src/acceptance/helpers/helpers.go

geigerj0 · 2024-04-25T12:15:48Z

src/autoscaler/envelopeprocessor/envelope_processor.go

 }

 var _ EnvelopeProcessor = &Processor{}

 type Processor struct {
-	logger             lager.Logger
-	collectionInterval time.Duration


removed collectionInterval because it was completely unused in the Processor

geigerj0 · 2024-04-25T12:16:50Z

src/autoscaler/eventgenerator/client/metric_server_client.go

@@ -34,7 +34,7 @@ func NewMetricServerClient(logger lager.Logger, url string, httpClient *http.Cli
 	}
 }
 func (c *MetricServerClient) GetMetrics(appId string, metricType string, startTime time.Time, endTime time.Time) ([]models.AppInstanceMetric, error) {
-	c.logger.Debug("GetMetric")


unrelated: fixed debug message to be aligned with actual function name

geigerj0 · 2024-04-25T12:17:37Z

src/autoscaler/eventgenerator/aggregator/metric_poller.go

@@ -65,7 +65,7 @@ func (m *MetricPoller) retrieveMetric(appMonitor *models.AppMonitor) error {
 	metrics, err := m.metricClient.GetMetrics(appId, metricType, startTime, endTime)
 	m.logger.Debug("received metrics from metricClient", lager.Data{"retrievedMetrics": metrics})
 	if err != nil {
-		return fmt.Errorf("retriveMetric Failed: %w", err)


unrelated: typo fix

geigerj0 · 2024-04-25T12:18:51Z

src/autoscaler/envelopeprocessor/envelope_processor.go

-func (p Processor) GetTimerMetrics(envelopes []*loggregator_v2.Envelope, appID string, currentTimestamp int64) []models.AppInstanceMetric {
-	p.logger.Debug("GetTimerMetrics")
-	p.logger.Debug("Compacted envelopes", lager.Data{"Envelopes": envelopes})
-	return GetHttpStartStopInstanceMetrics(envelopes, appID, currentTimestamp, p.collectionInterval)
-}


We no longer need to process the timer metrics ourselves, since log-cache now does it for us after the switch to a PromQL API call 👼

geigerj0 · 2024-04-25T12:18:59Z

src/autoscaler/envelopeprocessor/envelope_processor.go

@@ -91,10 +83,10 @@ func GetHttpStartStopInstanceMetrics(envelopes []*loggregator_v2.Envelope, appID
 	var metrics []models.AppInstanceMetric

 	numRequestsPerAppIdx := calcNumReqs(envelopes)
-	sumReponseTimesPerAppIdx := calcSumResponseTimes(envelopes)


unrelated: typo fix

geigerj0 · 2024-04-25T12:19:03Z

src/autoscaler/envelopeprocessor/envelope_processor.go


 	throughputMetrics := getThroughputInstanceMetrics(envelopes, appID, numRequestsPerAppIdx, collectionInterval, currentTimestamp)
-	responseTimeMetric := getResponsetimeInstanceMetrics(envelopes, appID, numRequestsPerAppIdx, sumReponseTimesPerAppIdx, currentTimestamp)


unrelated: typo fix

geigerj0 · 2024-04-25T12:20:21Z

src/autoscaler/eventgenerator/client/log_cache_client.go

+	TLSConfig          *tls.Config
+	uaaCreds           models.UAACreds
+	url                string
+	collectionInterval time.Duration


Added collectionInterval to the client after removing it from the Processor (https://github.com/cloudfoundry/app-autoscaler-release/pull/2890/files#r1579368032), since it's required for the PromQL API calls

src/autoscaler/eventgenerator/client/log_cache_client.go

sonarqubecloud · 2024-04-25T13:44:50Z

Quality Gate failed

Failed conditions
46.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

src/acceptance/app/dynamic_policy_test.go

silvestre

Very nice, thank you!

…loggregator mode

# Issue

In #2890 the acceptance tests for `throughput` and `responsetime` have been made more strict. Since then, the acceptance tests covering the legacy scenario of integrating with `loggregator` have been failing in the CI. Note: In production environments the tests have not been failing.

# Fix

As using `loggregator` is no longer recommended, we skip the `throughput` and `responsetime` acceptance tests in the loggregator CI tests.

geigerj0 added 2 commits April 24, 2024 13:46

query via promQL for throughput and responsetime

e856fe7

remove focus

af136f4

geigerj0 added the bug and allow-acceptance-tests labels Apr 24, 2024

geigerj0 added 7 commits April 24, 2024 15:39

extract different retrieval paths into own functions

cbd7350

fix tests and improve logging

788e293

reduce amount of requests a bit

2efe3c5

improve logging

8c57618

make tests a bit stricter

1532d9a

try scaling policy that only downscales if value is in a certain range

eaab914

improve test names

13a0cb8

github-actions bot reviewed Apr 25, 2024

src/acceptance/helpers/helpers.go

geigerj0 added 5 commits April 25, 2024 11:24

fix linter finding

a5d1451

fix stricter scale-in tests

923d8ac

remove unused params in func-signature

ffc7d90

move collectioninterval to logcacheclient, was completely unused in envelopeprocessor 👀

e7ceb74

improve sample

c6b6138

geigerj0 commented Apr 25, 2024

src/autoscaler/eventgenerator/client/log_cache_client.go

improve sample

3fb278a

geigerj0 marked this pull request as ready for review April 25, 2024 13:22

geigerj0 commented Apr 25, 2024

src/autoscaler/eventgenerator/client/log_cache_client.go

CI pipeline is super flaky, run it from scratch one more time 😭

1b3be7f

geigerj0 commented Apr 26, 2024

src/acceptance/app/dynamic_policy_test.go

silvestre approved these changes Apr 26, 2024


geigerj0 merged commit b8a5d64 into main Apr 26, 2024
35 of 36 checks passed

geigerj0 deleted the fix-throughput-responsetime branch April 26, 2024 13:23

silvestre mentioned this pull request May 13, 2024

fix(log_cache_client): Only process HTTPStartStopEvents with peerType client #2928

Merged
