
Implement framework to validate backwards compatibility of metrics #6278

Open
5 tasks
yurishkuro opened this issue Nov 29, 2024 · 2 comments
Labels
enhancement · good first issue · help wanted · v2

Comments

@yurishkuro (Member) commented Nov 29, 2024

Jaeger binaries produce various metrics that can be used to monitor Jaeger itself in production, such as throughput, queue saturation, error rates, etc. We historically treated those metrics as a stable public API (we even provide a Grafana dashboard mixin), but we never actually had proper integration tests that validate that changes to the code do not break the metrics.

Proposal

We can enhance our existing integration tests to also compare the metrics. The full set of all possible metrics is never available from a single run, because if some components are not utilized (e.g. adaptive sampling or a particular storage type) then their metrics will not be registered. However, since our integration tests cover most of the components, we can scrape the metrics endpoint at the end of each test and combine the results into a full picture.

Caveat: the exact shape of the metrics depends on all the nested namespaces that could be applied to the metrics.Factory, so it is sensitive to the exact code in the main functions, which is where metrics.Factory always originates. Our integration tests for Jaeger v2 usually use the actual binary, so the resulting metrics will reflect how that binary behaves in production. But all integration tests for v1 are run from a unit testing framework, and the metrics.Factory initialization may not match how it's done in the main functions. So we may only be able to solve this for Jaeger v2, which is fine.
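As a rough sketch, the scraping could be a shell step that snapshots the Prometheus endpoint exposed by the binary once a test finishes. The port, path, and output file name below are assumptions and depend on how the binary's telemetry is configured:

      - name: Snapshot metrics
        # Assumption: the binary exposes Prometheus-format metrics on
        # localhost:8888/metrics; adjust to the actual telemetry config.
        run: curl -s http://localhost:8888/metrics > metrics_snapshot.txt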

Approach

  • At the end of each integration test (CIT) workflow we scrape the metrics collected by the binary and upload them as a GitHub artifact (see the sketch after this list).
  • Then the final workflow can gather all those artifacts and compare them with similar reports from the latest release. If differences are found it can upload them as another artifact and link to the PR so that maintainers can inspect the changes and decide if they are acceptable.
  • The artifacts uploaded for the official release can also be referenced from the documentation website as a way of documenting the current collection of metrics.
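
The sketch below illustrates one possible shape of this in GitHub Actions: each CIT job uploads its snapshot, and a final job gated on needs downloads everything and diffs it against a baseline. Job names, artifact names, and the compare_metrics.sh helper are illustrative assumptions, not existing code:

jobs:
  cit-badger:                     # one job per existing CIT variant
    runs-on: ubuntu-latest
    steps:
      # ... run the integration test, producing metrics_snapshot.txt ...
      - name: Upload metrics snapshot
        uses: actions/upload-artifact@v4
        with:
          name: metrics-badger    # illustrative artifact name
          path: metrics_snapshot.txt

  compare-metrics:
    runs-on: ubuntu-latest
    needs: [cit-badger]           # plus every other CIT job
    steps:
      - uses: actions/checkout@v4
      - name: Download all metrics snapshots
        uses: actions/download-artifact@v4
        with:
          path: metrics           # downloads every artifact from this run
      - name: Diff against latest release
        # Assumption: a baseline snapshot is published with each release;
        # compare_metrics.sh is a hypothetical helper script.
        run: ./scripts/compare_metrics.sh metrics/ baseline/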

Help Wanted

We seek community help to implement this functionality. This is not a 30-minute fix, but we still marked it as a good-first-issue because it can be done incrementally.

Tasks

  • Currently all CIT workflows are independent. In order to have a final job that runs once all CIT jobs are finished, we may need to combine them all into a single CIT workflow with multiple jobs.
  • The ability to scrape and compare metrics was implemented in [V2] Add Script for metrics markdown table #5941. We need to integrate the scraping into each CIT workflow (using a helper script) and upload the output as workflow artifacts.
  • Implement the validation job in a workflow that compares artifacts from the current PR with those from the latest release and generates a diff (also uploaded as a separate artifact).
  • Make the validation job post some form of summary as a comment on the PR (or as the output of the Check); see the sketch after this list.
  • Implement a way to incorporate the metrics report into the documentation website.
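
For the summary step mentioned above, one low-friction option is to write the diff into the job summary, which surfaces on the PR's Check without extra permissions. A minimal sketch, assuming the comparison step produced a hypothetical metrics_diff.md file:

      - name: Publish metrics diff summary
        if: always()
        run: |
          echo "### Metrics compatibility report" >> "$GITHUB_STEP_SUMMARY"
          # metrics_diff.md is a hypothetical output of the comparison step
          cat metrics_diff.md >> "$GITHUB_STEP_SUMMARY"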
yurishkuro added the good first issue and help wanted labels on Nov 29, 2024
@Saumya40-codes (Contributor) commented:

Tasks

  • Currently all CIT workflows are independent. In order to have a final job that runs once all CIT jobs are finished, we may need to combine them all into a single CIT workflow with multiple jobs.

Instead of this, can we use the upload-artifact action in each workflow and a single file that uses the download-artifact action?

So we would have one main workflow file that gathers the artifacts from all workflow runs, which might look something like this:

name: Get All Artifact

on:
  push:
    branches:
      - master
  workflow_dispatch:

jobs:
  workflow_1:
    uses: ./.github/workflows/workflow-1.yml

  workflow_2:
    uses: ./.github/workflows/workflow-2.yml

  deploy:
    runs-on: ubuntu-latest
    needs:
      - workflow_1
      - workflow_2
    steps:
      - name: 🔽 Checkout
        uses: actions/checkout@v3

      - name: ⬇️ Download artifact from workflow-1.yml
        uses: actions/download-artifact@v4
        with:
          name: 'metrics-1'
          path: 'deploy'

      - name: ⬇️ Download artifact from workflow-2.yml
        uses: actions/download-artifact@v4
        with:
          name: 'metrics-2'
          path: 'deploy'

      # we will be doing comparison here with the binary

      - name: '📄 Log the content of metrics-1.txt (from workflow-1.yml)'
        run: |
          cat deploy/metrics-1.txt

      - name: '📄 Log the content of metrics-2.txt (from workflow-2.yml)'
        run: |
          cat deploy/metrics-2.txt

I was able to work through this.
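
Note: for the job-level `uses:` calls above to work, the called files would need to opt in to being reusable by declaring `on: workflow_call`, roughly like this (a minimal sketch; the artifact name mirrors the example above):

# .github/workflows/workflow-1.yml
on:
  workflow_call:

jobs:
  run-cit:
    runs-on: ubuntu-latest
    steps:
      # ... run the integration test and write metrics-1.txt ...
      - name: Upload metrics
        uses: actions/upload-artifact@v4
        with:
          name: metrics-1
          path: metrics-1.txt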

yurishkuro added a commit that referenced this issue Nov 30, 2024
## Which problem is this PR solving?
- Towards #6219 

## Description of the changes
- This PR moves the decoration of the span readers with the metrics
factory to inside the storage factory itself, rather than handling it at a
higher level (e.g. all-in-one, query server/extension).
- For v2, the namespacing of the metrics has been moved from the query
extension to the storage extension.
- For v1, the namespacing of the metrics has been moved from the various
binaries to the storage meta-factory.
- This PR contains a few breaking changes as it changes the namespace
under which the following metrics are published:
  - Storage-specific metrics that were being pushed under
`jaeger_query` will now be pushed under `jaeger_storage`
  - Cassandra-specific metrics that were being pushed under `jaeger_cassandra`
and `jaeger_cassandra-archive` will now be pushed under `jaeger_storage`
with the following tags:
    - `kind=cassandra`
    - `role=primary` or `role=archive`
    - `name` with the name of the storage (in jaeger-v2 only)
  - ElasticSearch/OpenSearch-specific metrics that were being pushed under
`jaeger_index_create` will now be pushed under
`jaeger_storage_index_create` with the following tags:
    - `kind=elasticsearch` or `kind=opensearch`
    - `role=primary` or `role=archive`
    - `name` with the name of the storage (in jaeger-v2 only)
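
To illustrate the shape of the rename (the metric name below is purely
hypothetical, not an actual Jaeger series):
  - before: `jaeger_cassandra_some_metric_total`
  - after: `jaeger_storage_some_metric_total{kind="cassandra", role="primary"}`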

## How was this change tested?
- CI
- We've booked #6278 to
implement a framework that will allow for validation of metrics

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Mahad Zaryab <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
@yurishkuro (Member, Author) commented:

Instead of this, can we use the upload-artifact action in each workflow and a single file that uses the download-artifact action?

Why is it "instead"? The point of the task was to merge all the different workflows as different jobs in a single workflow. It's fine to keep them as separate included files (although if you keep them in the workflows dir they may still get executed separately, and I don't think your `uses:` syntax works for workflows).
