
Harvester regression testing #91

Merged · 5 commits merged into main from roman/regression_testing on Oct 4, 2024
Conversation

@RomanIakovlev (Collaborator) commented Sep 18, 2024

This PR is a work in progress. It aims to fix #90 by adding a way to automatically harvest and compare definitions between the prod and dev environments, summarizing and highlighting inconsistencies for a reviewer.

The overall approach I'm taking is to add more integration tests and modify the existing ones in tools/integration. I aim to achieve two improvements over the existing tests:

  1. Currently, only a fixed, hardcoded list of coordinates is tested when harvesting and comparing definitions between the prod and dev environments. I want to add variation to the existing suites by also testing a dynamically obtained list of recently generated definitions from the prod environment.
  2. When comparing definitions between prod and dev, the generated output is not very human-friendly, especially when multiple definitions need to be compared. I want to add a way to summarize and classify the differences between definitions, so that a reviewer can quickly assess whether the current dev environment can be safely promoted to production.

You can find more details about my implementation plan below.

Making the list of tested coordinates dynamic

To add variety to the test data, I plan to add support for querying recently modified definitions from production. Since there's currently no way to query recently modified data from the ClearlyDefined API, I've added an Azure Function (see tools/harvester-forwarding/src/functions/getRecentDefinitions.js in this PR) to query CosmosDB and get a number of definitions for each type (npm, maven, etc.) that were created or updated recently. The function is called over HTTP and takes two parameters: days, for how many days back to look, and limit, for how many records of each type to collect.
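For illustration, here is a minimal sketch of what such a function could look like, assuming the @azure/functions v4 and @azure/cosmos SDKs; the database and container names, the type list, and the exact query shape are assumptions rather than the PR's actual code:

const { app } = require('@azure/functions')
const { CosmosClient } = require('@azure/cosmos')

const TYPES = ['npm', 'maven', 'pypi', 'nuget', 'git'] // assumed list of coordinate types

app.http('getRecentDefinitions', {
  methods: ['GET'],
  authLevel: 'function',
  handler: async request => {
    const days = Number(request.query.get('days') || 1)
    const limit = Number(request.query.get('limit') || 10)
    const cutoff = new Date(Date.now() - days * 24 * 60 * 60 * 1000).toISOString()
    const container = new CosmosClient(process.env.COSMOS_CONNECTION_STRING)
      .database('clearlydefined') // assumed database name
      .container('definitions') // assumed container name
    const coordinates = []
    for (const type of TYPES) {
      // Collect up to `limit` recently created/updated definitions per type,
      // filtering on the _meta.updated timestamp.
      const { resources } = await container.items
        .query({
          query:
            'SELECT TOP @limit c.coordinates FROM c WHERE c.coordinates.type = @type AND c._meta.updated >= @cutoff',
          parameters: [
            { name: '@limit', value: limit },
            { name: '@type', value: type },
            { name: '@cutoff', value: cutoff }
          ]
        })
        .fetchAll()
      coordinates.push(...resources)
    }
    return { jsonBody: coordinates }
  }
})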

I plan to use the output of that function in addition to the fixed coordinates list. I think we should have both fixed and dynamic lists of coordinates to test, so I plan to parameterize the existing tests to pick either the static or the dynamic list in a given run, and then add support for running the same tests in a matrix fashion against the different coordinate lists, as sketched below.
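As a sketch of that parameterization (the helper and variable names here are illustrative, not the PR's actual code), the suite could select its coordinate source from an environment variable set by the CI matrix:

// Minimal sketch: pick a static or dynamic coordinate list per test run.
// COORDINATE_SOURCE, fetchRecentCoordinates and the static list location are assumed names.
const staticCoordinates = require('./testConfig').components

async function fetchRecentCoordinates() {
  // Endpoint as used elsewhere in this PR; the response is assumed to be a JSON array of coordinates.
  const response = await fetch(
    'https://cosmos-query-function-app.azurewebsites.net/api/getrecentdefinitions?days=1&limit=1'
  )
  return response.json()
}

async function resolveCoordinates() {
  return process.env.COORDINATE_SOURCE === 'dynamic' ? fetchRecentCoordinates() : staticCoordinates
}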

Helping reviewers with definition comparison

There are existing tests that compare definitions between the prod and dev environments. When results are not identical, the existing tests produce a standard line-by-line diff. It is hard to pinpoint the nature of a change (improvement or regression) by looking at that diff, and if we want to do this for multiple definitions, we need a better way.

For that I've implemented comparison logic that classifies the differences between definitions into three categories: regression, improvement, and inconclusive. It then groups all the differences into these categories and presents the output in the following way:

{
  "overallResult": "inconclusive",
  "differences": {
    "inconclusive": [
      {
        "field": "described.tools",
        "diff": {
          "staging": [
            "clearlydefined/1.2.1",
            "reuse/3.2.1",
            "licensee/9.18.1",
            "scancode/32.3.0"
          ],
          "production": [
            "clearlydefined/1.2.0",
            "licensee/9.14.0",
            "scancode/30.3.0"
          ]
        }
      },
      {
        "field": "licensed.facets.core.discovered.unknown",
        "diff": {
          "staging": 1221,
          "production": 1224
        }
      },
      {
        "field": "licensed.facets.core.discovered.expressions",
        "diff": {
          "staging": [
            "Apache-2.0",
            "BSD-3-Clause",
            "LicenseRef-scancode-dco-1.1"
          ],
          "production": [
            "Apache-2.0",
            "BSD-3-Clause",
            "NOASSERTION"
          ]
        }
      }
    ],
    "improvement": [
      {
        "field": "licensed.facets.core.attribution.parties",
        "diff": {
          "addedElements": [
            "copyright 2013-2017 docker, inc.",
            "copyright (c) 2004, 2006 the linux foundation and its contributors"
          ]
        }
      }
    ]
  }
}

In the output above, you can see there are two categories of differences:

  • inconclusive for paths described.tools, licensed.facets.core.discovered.unknown and licensed.facets.core.discovered.expressions
  • improvement for path licensed.facets.core.attribution.parties, since new attributions were found in dev compared to production.

The paths "_meta", "licensed.score", "licensed.toolScore", "described.score", "described.toolScore" were ignored, since changes there are to be expected.
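For reference, the grouping behind this output could look roughly like the following sketch (the entry shape and names are assumptions about this PR's comparison module, not a verbatim excerpt):

// Sketch: fold per-field comparison results into the summary shape shown above.
// Each entry is assumed to look like { field, result, diff }, where result is
// 'regression', 'improvement' or 'inconclusive'.
function summarizeDifferences(entries) {
  const differences = {}
  for (const { field, result, diff } of entries) {
    if (!differences[result]) differences[result] = []
    differences[result].push({ field, diff })
  }
  // Any regression dominates the overall verdict; otherwise inconclusive wins over improvement.
  const overallResult = differences.regression
    ? 'regression'
    : differences.inconclusive
      ? 'inconclusive'
      : differences.improvement
        ? 'improvement'
        : 'identical'
  return { overallResult, differences }
}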

I believe this output format is easier to read and understand. I plan to add a new test to the existing tools/integration/test/integration/e2e-test-service/definitionTest.js suite to produce such output for the definitions under test.

Please let me know what you think about these changes. If I get a green light I'll proceed with converting this PR into its final version.

CC @qtomlinson @elrayle

@ljones140 (Contributor) left a comment:

This looks great.

I would like to see some documentation of these tests when you come to productionise.

};
}

function compareValues(val1, val2, ignoredKeys, path) {
@ljones140 (Contributor) commented:

This looks great.
I would get frustrated having to deal with JS type, null and undefined comparisons 😄


if (results.includes('improvement')) {
  return 'improvement';
}
@ljones140 (Contributor) commented:

I really like the way you've thought about this.
Having these specific categories really helps the person who sees the results.

},
});

async function getData(context, days, limitPerType) {
@ljones140 (Contributor) commented:

Probably very unlikely it would happen: if the MongoDB schema were to change, I assume this test would fail, and then whoever looked at it would have to update it.

But I doubt it will change this far into the project.

@@ -14,19 +14,16 @@ describe('Validation definitions between dev and prod', function () {
//Rest a bit to avoid overloading the servers
afterEach(() => new Promise(resolve => setTimeout(resolve, definition.timeout / 2)))

describe('Validation between dev and prod', function () {
before(() => {
loadFixtures().forEach(([url, definition]) =>
@qtomlinson (Collaborator) commented Sep 25, 2024:

Could you please elaborate on the reason why loadFixture is removed? Some definitions differ between Dev and Prod for the following reasons:

  • Fixes implemented in Dev, but the component is not reharvested on Prod. When the component is eventually correct on Prod, the fixture can be removed.
  • Feature implemented in Dev but not yet deployed to Prod, e.g. LicenseRef.

@RomanIakovlev (Collaborator, author) commented:

I've removed the fixtures because their purpose was not clear to me. It's not obvious to me what kind of assurance we expect to get from a test that compares the real (dev) definition to a modified (mocked prod) definition.

If we expect the definitions to be different, due to a bug fix or a new feature, let's assert that in a test, for each definition separately. Mocking and expecting equality, as is done now, seems to work in the opposite direction, glossing over the expected difference.

I admit I might be misunderstanding the purpose of this whole test, so maybe I should leave it as is and create a new one that does the comparison using the new comparison function? However, my initial approach was to modify this test, and I've removed the mocks to make sure I compare the real data.

@qtomlinson (Collaborator) commented:

The goal of the integration tests is to detect any breaking changes. For example, after the recent ScanCode upgrade was completed, the integration tests showed differences: some were improvements, and some were regressions (see clearlydefined/service#1056 (comment)). For the improvements that are not in the Production deployment, vetted improved definition files can be put into fixtures, allowing for a successful run before the Production deployment is updated. Ideally, after the production deployment is updated, the components whose definitions can be fixed by the new deployment can be re-harvested and the fixtures can be removed, as the definitions in dev and prod should be in sync again. @elrayle, feel free to add anything I've missed here.
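For illustration, fixtures of this shape are often wired up roughly like this; whether this suite actually uses nock, and the exact loadFixtures shape, are assumptions based on the diff above:

const nock = require('nock')

// Assumed shape: loadFixtures() yields [url, definition] pairs, as in the diff above.
// Each vetted definition is served in place of the live production response.
loadFixtures().forEach(([url, definition]) => {
  const { origin, pathname } = new URL(url)
  nock(origin).get(pathname).query(true).reply(200, definition)
})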

@RomanIakovlev (Collaborator, author) commented:

Thanks for the information @qtomlinson. I've decided to keep this test unchanged and add my structured diff comparison as a separate step.

@elrayle (Collaborator) commented:

It sounds like this is where you already landed. I'll add my thoughts for clarity. I like having both fixture-based tests and dynamic tests: fixed catches regressions where we expect the same results every time, while dynamic provides a broader sweep, with the goal of increasing confidence or identifying systemic problems with a proposed release.

return JSON.parse(data);
} catch (err) {
// If the file doesn't exist, fetch the data and save it to disk
const response = await fetch('https://cosmos-query-function-app.azurewebsites.net/api/getrecentdefinitions?days=1&limit=1');
@qtomlinson (Collaborator) commented Sep 25, 2024:

Another source of recently harvested coordinates exists at the status endpoint: https://dev-api.clearlydefined.io/status/recentlycrawled. The response format is:

[
	{
		"coordinates": "go/golang/github.com%2Fazure/azure-sdk-for-go/v43.3.0+incompatible",
		"timestamp": "2024-09-25T22:36:22.017Z"
	},
	{
		"coordinates": "go/golang/github.com%2Fazure%2Fgo-autorest/autorest/v0.11.24",
		"timestamp": "2024-09-25T22:10:52.752Z"
	}
]

The internal logic is in service/statusService. It utilizes Application Insights, so it might be cheaper than a Cosmos query?

@RomanIakovlev (Collaborator, author) commented:

Thanks for the info, I wasn't aware of this endpoint. However, it doesn't give us enough data, as you pointed out, so we'll have to rely on some other mechanism.

I think querying CosmosDB should not increase the cost much, if at all, because it's only done rarely and touches a limited amount of data.

@qtomlinson (Collaborator) commented:

The approach looks good. Another idea just occurred to me: the search query in getRecentDefinitions is based on _meta.updated, which is also what the change publication is based on. The changed coordinates published hourly could potentially be used to provide the recent coordinates by day and by type (through sorting). Just thought to mention it as an idea.

@RomanIakovlev (Collaborator, author) commented:

Right, I also thought of using the changes notifications mechanism for getting the recent definitions. For the relatively simple use case presented here, the data in the changes notifications would be sufficient.

However, I have plans to add more data to these tests going forward. For example, one thing we'd be interested in is whether we're getting rid of OTHER and NOASSERTION license entries when migrating to ScanCode's LicenseRefs. For that we'd have to make a more elaborate query, and the data from the changes notifications mechanism would no longer be sufficient. We'd have to query the database, using the same approach as presented in this PR (an Azure HTTP Function with CosmosDB access).
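For example, a query along these lines could surface such definitions (a sketch only; the field paths are inferred from the definition output shown earlier in this PR, not taken from actual code):

// Sketch: find recently updated definitions that still carry OTHER or NOASSERTION
// among their discovered license expressions.
const querySpec = {
  query:
    'SELECT TOP @limit c.coordinates FROM c ' +
    'WHERE c._meta.updated >= @cutoff ' +
    "AND (ARRAY_CONTAINS(c.licensed.facets.core.discovered.expressions, 'NOASSERTION') " +
    "OR ARRAY_CONTAINS(c.licensed.facets.core.discovered.expressions, 'OTHER'))",
  parameters: [
    { name: '@limit', value: 20 },
    { name: '@cutoff', value: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000).toISOString() }
  ]
}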

Do you have some concerns about the proposed database querying mechanism, @qtomlinson?

@qtomlinson (Collaborator) commented:

Yes, a DB query is more extensible. Thanks for the explanation and clarification!

@qtomlinson (Collaborator) commented Sep 25, 2024:

@RomanIakovlev The general approach looks good! It's great to see improvements and enhancements to the definition comparison!
Two things to consider:

@RomanIakovlev changed the title from "Initial version of harvester regression testing" to "Harvester regression testing" on Sep 27, 2024
@RomanIakovlev marked this pull request as ready for review on September 27, 2024 14:30
@RomanIakovlev force-pushed the roman/regression_testing branch from d3b5416 to 4361fd6 on September 27, 2024 14:33
@RomanIakovlev (Collaborator, author) commented:
@qtomlinson Thanks for your feedback, Qing. Regarding the point of including scores in the structured diff, I'd prefer to keep the ignored keys as-is for now, mainly for the sake of readability of the output.

I agree looking into the scores might be necessary sometimes, but I don't think we need that all the time, only in those special cases when there are other, more significant changes (e.g., a copyright detection difference). For those times, we can manually run the diff with another set of keys. It might be worth having the list of ignored keys taken as a workflow parameter, but I'd prefer to add that as a separate change in the future.

@RomanIakovlev force-pushed the roman/regression_testing branch from 4361fd6 to 601b3a8 on September 27, 2024 14:52
@RomanIakovlev force-pushed the roman/regression_testing branch from 601b3a8 to f93ec6a on September 27, 2024 15:07
@elrayle (Collaborator) left a comment:

Thanks for the substantial contribution to testing and for raising confidence in release candidates. It appears to me that the one question that required resolution has been addressed.

@qtomlinson (Collaborator) left a comment:

Awesome work adding more integration tests. Minor edits can be put in a separate PR.

-const status = await harvestTillCompletion(components)
+const recentDefinitions = await getComponents()
+console.info(`Recent definitions: ${recentDefinitions}`)
+const status = await harvestTillCompletion(recentDefinitions)
@qtomlinson (Collaborator) commented:

nit: naming? recentDefinitions, as these can also be static component coordinates.

}
}

function compareValues(val1, val2, ignoredKeys, path) {
@qtomlinson (Collaborator) commented:

nit, naming: val2 seems to be the expected/baseline value, based on which regression or improvement is classified.

}

function isEmpty(value) {
if (value === null || value === undefined) return true
@qtomlinson (Collaborator) commented:

empty string?

return false
}
}
return true
@qtomlinson (Collaborator) commented:

nit: similar to return isSuperset(setB, setA)?

@elrayle (Collaborator) commented Oct 2, 2024:

Are you suggesting merging the two methods since they are basically the same? I would definitely see that as a follow-on PR.

@qtomlinson (Collaborator) commented:

Definitely can be a separate PR. The logic seems to be similar to:

function isSubset(setA, setB) {
  return isSuperset(setB, setA)
}
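For context, a common isSuperset counterpart looks like this (the PR's actual version may differ):

function isSuperset(setA, setB) {
  // True when every element of setB is also present in setA.
  for (const elem of setB) {
    if (!setA.has(elem)) return false
  }
  return true
}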

return handleLargeArrays(val1, val2, path, 'inconclusive')
}

function handleLargeArrays(val1, val2, path, result) {
@qtomlinson (Collaborator) commented:

nit: naming? other types in addition to arrays are handled here

if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))

@qtomlinson (Collaborator) commented:

nit: toLowerCase after stringify?

@@ -0,0 +1,224 @@
{
@qtomlinson (Collaborator) commented:

Question: is this file for documentation purposes?

@elrayle (Collaborator) left a comment:

I added some suggested changes for most of @qtomlinson's questions. @RomanIakovlev, what do you think about the suggested changes?

}
}

function compareValues(val1, val2, ignoredKeys, path) {
@elrayle (Collaborator) commented:

Perhaps...

Suggested change
function compareValues(val1, val2, ignoredKeys, path) {
function compareValues(devActual, expected, ignoredKeys, path) {

}

function isEmpty(value) {
if (value === null || value === undefined) return true
@elrayle (Collaborator) commented:

Suggested change
if (value === null || value === undefined) return true
if (value === null || value === undefined || value === '') return true

Comment on lines +117 to +119
if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
@elrayle (Collaborator) commented:

Suggested change
if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
if (Array.isArray(val1) && Array.isArray(val2)) {
const set1 = new Set(val1.map(item => JSON.stringify(item).toLowerCase()))
const set2 = new Set(val2.map(item => JSON.stringify(item).toLowerCase()))

Comment on lines +13 to +15
const recentDefinitions = await getComponents()
console.info(`Recent definitions: ${recentDefinitions}`)
const status = await harvestTillCompletion(recentDefinitions)
@elrayle (Collaborator) commented:

Suggested change
const recentDefinitions = await getComponents()
console.info(`Recent definitions: ${recentDefinitions}`)
const status = await harvestTillCompletion(recentDefinitions)
const targetDefinitions = await getComponents()
console.info(`Recent definitions: ${targetDefinitions}`)
const status = await harvestTillCompletion(targetDefinitions)

Didn't check to see if the constant is used beyond these lines.

const set1 = new Set(val1.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))
const set2 = new Set(val2.map(item => (typeof item === 'string' ? item.toLowerCase() : JSON.stringify(item))))

if (isSuperset(set1, set2)) {
Collaborator commented:

When comparing arrays, in certain cases a superset might not be an improvement. For example, for the file section in the definitions, the expected outcome is that the exact file paths listed in production are detected in the dev deployment.

@RomanIakovlev (Collaborator, author) commented:
Thanks @elrayle and @qtomlinson for the reviews and change suggestions. As the requested changes are not blocking, I'd rather merge this PR as it is now and address them in a separate one.

@RomanIakovlev merged commit bfb99d7 into main on Oct 4, 2024
2 checks passed
@RomanIakovlev deleted the roman/regression_testing branch on October 4, 2024 15:32