Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harvester regression testing #91

Merged
merged 5 commits into from
Oct 4, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 6 additions & 2 deletions .github/workflows/integration-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,10 @@ permissions:
jobs:
test:
runs-on: ubuntu-latest
strategy:
max-parallel: 1
matrix:
dynamicCoordinates: [true, false]
defaults:
run:
working-directory: ./tools/integration
Expand All @@ -29,7 +33,7 @@ jobs:
run: npm test

- name: Trigger harvest and verify completion
run: npm run e2e-test-harvest
run: DYNAMIC_COORDINATES=${{ matrix.dynamicCoordinates }} npm run e2e-test-harvest

- name: Verify service functions
run: npm run e2e-test-service
run: DYNAMIC_COORDINATES=${{ matrix.dynamicCoordinates }} npm run e2e-test-service
Original file line number Diff line number Diff line change
@@ -1,37 +1,5 @@
const { app } = require("@azure/functions");

app.http("compareDefinitions", {
methods: ["POST"],
authLevel: "anonymous",
handler: async (request, context) => {
try {
const {
productionDoc,
stagingDoc,
ignoredKeys = [],
} = await request.json();

if (!productionDoc || !stagingDoc) {
return {
status: 400,
body: "Please provide both productionDoc and stagingDoc in the request body",
};
}

const result = compareDocuments(stagingDoc, productionDoc, ignoredKeys);

return {
status: 200,
jsonBody: result,
};
} catch (error) {
return {
status: 400,
body: "Error processing request: " + error.message,
};
}
}
});
// (c) Copyright 2024, GitHub and ClearlyDefined contributors. Licensed under the MIT license.
// SPDX-License-Identifier: MIT

function compareDocuments(staging, production, ignoredKeys, path = '') {
let differences = {};
Expand All @@ -49,7 +17,7 @@ function compareDocuments(staging, production, ignoredKeys, path = '') {
const currentPath = path ? `${path}.${key}` : key;

if (shouldIgnore(currentPath, ignoredKeys)) {
continue; // Skip comparison for keys that should be ignored
continue;
}

const stagingValue = staging[key];
Expand Down Expand Up @@ -247,4 +215,6 @@ function getType(value) {
if (value === null) return 'null';
if (Array.isArray(value)) return 'array';
return typeof value;
}
}

module.exports = { compareDocuments }
Original file line number Diff line number Diff line change
Expand Up @@ -2,15 +2,16 @@
// SPDX-License-Identifier: MIT

const { callFetch } = require('../../../lib/fetch')
const { devApiBaseUrl, prodApiBaseUrl, components, definition } = require('../testConfig')
const { devApiBaseUrl, prodApiBaseUrl, getComponents, definition } = require('../testConfig')
const { strictEqual } = require('assert')

describe('Validation attachments between dev and prod', function () {
describe('Validation attachments between dev and prod', async function () {
this.timeout(definition.timeout * 2)

//Rest a bit to avoid overloading the servers
afterEach(() => new Promise(resolve => setTimeout(resolve, definition.timeout / 2)))

const components = await getComponents()
components.forEach(coordinates => {
it(`should have the same attachement as prod for ${coordinates}`, () => fetchAndCompareAttachments(coordinates))
})
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
const { omit, isEqual, pick } = require('lodash')
const { deepStrictEqual, strictEqual } = require('assert')
const { callFetch, buildPostOpts } = require('../../../lib/fetch')
const { devApiBaseUrl, prodApiBaseUrl, components, definition } = require('../testConfig')
const { devApiBaseUrl, prodApiBaseUrl, getComponents, definition } = require('../testConfig')
const nock = require('nock')
const fs = require('fs')

Expand All @@ -14,19 +14,16 @@ describe('Validation definitions between dev and prod', function () {
//Rest a bit to avoid overloading the servers
afterEach(() => new Promise(resolve => setTimeout(resolve, definition.timeout / 2)))

describe('Validation between dev and prod', function () {
before(() => {
loadFixtures().forEach(([url, definition]) =>
Copy link
Collaborator

@qtomlinson qtomlinson Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate the reason why loadFixture is removed? Some definitions are different between Dev and Prod for the following reasons:

  • Fixes implemented in Dev, but component is not reharvested on Prod. Eventually when the component is eventually correct on Prod, the fixture can be removed.
  • Feature implemented in Dev but not yet deployed to Prod, e.g. LicenseRef.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the fixtures because their purpose was not clear to me. It's not obvious to me what kind of assurance we expect to get from a test that compares the real (dev) definition to a modified (mocked prod) definition.

If we expect the definitions to be different, due to a bug fix or a new feature, let's assert that in a test, for each definition separately. Mocking and expecting equality, as is done now, seems to work in the opposite direction, glossing over the expected difference.

I admit I might be misunderstanding the purpose of this whole test, so maybe I should leave it as is and create a new one that will do the comparison using the new comparison function? However my initial approach was to modify this test, and I've removed the mocks to make sure I compare the real data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of integration tests is to detect any breaking changes. For example, after the recent ScanCode upgrade was completed, the integration tests showed differences: some were improvements, and some were regressions (See clearlydefined/service#1056 (comment)). For the improvements that are not in the Production deployment, vetted improved definition files can be put into fixtures, allowing for a successful run before the Production deployment is updated. Ideally, after the production deployment is updated, the components whose definitions can be fixed by the new deployment can be re-harvested and the fixture can be removed, as the definitions in dev and prod should be in sync again. @elrayle, Feel free to add if anything is missed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the information @qtomlinson. I've decided to keep this test unchanged and add my structured diff comparison as a separate step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this is where you already landed. I'll add my thoughts for clarity. I like having both, fixture based tests and dynamic tests. Fixed catches regressions where we expect the same results every time. Dynamic provides a broader sweep with a goal of increasing confidence or identify systemic problems with a proposed release.

nock(prodApiBaseUrl, { allowUnmocked: true }).get(url).reply(200, definition)
)
})

describe('Validation between dev and prod', async function () {
const components = await getComponents()
console.info(`Testing definitions for ${JSON.stringify(components)}`)
components.forEach(coordinates => {
it(`should return the same definition as prod for ${coordinates}`, () => fetchAndCompareDefinition(coordinates))
})
})

describe('Validate on dev', function () {
describe('Validate on dev', async function () {
const components = await getComponents()
const coordinates = components[0]

describe('Search definitions', function () {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,11 +3,11 @@

const { deepStrictEqual } = require('assert')
const { callFetch, buildPostOpts } = require('../../../lib/fetch')
const { devApiBaseUrl, prodApiBaseUrl, components, definition } = require('../testConfig')
const { devApiBaseUrl, prodApiBaseUrl, getComponents, definition } = require('../testConfig')
const nock = require('nock')
const fs = require('fs')

describe('Validate notice files between dev and prod', function () {
describe('Validate notice files between dev and prod', async function () {
this.timeout(definition.timeout)

//Rest a bit to avoid overloading the servers
Expand All @@ -20,7 +20,7 @@ describe('Validate notice files between dev and prod', function () {
.reply(200, notice)
})
})

const components = await getComponents()
components.forEach(coordinates => {
it(`should return the same notice as prod for ${coordinates}`, () => fetchAndCompareNotices(coordinates))
})
Expand Down
6 changes: 4 additions & 2 deletions tools/integration/test/integration/harvestTest.js
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// (c) Copyright 2024, SAP SE and ClearlyDefined contributors. Licensed under the MIT license.
// SPDX-License-Identifier: MIT

const { components, devApiBaseUrl, harvest } = require('./testConfig')
const { getComponents, devApiBaseUrl, harvest } = require('./testConfig')
const Poller = require('../../lib/poller')
const Harvester = require('../../lib/harvester')
const { strictEqual } = require('assert')
Expand All @@ -10,7 +10,9 @@ describe('Tests for harvesting different components', function () {
it('should verify all harvests are complete', async function () {
this.timeout(harvest.timeout)
console.time('Harvest Test')
const status = await harvestTillCompletion(components)
const recentDefinitions = await getComponents()
console.info(`Recent definitions: ${recentDefinitions}`)
const status = await harvestTillCompletion(recentDefinitions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: naming? recentDefinitions, these can be static component coordinates.

Comment on lines +13 to +15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const recentDefinitions = await getComponents()
console.info(`Recent definitions: ${recentDefinitions}`)
const status = await harvestTillCompletion(recentDefinitions)
const targetDefinitions = await getComponents()
console.info(`Recent definitions: ${targetDefinitions}`)
const status = await harvestTillCompletion(targetDefinitions)

Didn't check to see if the constant is used beyond these lines.

for (const [coordinates, isHarvested] of status) {
strictEqual(isHarvested, true, `Harvest for ${coordinates} is not complete`)
}
Expand Down
43 changes: 41 additions & 2 deletions tools/integration/test/integration/testConfig.js
Original file line number Diff line number Diff line change
@@ -1,5 +1,7 @@
// (c) Copyright 2024, SAP SE and ClearlyDefined contributors. Licensed under the MIT license.
// SPDX-License-Identifier: MIT
const fs = require('fs').promises;
const path = require('path');

const devApiBaseUrl = 'https://dev-api.clearlydefined.io'
const prodApiBaseUrl = 'https://api.clearlydefined.io'
Expand All @@ -11,7 +13,7 @@ const pollingMaxTime = 1000 * 60 * 60 // 60 minutes
const harvestTools = ['licensee', 'reuse', 'scancode']

//Components to test
const components = [
const componentsStatic = [
'pypi/pypi/-/platformdirs/4.2.0', //Keep this as the first element to test, it is relatively small
'maven/mavencentral/org.apache.httpcomponents/httpcore/4.4.16',
'maven/mavengoogle/android.arch.lifecycle/common/1.0.1',
Expand All @@ -32,10 +34,47 @@ const components = [
// 'sourcearchive/mavencentral/org.apache.httpcomponents/httpcore/4.1' // Dev and prod have different license and scores. See https://github.com/clearlydefined/crawler/issues/533
]

function shouldUseDynamicComponents() {
// check for environment variable DYNAMIC_COORDINATES, if it is set to true, use dynamic components
return process.env.DYNAMIC_COORDINATES === 'true';

}

async function getComponents() {
if (shouldUseDynamicComponents()) {
console.info("Using dynamic components");
return componentsDynamic();
} else {
console.info("Using static components");
return Promise.resolve(componentsStatic);
}
}

const componentsDynamic = async () => {

const filePath = path.join(__dirname, 'recentDefinitions.json');

try {
// Check if the file exists
await fs.access(filePath);
// Read the file contents
const data = await fs.readFile(filePath, 'utf8');
console.info("Read dynamic components from disk")
return JSON.parse(data);
} catch (err) {
// If the file doesn't exist, fetch the data and save it to disk
const response = await fetch('https://cosmos-query-function-app.azurewebsites.net/api/getrecentdefinitions?days=1&limit=1');
Copy link
Collaborator

@qtomlinson qtomlinson Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another source of recently harvested coordinates exists at the status endpoint: https://dev-api.clearlydefined.io/status/recentlycrawled. The response format is:

[
	{
		"coordinates": "go/golang/github.com%2Fazure/azure-sdk-for-go/v43.3.0+incompatible",
		"timestamp": "2024-09-25T22:36:22.017Z"
	},
	{
		"coordinates": "go/golang/github.com%2Fazure%2Fgo-autorest/autorest/v0.11.24",
		"timestamp": "2024-09-25T22:10:52.752Z"
	}
]

The internal logic is at service/statusService. It utilize application insight, so might be cheaper than cosmo query?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info, I wasn't aware of this endpoint. However it's not giving us enough data as you pointed out, so we'll have to rely on some other mechanism.

I think querying the CosmosDB should not increase the cost much if at all, because it's only done rarely and touches a limited amount of data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good. Another idea just occurred to me: the search query in the getRecentDefinition is based on _meta.updated, which is also the change publication based off. The changed coordinates published hourly can potentially be used to provide the recent coordinates by day and by type (through sorting). Just thought to mention it as an idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I also thought of using the changes notifications mechanism for getting the recent definitions. For the relatively simple use case, as presented here, the data present in the changes notifications would be sufficient.

However I have plans to add some more data for these tests going forward. For example, one of the things we'd be interested in is to see whether we're getting rid of OTHER and NOASSERTION license entries when migrating to the ScanCode's LicenseRefs. For that we'd have to make a more elaborate query, and data from changes notifications mechanism will not be sufficient anymore. We'd have to query the database, using the same approach as presented in this PR (an Azure HTTP Function with CosmosDB access).

Do you have some concerns about the proposed database querying mechanism, @qtomlinson?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes DB query is more extensible. Thanks for the explanation and clarification!

const data = await response.json();
await fs.writeFile(filePath, JSON.stringify(data, null, 2), 'utf8');
console.info("Read dynamic components from remote")
return data;
}
};

module.exports = {
devApiBaseUrl,
prodApiBaseUrl,
components,
getComponents,
harvest: {
poll: { interval: pollingInterval, maxTime: pollingMaxTime }, // for each component
tools: harvestTools,
Expand Down
Loading