-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Harvester regression testing #91
Changes from 1 commit
6107258
7695556
417dcb1
78f86e2
f93ec6a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,7 +1,7 @@ | ||||||||||||||
// (c) Copyright 2024, SAP SE and ClearlyDefined contributors. Licensed under the MIT license. | ||||||||||||||
// SPDX-License-Identifier: MIT | ||||||||||||||
|
||||||||||||||
const { components, devApiBaseUrl, harvest } = require('./testConfig') | ||||||||||||||
const { getComponents, devApiBaseUrl, harvest } = require('./testConfig') | ||||||||||||||
const Poller = require('../../lib/poller') | ||||||||||||||
const Harvester = require('../../lib/harvester') | ||||||||||||||
const { strictEqual } = require('assert') | ||||||||||||||
|
@@ -10,7 +10,9 @@ describe('Tests for harvesting different components', function () { | |||||||||||||
it('should verify all harvests are complete', async function () { | ||||||||||||||
this.timeout(harvest.timeout) | ||||||||||||||
console.time('Harvest Test') | ||||||||||||||
const status = await harvestTillCompletion(components) | ||||||||||||||
const recentDefinitions = await getComponents() | ||||||||||||||
console.info(`Recent definitions: ${recentDefinitions}`) | ||||||||||||||
const status = await harvestTillCompletion(recentDefinitions) | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: naming? recentDefinitions, these can be static component coordinates.
Comment on lines
+13
to
+15
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Didn't check to see if the constant is used beyond these lines. |
||||||||||||||
for (const [coordinates, isHarvested] of status) { | ||||||||||||||
strictEqual(isHarvested, true, `Harvest for ${coordinates} is not complete`) | ||||||||||||||
} | ||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,7 @@ | ||
// (c) Copyright 2024, SAP SE and ClearlyDefined contributors. Licensed under the MIT license. | ||
// SPDX-License-Identifier: MIT | ||
const fs = require('fs').promises; | ||
const path = require('path'); | ||
|
||
const devApiBaseUrl = 'https://dev-api.clearlydefined.io' | ||
const prodApiBaseUrl = 'https://api.clearlydefined.io' | ||
|
@@ -11,7 +13,7 @@ const pollingMaxTime = 1000 * 60 * 60 // 60 minutes | |
const harvestTools = ['licensee', 'reuse', 'scancode'] | ||
|
||
//Components to test | ||
const components = [ | ||
const componentsStatic = [ | ||
'pypi/pypi/-/platformdirs/4.2.0', //Keep this as the first element to test, it is relatively small | ||
'maven/mavencentral/org.apache.httpcomponents/httpcore/4.4.16', | ||
'maven/mavengoogle/android.arch.lifecycle/common/1.0.1', | ||
|
@@ -32,10 +34,47 @@ const components = [ | |
// 'sourcearchive/mavencentral/org.apache.httpcomponents/httpcore/4.1' // Dev and prod have different license and scores. See https://github.com/clearlydefined/crawler/issues/533 | ||
] | ||
|
||
function shouldUseDynamicComponents() { | ||
// check for environment variable DYNAMIC_COORDINATES, if it is set to true, use dynamic components | ||
return process.env.DYNAMIC_COORDINATES === 'true'; | ||
|
||
} | ||
|
||
async function getComponents() { | ||
if (shouldUseDynamicComponents()) { | ||
console.info("Using dynamic components"); | ||
return componentsDynamic(); | ||
} else { | ||
console.info("Using static components"); | ||
return Promise.resolve(componentsStatic); | ||
} | ||
} | ||
|
||
const componentsDynamic = async () => { | ||
|
||
const filePath = path.join(__dirname, 'recentDefinitions.json'); | ||
|
||
try { | ||
// Check if the file exists | ||
await fs.access(filePath); | ||
// Read the file contents | ||
const data = await fs.readFile(filePath, 'utf8'); | ||
console.info("Read dynamic components from disk") | ||
return JSON.parse(data); | ||
} catch (err) { | ||
// If the file doesn't exist, fetch the data and save it to disk | ||
const response = await fetch('https://cosmos-query-function-app.azurewebsites.net/api/getrecentdefinitions?days=1&limit=1'); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another source of recently harvested coordinates exists at the status endpoint: https://dev-api.clearlydefined.io/status/recentlycrawled. The response format is:
The internal logic is at service/statusService. It utilize application insight, so might be cheaper than cosmo query? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the info, I wasn't aware of this endpoint. However it's not giving us enough data as you pointed out, so we'll have to rely on some other mechanism. I think querying the CosmosDB should not increase the cost much if at all, because it's only done rarely and touches a limited amount of data. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The approach looks good. Another idea just occurred to me: the search query in the getRecentDefinition is based on _meta.updated, which is also the change publication based off. The changed coordinates published hourly can potentially be used to provide the recent coordinates by day and by type (through sorting). Just thought to mention it as an idea. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, I also thought of using the changes notifications mechanism for getting the recent definitions. For the relatively simple use case, as presented here, the data present in the changes notifications would be sufficient. However I have plans to add some more data for these tests going forward. For example, one of the things we'd be interested in is to see whether we're getting rid of Do you have some concerns about the proposed database querying mechanism, @qtomlinson? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes DB query is more extensible. Thanks for the explanation and clarification! |
||
const data = await response.json(); | ||
await fs.writeFile(filePath, JSON.stringify(data, null, 2), 'utf8'); | ||
console.info("Read dynamic components from remote") | ||
return data; | ||
} | ||
}; | ||
|
||
module.exports = { | ||
devApiBaseUrl, | ||
prodApiBaseUrl, | ||
components, | ||
getComponents, | ||
harvest: { | ||
poll: { interval: pollingInterval, maxTime: pollingMaxTime }, // for each component | ||
tools: harvestTools, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate the reason why loadFixture is removed? Some definitions are different between Dev and Prod for the following reasons:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the fixtures because their purpose was not clear to me. It's not obvious to me what kind of assurance we expect to get from a test that compares the real (dev) definition to a modified (mocked prod) definition.
If we expect the definitions to be different, due to a bug fix or a new feature, let's assert that in a test, for each definition separately. Mocking and expecting equality, as is done now, seems to work in the opposite direction, glossing over the expected difference.
I admit I might be misunderstanding the purpose of this whole test, so maybe I should leave it as is and create a new one that will do the comparison using the new comparison function? However my initial approach was to modify this test, and I've removed the mocks to make sure I compare the real data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of integration tests is to detect any breaking changes. For example, after the recent ScanCode upgrade was completed, the integration tests showed differences: some were improvements, and some were regressions (See clearlydefined/service#1056 (comment)). For the improvements that are not in the Production deployment, vetted improved definition files can be put into fixtures, allowing for a successful run before the Production deployment is updated. Ideally, after the production deployment is updated, the components whose definitions can be fixed by the new deployment can be re-harvested and the fixture can be removed, as the definitions in dev and prod should be in sync again. @elrayle, Feel free to add if anything is missed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the information @qtomlinson. I've decided to keep this test unchanged and add my structured diff comparison as a separate step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like this is where you already landed. I'll add my thoughts for clarity. I like having both, fixture based tests and dynamic tests. Fixed catches regressions where we expect the same results every time. Dynamic provides a broader sweep with a goal of increasing confidence or identify systemic problems with a proposed release.