-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new summarizer for recent ScanCode versions #1056
Add new summarizer for recent ScanCode versions #1056
Conversation
db1c911
to
040e444
Compare
2392faf
to
dac4efb
Compare
it('summarizes using license_expression', () => { | ||
const coordinates = { type: 'debsrc', provider: 'debian' } | ||
const harvestData = getHarvestData('32.0.8', 'debsrc-license-expression') | ||
const result = summarizer.summarize(coordinates, harvestData) | ||
assert.equal(result.licensed.declared, 'Apache-2.0') | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test currently fails due to aboutcode-org/scancode-toolkit#3690
In the meantime, I'm wondering if we could build a usable workaround out of the new data at
service/test/fixtures/scancode/32.0.8/debsrc-license-expression.json
Lines 85 to 98 in dac4efb
"other_license_expressions": [ | |
{ | |
"value": "apache-2.0", | |
"count": 27 | |
}, | |
{ | |
"value": "ace-tao", | |
"count": 2 | |
}, | |
{ | |
"value": "unknown-license-reference", | |
"count": 1 | |
} | |
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we decide to skip this specific test for the moment, I'd love suggestions for other test cases I could add that fall back to the packages[0].license_expression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More test cases can be found in commit message for the fix. In addition, a 2nd unit test was also introduced in the same commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meantime, I'm wondering if we could build a usable workaround out of the new data at
The other license expressions seems to align with discovered license, but may not align with declared.
Failing to detect license at declared license expression, and also at package[0]. The next route to try is to find the license file at the package root level or a specific license file folder. In v3 and v30, file.is_license_text is the flag to indicate a license file. This flag seems to be removed? Is there a new way to indicate license file in v32?
There are two files with "is_legal": true, but none of them at the package root level:
- "path": "tenacity-8.2.1/LICENSE"
- "path": "debian/copyright"
Is there a convention to organize debian packages and specification on a designated folder to put license file? If so, can potentially go to that specific folder to look for license file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file.is_license_text is the flag to indicate a license file. This flag seems to be removed?
Indeed:
The deprecated "--is-license-text" option has been removed. This is now built-in with the --license-text option and --info and exposed with the "percentage_of_license_text" attribute.
https://github.com/nexB/scancode-toolkit/blob/develop/CHANGELOG.rst#license-detection
Is there a new way to indicate license file in v32?
Yes, I've made updates to the function now called _getLicenseFromLicenseDetections
to use percentage_of_license_text >= 80
. The problem was that _getDetectedLicenseFromFiles
was not working for debsrc
components because the component name does not match the actual package folder name, python-tenacity
vs tenacity-8.2.1
. I've pushed a commit with a suggestion for how this could be supported: 49c3b47
Additionally, I was wondering if we don't want to maybe look at more than package[0]
and maybe iterate over all packages? The tenacity
package element in the packages
array has all the right info:
service/test/fixtures/scancode/32.0.8/debsrc-license-expression.json
Lines 297 to 341 in 8bc66a7
"declared_license_expression": "apache-2.0", | |
"declared_license_expression_spdx": "Apache-2.0", | |
"license_detections": [ | |
{ | |
"license_expression": "apache-2.0", | |
"matches": [ | |
{ | |
"score": 100, | |
"start_line": 1, | |
"end_line": 1, | |
"matched_length": 3, | |
"match_coverage": 100, | |
"matcher": "1-hash", | |
"license_expression": "apache-2.0", | |
"rule_identifier": "spdx_license_id_apache-2.0_for_apache-2.0.RULE", | |
"rule_relevance": 100, | |
"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_apache-2.0_for_apache-2.0.RULE", | |
"matched_text": "Apache 2.0" | |
} | |
], | |
"identifier": "apache_2_0-d66ab77d-a5cc-7104-e702-dc7df61fe9e8" | |
}, | |
{ | |
"license_expression": "apache-2.0", | |
"matches": [ | |
{ | |
"score": 95, | |
"start_line": 1, | |
"end_line": 1, | |
"matched_length": 6, | |
"match_coverage": 100, | |
"matcher": "1-hash", | |
"license_expression": "apache-2.0", | |
"rule_identifier": "pypi_apache_no-version.RULE", | |
"rule_relevance": 95, | |
"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/pypi_apache_no-version.RULE", | |
"matched_text": "- 'License :: OSI Approved :: Apache Software License'" | |
} | |
], | |
"identifier": "apache_2_0-e267f9d9-ae62-e9c9-9cc2-8cd0a1e4928f" | |
} | |
], | |
"other_license_expression": null, | |
"other_license_expression_spdx": null, | |
"other_license_detections": [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for filing all the issues to ScanCode!
I've made updates to the function now called
_getLicenseFromLicenseDetections
to usepercentage_of_license_text >= 80
.
I believe that determining the limit (e.g. 80) should be a decision made by the community and properly documented, as mentioned in Jeff's comment in another PR. To provide some context, in our handling of scancode data prior to version 3.0.0, licenses with a score
of 80 or higher were considered file-level licenses, whereas licenses with a score
of 90 or higher were considered declared licenses for the package. For more information, please refer to the discussion thread. It appears that there is a distinction between identifying a license file and a declared license for the package.
Additionally, I was wondering if we don't want to maybe look at more than
package[0]
and maybe iterate over all packages? Thetenacity
package element in thepackages
array has all the right info:
The package[2].type is pypi and it points to tenacity 8.2.1, which has a "declared_license_expression" of "apache-2.0". The pypi package pypi/pypi/-/tenacity/8.2.1 is declared as Apache-2.0.
The question then arises: is it possible for a package (in this case, pypi/pypi/-/tenacity/8.2.1) to be repackaged and published (as debsrc/debian/-/python-tenacity/8.2.1-1) under a different license? @yashkohli88 @capfei, could you please provide some feedback on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a convention to organize debian packages and specification on a designated folder to put license file? If so, can potentially go to that specific folder to look for license file.
Looks like that debian/copyright is the required license info. See Debian New Maintainers' Guide
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like that debian/copyright is the required license info.
Unfortunately, debian/copyright
appears to contain a lot of non-license text as well, so its percentage_of_license_text
is just over 53% 🤔
For the rest of this logic, I should now be mirroring the logic from the existing ScanCode summarizer and the logic from the old --is-license-text
option as it was originally implemented: https://github.com/nexB/scancode-toolkit/blob/v30.1.0/src/licensedcode/plugin_license_text.py
4ac5a36
to
659a823
Compare
There is a test case available at this link. In this test case, v30 ScanCode identified two occurrences of 'NOASSERTION' ('unknown-license-reference') for the following files:
However, v3 Scancode did not report any license findings for these two files. It is worth noting that '.a' files are typically object files or static libraries used in Unix-like operating systems. This raises the question of whether the license detection in v30 ScanCode is a regression. Additionally, it would be interesting to know how v32 ScanCode performs in this regard. |
659a823
to
e880aa0
Compare
@qtomlinson I pushed the package you mentioned as an additional test case. Looks like e.g. |
74589a1
to
acb8cbb
Compare
d658029
to
bbc2a6c
Compare
Going to the CD link to the .a files it points us to the jna-5.6.0-sources.jar though when I download it I don't see the .a files, I only see them in the regular jna-5.6.0.jar file I ran a strings and a grep on the .a files and don't see anything that screams out a license reference to me (though see plenty of strings, and a bunch of GNU references but they seem API not license related. Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally. https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.6.0/ I think this is potentially a false positive, though I'd like to see scancode output for sure In other .a files we might find true positives (the FFMpeg .a files for example might show this) |
Scancode output can be found here. |
30fa633
to
b67935c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed work and all the tests!
providers/summary/scancode-new.js
Outdated
|
||
const licenseExpression = | ||
file.detected_license_expression_spdx || this._getClosestLicenseMatchByFileName([file], coordinates, 80) | ||
setIfValue(result, 'license', licenseExpression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm... licenseExpression for a file can be set even if the file is not a license file for the package (e.g. discovered license). isLicenseFile
in _getClosestLicenseMatchByFileName
may not be necessary here on the file level. so might be better to refactor the isLicenseFile
check outside _getClosestLicenseMatchByFileName
. isLicenseFile
check is indeed necessary when setting file nature on line 159.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @qtomlinson!
The old summarizer has this logic:
const fileLicense = asserted || file.licenses || []
let licenses = new Set(fileLicense.map(x => x.license).filter(x => x))
if (!licenses.size)
licenses = new Set(fileLicense.filter(x => x.score >= 80).map(x => this._createExpressionFromLicense(x)))
const licenseExpression = joinExpressions(licenses)
setIfValue(result, 'license', licenseExpression)
Using file.detected_license_expression_spdx
here to me is equivalent to using joinExpressions(file.licenses)
which is one option that can happen in the old code, would you agree? I'm open to changing this but just wanted to call out that it would be a deviation from the old algorithm.
Can you explain why you think isLicenseFile
might not be necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @lumaxis for the explanation. Using file.detected_license_expression_spdx
looks good to me. The this._getClosestLicenseMatchByFileName([file], coordinates, 80)
branch includes isLicenseFile
check. This checks whether the file is a license file for the package, which is not present in the logic of old summarizer. Specifically, isLicenseFile
checks whether the file is named as a license file under root or special directories as per package management system's convention (e.g. META-INF for maven). For discovered licenses, it is preferred include all the license findings above certain confidence (80 threshold) within the content of the package. So this filter may not be necessary for file level discovered license. It is definitely necessary for the package level declared license. More information, see the difference on declared vs discovered license
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qtomlinson Yea, good callout 🤔 I've split the previous function up into two, to be able to correctly replicate the old logic correctly. Do you perhaps know of a good test case for this? I'd like to add a unit test where the previous logic would have failed but haven't been able to find one, yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lumaxis Thanks for making the change. I have tried half a dozen test cases where there are discovered license, but was not able to invoke _getLicenseExpressionFromFileLicenseDetections.
Thanks for the pointer to the scancode output. The bare 'freeware' text is coming from an include filepath from the source tree used to build the .a file
"/opt/freeware/src/packages/BUILD" is a special historical file path used on AIX to hold open source for building purposes In this case "freeware" is a misnomer, they really mean "open source" That said, the "gcc-build-4.6.3/./gcc/include/unwind.h" which is a dependency for this .a file is not seen or scanned by ScanCode and is out of scope for the license results but might affect the final licensing of the .a file! The unwind.h file is likely GPL v3 w/ GCC Runtime Library Exception, similar to one seen here: Final thoughts:
|
Leaving a note here with this run of the integrations test with both of my There's a couple of failures but as far as I can tell, all are expected 🙏🏼 |
Will clearlydefined/operations#76 impact the failing integration tests? |
The tests were run with clearlydefined/operations#76. Otherwise, tests would fail at harvest. This is also 'Add auto detect schema versions', which is currently for review, trying to address. |
Thanks for the log! I have looked through the logs and summarized as the following cases:
Need to confirm which one is correct.
If we have confirmed that the new declared and file licenses, and scoring are correct, we can update the comparison by uploading the fixtures. @Jeffrey-Luszcz , could you help us verify differences in file license (point 3) and declared license (point 4)? |
This is due to the difference in copyright detection in ScanCode v32 and ScanCode v30. @Jeffrey-Luszcz Is this a regression that needs to be reported to ScanCode? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
3 [git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641] The deny.toml file in this result might be something we should EXCLUDE from scans since it does not represent actual license content. deny.toml is a config file for a testing tool https://github.com/EmbarkStudios/cargo-deny and exists to create allow/deny lists for licenses used by components in the dependency list. git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below, In this case the deny.toml file contains the following license strings: The scan results are better now but also missing ISC and MIT seen in the deny.toml file section containing the other license names |
3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test
The clearlydefined/downloaded/LICENSE is the license obtained from https://licenses.nuget.org/Apache-2.0 (licenseUrl from the component manifest). The difference in licenses reported for this file lies in how CD interprets ScanCode raw results in new (actual) and legacy (expected) summarizer. In both 30.3.0.json and 32.3.0.json ScanCode results, there is a matching of "ECL 2.0" with scores of around 48% for the file:
|
3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test
I would consider the "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". result to be incorrect. This license text is the Apache 2.0 text (with possibly some nuget doc boilerplate like:
NuGet.Protocol/6.7.1 does NOT contain the ECL 2.0 as an option for the package, only AL 2.0. The three "licenses" that we are thinking about here are a pure Apache 2.0, the NuGet Apache 2.0 file with some additional info at the bottom and top and a "pure" ECL 2.0 license text (where B represents the original patent clause, B' the modified patent clause and D is the Nuget text talking about SPDX from the block above: The noise of adding "OR (AL 2.0 or EC 2.0)" is pretty bad and not user friendly esp for a license like the ECL 2.0 which is seen in only a handful of packages. The AL 2.0 is somewhere like the 3rd most popular license, I'd be worried if they all started getting reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". I wonder if scancode is seeing a license that it thinks might superset of the Apache 2.0 because of the "D" text in the NuGet license file and thus returning a bunch of possibilities it wouldn't if it had a more pure Apache 2.0 text. It still should realize that the ECL 2.0 is a special sub-set or modified version of the Apache 2.0 in my opinion.... It would be worth seeing if we start getting "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". noisy results for things we expect to be pure Apache 2.0 due to a change in ScanCode of if this is a special case due to the Nuget spdx Noise... |
@Jeffrey-Luszcz Could you kindly verify the validity of these license differences and confirm that the new version is the correct one? |
pypi/pypi/-/sdbus/0.12.0 SDBus explicitly says its LGPL 2.1 in its README. |
To unblock clearlydefined/crawler#502, the service needs to be ready to process files put out by newer ScanCode versions.
ScanCode major versions 31 and 32 introduced pretty drastic changes to its output format which required significant changes to our summarizing logic. To not add further special cases, that would have complicated the existing code even more, I instead opted to add a separate file that exclusively handles this new format.