Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new summarizer for recent ScanCode versions #1056

Merged
merged 37 commits into from
Jun 26, 2024

Conversation

lumaxis
Copy link
Contributor

@lumaxis lumaxis commented Feb 14, 2024

To unblock clearlydefined/crawler#502, the service needs to be ready to process files put out by newer ScanCode versions.

ScanCode major versions 31 and 32 introduced pretty drastic changes to its output format which required significant changes to our summarizing logic. To not add further special cases, that would have complicated the existing code even more, I instead opted to add a separate file that exclusively handles this new format.

@lumaxis lumaxis force-pushed the updates-new-scancode-version branch from db1c911 to 040e444 Compare February 16, 2024 15:53
providers/summary/scancode-new.js Outdated Show resolved Hide resolved
providers/summary/scancode.js Outdated Show resolved Hide resolved
test/providers/summary/scancode-new.js Outdated Show resolved Hide resolved
test/providers/summary/scancode-new.js Outdated Show resolved Hide resolved
Comment on lines 84 to 100
it('summarizes using license_expression', () => {
const coordinates = { type: 'debsrc', provider: 'debian' }
const harvestData = getHarvestData('32.0.8', 'debsrc-license-expression')
const result = summarizer.summarize(coordinates, harvestData)
assert.equal(result.licensed.declared, 'Apache-2.0')
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test currently fails due to aboutcode-org/scancode-toolkit#3690

In the meantime, I'm wondering if we could build a usable workaround out of the new data at

"other_license_expressions": [
{
"value": "apache-2.0",
"count": 27
},
{
"value": "ace-tao",
"count": 2
},
{
"value": "unknown-license-reference",
"count": 1
}
],
? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to skip this specific test for the moment, I'd love suggestions for other test cases I could add that fall back to the packages[0].license_expression

Copy link
Collaborator

@qtomlinson qtomlinson Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More test cases can be found in commit message for the fix. In addition, a 2nd unit test was also introduced in the same commit.

Copy link
Collaborator

@qtomlinson qtomlinson Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the meantime, I'm wondering if we could build a usable workaround out of the new data at

The other license expressions seems to align with discovered license, but may not align with declared.
Failing to detect license at declared license expression, and also at package[0]. The next route to try is to find the license file at the package root level or a specific license file folder. In v3 and v30, file.is_license_text is the flag to indicate a license file. This flag seems to be removed? Is there a new way to indicate license file in v32?
There are two files with "is_legal": true, but none of them at the package root level:

  • "path": "tenacity-8.2.1/LICENSE"
  • "path": "debian/copyright"

Is there a convention to organize debian packages and specification on a designated folder to put license file? If so, can potentially go to that specific folder to look for license file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file.is_license_text is the flag to indicate a license file. This flag seems to be removed?

Indeed:

The deprecated "--is-license-text" option has been removed. This is now built-in with the --license-text option and --info and exposed with the "percentage_of_license_text" attribute.

https://github.com/nexB/scancode-toolkit/blob/develop/CHANGELOG.rst#license-detection

Is there a new way to indicate license file in v32?

Yes, I've made updates to the function now called _getLicenseFromLicenseDetections to use percentage_of_license_text >= 80. The problem was that _getDetectedLicenseFromFiles was not working for debsrc components because the component name does not match the actual package folder name, python-tenacity vs tenacity-8.2.1. I've pushed a commit with a suggestion for how this could be supported: 49c3b47

Additionally, I was wondering if we don't want to maybe look at more than package[0] and maybe iterate over all packages? The tenacity package element in the packages array has all the right info:

"declared_license_expression": "apache-2.0",
"declared_license_expression_spdx": "Apache-2.0",
"license_detections": [
{
"license_expression": "apache-2.0",
"matches": [
{
"score": 100,
"start_line": 1,
"end_line": 1,
"matched_length": 3,
"match_coverage": 100,
"matcher": "1-hash",
"license_expression": "apache-2.0",
"rule_identifier": "spdx_license_id_apache-2.0_for_apache-2.0.RULE",
"rule_relevance": 100,
"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_apache-2.0_for_apache-2.0.RULE",
"matched_text": "Apache 2.0"
}
],
"identifier": "apache_2_0-d66ab77d-a5cc-7104-e702-dc7df61fe9e8"
},
{
"license_expression": "apache-2.0",
"matches": [
{
"score": 95,
"start_line": 1,
"end_line": 1,
"matched_length": 6,
"match_coverage": 100,
"matcher": "1-hash",
"license_expression": "apache-2.0",
"rule_identifier": "pypi_apache_no-version.RULE",
"rule_relevance": 95,
"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/pypi_apache_no-version.RULE",
"matched_text": "- 'License :: OSI Approved :: Apache Software License'"
}
],
"identifier": "apache_2_0-e267f9d9-ae62-e9c9-9cc2-8cd0a1e4928f"
}
],
"other_license_expression": null,
"other_license_expression_spdx": null,
"other_license_detections": [],

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for filing all the issues to ScanCode!

I've made updates to the function now called _getLicenseFromLicenseDetections to use percentage_of_license_text >= 80.

I believe that determining the limit (e.g. 80) should be a decision made by the community and properly documented, as mentioned in Jeff's comment in another PR. To provide some context, in our handling of scancode data prior to version 3.0.0, licenses with a score of 80 or higher were considered file-level licenses, whereas licenses with a score of 90 or higher were considered declared licenses for the package. For more information, please refer to the discussion thread. It appears that there is a distinction between identifying a license file and a declared license for the package.

Additionally, I was wondering if we don't want to maybe look at more than package[0] and maybe iterate over all packages? The tenacity package element in the packages array has all the right info:

The package[2].type is pypi and it points to tenacity 8.2.1, which has a "declared_license_expression" of "apache-2.0". The pypi package pypi/pypi/-/tenacity/8.2.1 is declared as Apache-2.0.

The question then arises: is it possible for a package (in this case, pypi/pypi/-/tenacity/8.2.1) to be repackaged and published (as debsrc/debian/-/python-tenacity/8.2.1-1) under a different license? @yashkohli88 @capfei, could you please provide some feedback on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a convention to organize debian packages and specification on a designated folder to put license file? If so, can potentially go to that specific folder to look for license file.

Looks like that debian/copyright is the required license info. See Debian New Maintainers' Guide

Copy link
Contributor Author

@lumaxis lumaxis Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like that debian/copyright is the required license info.

Unfortunately, debian/copyright appears to contain a lot of non-license text as well, so its percentage_of_license_text is just over 53% 🤔

For the rest of this logic, I should now be mirroring the logic from the existing ScanCode summarizer and the logic from the old --is-license-text option as it was originally implemented: https://github.com/nexB/scancode-toolkit/blob/v30.1.0/src/licensedcode/plugin_license_text.py

@lumaxis lumaxis force-pushed the updates-new-scancode-version branch from 4ac5a36 to 659a823 Compare April 17, 2024 14:51
@qtomlinson
Copy link
Collaborator

There is a test case available at this link. In this test case, v30 ScanCode identified two occurrences of 'NOASSERTION' ('unknown-license-reference') for the following files:

  • 'com/sun/jna/aix-ppc/libjnidispatch.a'
  • 'com/sun/jna/aix-ppc64/libjnidispatch.a'

However, v3 Scancode did not report any license findings for these two files. It is worth noting that '.a' files are typically object files or static libraries used in Unix-like operating systems. This raises the question of whether the license detection in v30 ScanCode is a regression. Additionally, it would be interesting to know how v32 ScanCode performs in this regard.

@lumaxis lumaxis force-pushed the updates-new-scancode-version branch from 659a823 to e880aa0 Compare April 29, 2024 16:17
@lumaxis
Copy link
Contributor Author

lumaxis commented Apr 29, 2024

@qtomlinson I pushed the package you mentioned as an additional test case. Looks like e.g. com/sun/jna/aix-ppc/libjnidispatch.a still gets reported with unknown-license-reference. However, the overall detected has slightly changed: https://github.com/clearlydefined/service/pull/1056/files#diff-47fcf272ac39f44f87c22747ea0503417c484186c4e0296f84f3f8e55ab8c1f7R98

@lumaxis lumaxis force-pushed the updates-new-scancode-version branch 2 times, most recently from 74589a1 to acb8cbb Compare April 30, 2024 09:29
@lumaxis lumaxis marked this pull request as ready for review April 30, 2024 09:36
@lumaxis lumaxis requested review from qtomlinson and elrayle April 30, 2024 09:36
@lumaxis lumaxis force-pushed the updates-new-scancode-version branch 2 times, most recently from d658029 to bbc2a6c Compare May 14, 2024 09:49
@Jeffrey-Luszcz
Copy link

Going to the CD link to the .a files it points us to the jna-5.6.0-sources.jar though when I download it I don't see the .a files, I only see them in the regular jna-5.6.0.jar file

I ran a strings and a grep on the .a files and don't see anything that screams out a license reference to me (though see plenty of strings, and a bunch of GNU references but they seem API not license related.

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.6.0/

I think this is potentially a false positive, though I'd like to see scancode output for sure

In other .a files we might find true positives (the FFMpeg .a files for example might show this)

@qtomlinson
Copy link
Collaborator

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

Scancode output can be found here. "matched_text": "freeware/" for com/sun/jna/aix-ppc/libjnidispatch.a

@lumaxis lumaxis force-pushed the updates-new-scancode-version branch from 30fa633 to b67935c Compare May 30, 2024 14:48
Copy link
Collaborator

@qtomlinson qtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed work and all the tests!

lib/utils.js Show resolved Hide resolved
lib/utils.js Outdated Show resolved Hide resolved
providers/summary/index.js Outdated Show resolved Hide resolved
providers/summary/scancode.js Show resolved Hide resolved

const licenseExpression =
file.detected_license_expression_spdx || this._getClosestLicenseMatchByFileName([file], coordinates, 80)
setIfValue(result, 'license', licenseExpression)
Copy link
Collaborator

@qtomlinson qtomlinson May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... licenseExpression for a file can be set even if the file is not a license file for the package (e.g. discovered license). isLicenseFile in _getClosestLicenseMatchByFileName may not be necessary here on the file level. so might be better to refactor the isLicenseFile check outside _getClosestLicenseMatchByFileName. isLicenseFile check is indeed necessary when setting file nature on line 159.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @qtomlinson!

The old summarizer has this logic:

const fileLicense = asserted || file.licenses || []
let licenses = new Set(fileLicense.map(x => x.license).filter(x => x))
if (!licenses.size)
  licenses = new Set(fileLicense.filter(x => x.score >= 80).map(x => this._createExpressionFromLicense(x)))
const licenseExpression = joinExpressions(licenses)
setIfValue(result, 'license', licenseExpression)

Using file.detected_license_expression_spdx here to me is equivalent to using joinExpressions(file.licenses) which is one option that can happen in the old code, would you agree? I'm open to changing this but just wanted to call out that it would be a deviation from the old algorithm.

Can you explain why you think isLicenseFile might not be necessary?

Copy link
Collaborator

@qtomlinson qtomlinson Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @lumaxis for the explanation. Using file.detected_license_expression_spdx looks good to me. The this._getClosestLicenseMatchByFileName([file], coordinates, 80) branch includes isLicenseFile check. This checks whether the file is a license file for the package, which is not present in the logic of old summarizer. Specifically, isLicenseFile checks whether the file is named as a license file under root or special directories as per package management system's convention (e.g. META-INF for maven). For discovered licenses, it is preferred include all the license findings above certain confidence (80 threshold) within the content of the package. So this filter may not be necessary for file level discovered license. It is definitely necessary for the package level declared license. More information, see the difference on declared vs discovered license

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qtomlinson Yea, good callout 🤔 I've split the previous function up into two, to be able to correctly replicate the old logic correctly. Do you perhaps know of a good test case for this? I'd like to add a unit test where the previous logic would have failed but haven't been able to find one, yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lumaxis Thanks for making the change. I have tried half a dozen test cases where there are discovered license, but was not able to invoke _getLicenseExpressionFromFileLicenseDetections.

@Jeffrey-Luszcz
Copy link

Thanks for the pointer to the scancode output. The bare 'freeware' text is coming from an include filepath from the source tree used to build the .a file

/home/0/freeware/bin/../lib/gcc/powerpcibmaix7.1.0.0/4.6.3:/home/0/freeware/bin/../lib/gcc:/home/0/freeware/bin/../lib/gcc/powerpc-ibm-aix7.1.0.0/4.6.3/../../..:/usr/lib:/lib
/opt/freeware/src/packages/BUILD/gcc-build-4.6.3/./gcc/include/unwind.h

"/opt/freeware/src/packages/BUILD" is a special historical file path used on AIX to hold open source for building purposes
https://community.ibm.com/community/user/power/discussion/purpose-of-optfreewaresrcpackagesbuild

In this case "freeware" is a misnomer, they really mean "open source"
the use of freeware in this embedded string can be ignored since its not really telling us that jna.a is "freeware"

That said, the "gcc-build-4.6.3/./gcc/include/unwind.h" which is a dependency for this .a file is not seen or scanned by ScanCode and is out of scope for the license results but might affect the final licensing of the .a file!

The unwind.h file is likely GPL v3 w/ GCC Runtime Library Exception,

similar to one seen here:
https://github.com/far-far-away-science/hab-v2/blob/e8b63d4c9d4df487bb7d2cd0d6e10f092e20581d/software/archive/gcc/include/unwind.h#L4

Final thoughts:
So in the end "freeware" is a false positive
A human curation might add "GNU GPL v3 w/ GCC Runtime Library Exception in a deep audit

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

Scancode output can be found here. "matched_text": "freeware/" for com/sun/jna/aix-ppc/libjnidispatch.a

@lumaxis
Copy link
Contributor Author

lumaxis commented Jun 17, 2024

Leaving a note here with this run of the integrations test with both of my crawler and service branches deployed: https://github.com/clearlydefined/operations/actions/runs/9083092159/job/24961014518

There's a couple of failures but as far as I can tell, all are expected 🙏🏼

@elrayle
Copy link
Collaborator

elrayle commented Jun 18, 2024

There's a couple of failures but as far as I can tell, all are expected

Will clearlydefined/operations#76 impact the failing integration tests?

@qtomlinson
Copy link
Collaborator

qtomlinson commented Jun 18, 2024

Will clearlydefined/operations#76 impact the failing integration tests?

The tests were run with clearlydefined/operations#76. Otherwise, tests would fail at harvest. This is also 'Add auto detect schema versions', which is currently for review, trying to address.

@qtomlinson
Copy link
Collaborator

qtomlinson commented Jun 18, 2024

There's a couple of failures but as far as I can tell, all are expected 🙏🏼

Thanks for the log! I have looked through the logs and summarized as the following cases:

  1. crate/cratesio/-/ratatui/0.26.0: missing field in production (projectWebsite: 'https://ratatui.rs'). This has been fixed by my recent PR

  2. Different scoring reported for pypi/pypi/-/platformdirs/4.2.0

  • new scancode summarizer: licensed.toolScore.discovered: 0
  • production: licensed.toolScore.discovered: 1
    Need to look into the definition detail to see why the score is different.
  1. File licenses are different for the following 3 cases (differences extracted from the log):
  • git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below,
    expected: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
    actual: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}

  • nuget/nuget/-/NuGet.Protocol/6.7.1, 2 files are different
    expected: {"path":"NuGet.Protocol.nuspec","license":"Apache-2.0 AND NOASSERTION","attributions":["(c) Microsoft Corporation"],"hashes":{"sha1":"94b5bbfb0e08ee3586ee7cc7c503b5c4ade2e997","sha256":"2d9c3b5b9ff0d9aca7461e21b66c9f8b69e259b557ce7b34898eb71f3efda465"}}
    actual: {"path":"NuGet.Protocol.nuspec","license":"Apache-2.0","attributions":["(c) Microsoft Corporation"],"hashes":{"sha1":"94b5bbfb0e08ee3586ee7cc7c503b5c4ade2e997","sha256":"2d9c3b5b9ff0d9aca7461e21b66c9f8b69e259b557ce7b34898eb71f3efda465"}}
    expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
    actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}

  • pod/cocoapods/-/SoftButton/0.1.0, 1 file license is different.
    expected: {"path":"README.md","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}
    actual: {"path":"README.md","license":"MIT","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}

Need to confirm which one is correct.

  1. Different declared license:
  • npm/npmjs/-/redis/0.1.0
    • declared license is MIT from the new scancode summarizer and causing score change.
    • declared code is empty in production
  • pypi/pypi/-/sdbus/0.12.0
    • new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
    • production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

If we have confirmed that the new declared and file licenses, and scoring are correct, we can update the comparison by uploading the fixtures.

@Jeffrey-Luszcz , could you help us verify differences in file license (point 3) and declared license (point 4)?

@qtomlinson
Copy link
Collaborator

qtomlinson commented Jun 23, 2024

Different scoring reported for pypi/pypi/-/platformdirs/4.2.0

This is due to the difference in copyright detection in ScanCode v32 and ScanCode v30.
30.3.0.json detects copyrights in platformdirs-4.2.0/LICENSE; while 32.3.0.json result does not.

@Jeffrey-Luszcz Is this a regression that needs to be reported to ScanCode?

Copy link
Collaborator

@qtomlinson qtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@qtomlinson qtomlinson merged commit ed44664 into clearlydefined:master Jun 26, 2024
4 checks passed
@Jeffrey-Luszcz
Copy link

3 [git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641]

The deny.toml file in this result might be something we should EXCLUDE from scans since it does not represent actual license content.

deny.toml is a config file for a testing tool https://github.com/EmbarkStudios/cargo-deny and exists to create allow/deny lists for licenses used by components in the dependency list.

git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below,
expected: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
actual: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}

In this case the deny.toml file contains the following license strings:
"Apache-2.0",
"BSD-2-Clause",
"BSD-3-Clause",
"ISC",
"MIT",
"Unicode-DFS-2016",
"WTFPL",

The scan results are better now but also missing ISC and MIT seen in the deny.toml file section containing the other license names

@qtomlinson
Copy link
Collaborator

qtomlinson commented Jul 4, 2024

3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

  • expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
    actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}

The clearlydefined/downloaded/LICENSE is the license obtained from https://licenses.nuget.org/Apache-2.0 (licenseUrl from the component manifest). The difference in licenses reported for this file lies in how CD interprets ScanCode raw results in new (actual) and legacy (expected) summarizer.

In both 30.3.0.json and 32.3.0.json ScanCode results, there is a matching of "ECL 2.0" with scores of around 48% for the file:

  • When using the v30 result, there is a score-based (>=80) filtering in place for file-level license findings. License findings scoring under 80 are disregarded, hence only "Apache 2.0" is reported.
  • When using the v32 result, the detected_license_expression_spdx from ScanCode is directly utilized, which results in the license being reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". It appears that this detected_license_expression_spdx includes license findings below a score of 80.
    @Jeffrey-Luszcz @lumaxis @elrayle The question arises whether this inclusion of license findings with a lower matching score is the desired behavior.

@Jeffrey-Luszcz
Copy link

3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

When using the v32 result, the detected_license_expression_spdx from ScanCode is directly utilized, which results in the license being reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". It appears that this detected_license_expression_spdx includes license findings below a score of 80.
@Jeffrey-Luszcz @lumaxis @elrayle The question arises whether this inclusion of license findings with a lower matching score is the desired behavior.

I would consider the "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". result to be incorrect. This license text is the Apache 2.0 text (with possibly some nuget doc boilerplate like:

Notes
This license was released January 2004

SPDX web page
https://spdx.org/licenses/Apache-2.0.html
Notice
This license content is provided by the [SPDX project](https://spdx.dev/). For more information about licenses.nuget.org, see [our documentation](https://aka.ms/licenses.nuget.org).

Data pulled from [spdx/license-list-data](https://github.com/spdx/license-list-data) on February 9, 2023.

NuGet.Protocol/6.7.1 does NOT contain the ECL 2.0 as an option for the package, only AL 2.0.
The ECL 2.0 is a modified version of the AL 2.0 with ADDITIONAL text in the Patent Clause. ScanCode should be able to differentiate between to variants of a similar license.

The three "licenses" that we are thinking about here are a pure Apache 2.0, the NuGet Apache 2.0 file with some additional info at the bottom and top and a "pure" ECL 2.0 license text (where B represents the original patent clause, B' the modified patent clause and D is the Nuget text talking about SPDX from the block above:
Apache 2.0 NuGet ECL 2.0
A A A
B B B'
C C C
D

The noise of adding "OR (AL 2.0 or EC 2.0)" is pretty bad and not user friendly esp for a license like the ECL 2.0 which is seen in only a handful of packages. The AL 2.0 is somewhere like the 3rd most popular license, I'd be worried if they all started getting reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)".

I wonder if scancode is seeing a license that it thinks might superset of the Apache 2.0 because of the "D" text in the NuGet license file and thus returning a bunch of possibilities it wouldn't if it had a more pure Apache 2.0 text. It still should realize that the ECL 2.0 is a special sub-set or modified version of the Apache 2.0 in my opinion....

It would be worth seeing if we start getting "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". noisy results for things we expect to be pure Apache 2.0 due to a change in ScanCode of if this is a special case due to the Nuget spdx Noise...

@qtomlinson
Copy link
Collaborator

qtomlinson commented Aug 7, 2024

  • pod/cocoapods/-/SoftButton/0.1.0, 1 file license is different.
    expected: {"path":"README.md","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}
    actual: {"path":"README.md","license":"MIT","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}

Need to confirm which one is correct.

  1. Different declared license:
  • npm/npmjs/-/redis/0.1.0

    • declared license is MIT from the new scancode summarizer and causing score change.
    • declared code is empty in production
  • pypi/pypi/-/sdbus/0.12.0

    • new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
 * fixture pre v32 , declared: 'GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-or-later',

@Jeffrey-Luszcz Could you kindly verify the validity of these license differences and confirm that the new version is the correct one?

@Jeffrey-Luszcz
Copy link

Jeffrey-Luszcz commented Aug 14, 2024

pypi/pypi/-/sdbus/0.12.0
new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

SDBus explicitly says its LGPL 2.1 in its README.
This scancode license string seems overly complicated for a simple LGPL 2.1 declaration likely due to GPL and LGPL license text found at top level. I would make the case that either we do a curation for this component or talk about how to handle the LGPL declared case where a GPL file is shipped along with the LGPL file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants