Add new summarizer for recent ScanCode versions #1056

lumaxis · 2024-02-14T11:43:55Z

To unblock clearlydefined/crawler#502, the service needs to be ready to process files put out by newer ScanCode versions.

ScanCode major versions 31 and 32 introduced drastic changes to the output format, which required significant changes to our summarizing logic. Rather than add further special cases that would have complicated the existing code even more, I opted to add a separate file that exclusively handles the new format.
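
A minimal sketch of the delegation idea described above, assuming the harvested output exposes the ScanCode version as a string. The names `pickSummarizer`, `legacySummarizer`, and `newSummarizer` are hypothetical, the cutoff at major version 32 is an assumption made for illustration, and the hand-rolled major-version parse stands in for the semver comparison mentioned in the commit list.

```javascript
// Hypothetical stand-ins for the two summarizer modules.
const legacySummarizer = { name: 'scancode', summarize: () => ({}) }
const newSummarizer = { name: 'scancode-new', summarize: () => ({}) }

// Assumption: output produced by ScanCode major version >= 32 uses the new format.
const NEW_FORMAT_MAJOR = 32

// Extract the major version from a string such as '32.0.8'.
function majorVersion(version) {
  const major = Number.parseInt(String(version).split('.')[0], 10)
  if (Number.isNaN(major)) throw new Error(`Invalid ScanCode version: ${version}`)
  return major
}

// Delegate to the summarizer that understands the given output format.
function pickSummarizer(scancodeVersion) {
  return majorVersion(scancodeVersion) >= NEW_FORMAT_MAJOR ? newSummarizer : legacySummarizer
}

console.log(pickSummarizer('30.1.0').name)
console.log(pickSummarizer('32.0.8').name)
```

Keeping the version check in one small function means neither summarizer needs to know about the other format.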

providers/summary/scancode-new.js

providers/summary/scancode.js

test/providers/summary/scancode-new.js

lumaxis · 2024-04-04T12:43:02Z

test/providers/summary/scancode-new.js

+  it('summarizes using license_expression', () => {
+    const coordinates = { type: 'debsrc', provider: 'debian' }
+    const harvestData = getHarvestData('32.0.8', 'debsrc-license-expression')
+    const result = summarizer.summarize(coordinates, harvestData)
+    assert.equal(result.licensed.declared, 'Apache-2.0')
+  })


This test currently fails due to aboutcode-org/scancode-toolkit#3690

In the meantime, I'm wondering if we could build a usable workaround out of the new data at

service/test/fixtures/scancode/32.0.8/debsrc-license-expression.json

Lines 85 to 98 in dac4efb

"other_license_expressions": [
  {
    "value": "apache-2.0",
    "count": 27
  },
  {
    "value": "ace-tao",
    "count": 2
  },
  {
    "value": "unknown-license-reference",
    "count": 1
  }
],

? 🤔

If we decide to skip this specific test for the moment, I'd love suggestions for other test cases I could add that fall back to the packages[0].license_expression
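
To make the workaround idea concrete, here is a rough sketch that picks the most frequent entry from `other_license_expressions`. The helper name `mostFrequentExpression` is hypothetical, and skipping `unknown-*` values is an assumption. The result is a ScanCode license key rather than an SPDX identifier, so a mapping step (omitted here) would still be needed.

```javascript
// Data copied from the fixture excerpt quoted in this comment.
const otherLicenseExpressions = [
  { value: 'apache-2.0', count: 27 },
  { value: 'ace-tao', count: 2 },
  { value: 'unknown-license-reference', count: 1 }
]

// Hypothetical helper: return the most frequent expression, ignoring unknown references.
function mostFrequentExpression(entries) {
  const candidates = entries.filter(entry => entry.value && !entry.value.startsWith('unknown'))
  if (!candidates.length) return null
  return candidates.reduce((best, entry) => (entry.count > best.count ? entry : best)).value
}

console.log(mostFrequentExpression(otherLicenseExpressions))
```

A frequency count like this reflects what was discovered across files, which is not necessarily what the package declares.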

More test cases can be found in the commit message for the fix. In addition, a second unit test was introduced in the same commit.

In the meantime, I'm wondering if we could build a usable workaround out of the new data at

The other license expressions seem to align with the discovered licenses, but may not align with the declared license.
License detection fails at the declared license expression and also at package[0]. The next route to try is to find the license file at the package root level or in a specific license file folder. In v3 and v30, file.is_license_text is the flag to indicate a license file. This flag seems to be removed? Is there a new way to indicate license file in v32?
There are two files with "is_legal": true, but neither of them is at the package root level:

"path": "tenacity-8.2.1/LICENSE"

"path": "debian/copyright"

Is there a convention for how Debian packages are organized, with a designated folder for the license file? If so, we could look in that specific folder for the license file.

file.is_license_text is the flag to indicate a license file. This flag seems to be removed?

Indeed:

The deprecated "--is-license-text" option has been removed. This is now built-in with the --license-text option and --info and exposed with the "percentage_of_license_text" attribute.

https://github.com/nexB/scancode-toolkit/blob/develop/CHANGELOG.rst#license-detection

Is there a new way to indicate license file in v32?

Yes, I've made updates to the function now called _getLicenseFromLicenseDetections to use percentage_of_license_text >= 80. The problem was that _getDetectedLicenseFromFiles was not working for debsrc components because the component name does not match the actual package folder name, python-tenacity vs tenacity-8.2.1. I've pushed a commit with a suggestion for how this could be supported: 49c3b47
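
As a rough illustration of the `percentage_of_license_text >= 80` check mentioned above, the sketch below filters files by that attribute. It is not the implementation of `_getLicenseFromLicenseDetections`. The helper name `findLicenseTextFiles` is hypothetical and the percentages in the sample data are made up for illustration.

```javascript
// Threshold taken from the comment above; whether 80 is the right value is open for discussion.
const LICENSE_TEXT_THRESHOLD = 80

// Hypothetical helper: keep files whose content is mostly license text.
function findLicenseTextFiles(files, threshold = LICENSE_TEXT_THRESHOLD) {
  return files.filter(file => (file.percentage_of_license_text || 0) >= threshold)
}

// Illustrative file entries; the percentage values are assumptions.
const files = [
  { path: 'tenacity-8.2.1/LICENSE', percentage_of_license_text: 100, detected_license_expression_spdx: 'Apache-2.0' },
  { path: 'debian/copyright', percentage_of_license_text: 53, detected_license_expression_spdx: 'Apache-2.0' },
  { path: 'tenacity-8.2.1/setup.py', percentage_of_license_text: 0 }
]

console.log(findLicenseTextFiles(files).map(file => file.path))
```

A file that mixes license text with other content can fall below the threshold even when it is the authoritative license source for the package.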

Additionally, I was wondering if we don't want to maybe look at more than package[0] and maybe iterate over all packages? The tenacity package element in the packages array has all the right info:

service/test/fixtures/scancode/32.0.8/debsrc-license-expression.json

Lines 297 to 341 in 8bc66a7

"declared_license_expression": "apache-2.0",
"declared_license_expression_spdx": "Apache-2.0",
"license_detections": [
  {
    "license_expression": "apache-2.0",
    "matches": [
      {
        "score": 100,
        "start_line": 1,
        "end_line": 1,
        "matched_length": 3,
        "match_coverage": 100,
        "matcher": "1-hash",
        "license_expression": "apache-2.0",
        "rule_identifier": "spdx_license_id_apache-2.0_for_apache-2.0.RULE",
        "rule_relevance": 100,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/spdx_license_id_apache-2.0_for_apache-2.0.RULE",
        "matched_text": "Apache 2.0"
      }
    ],
    "identifier": "apache_2_0-d66ab77d-a5cc-7104-e702-dc7df61fe9e8"
  },
  {
    "license_expression": "apache-2.0",
    "matches": [
      {
        "score": 95,
        "start_line": 1,
        "end_line": 1,
        "matched_length": 6,
        "match_coverage": 100,
        "matcher": "1-hash",
        "license_expression": "apache-2.0",
        "rule_identifier": "pypi_apache_no-version.RULE",
        "rule_relevance": 95,
        "rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/pypi_apache_no-version.RULE",
        "matched_text": "- 'License :: OSI Approved :: Apache Software License'"
      }
    ],
    "identifier": "apache_2_0-e267f9d9-ae62-e9c9-9cc2-8cd0a1e4928f"
  }
],
"other_license_expression": null,
"other_license_expression_spdx": null,
"other_license_detections": [],

Thanks for filing all the issues to ScanCode!

I've made updates to the function now called _getLicenseFromLicenseDetections to use percentage_of_license_text >= 80.

I believe that determining the limit (e.g. 80) should be a decision made by the community and properly documented, as mentioned in Jeff's comment in another PR. To provide some context, in our handling of scancode data prior to version 3.0.0, licenses with a score of 80 or higher were considered file-level licenses, whereas licenses with a score of 90 or higher were considered declared licenses for the package. For more information, please refer to the discussion thread. It appears that there is a distinction between identifying a license file and a declared license for the package.

Additionally, I was wondering if we don't want to maybe look at more than package[0] and maybe iterate over all packages? The tenacity package element in the packages array has all the right info:

The package[2].type is pypi and it points to tenacity 8.2.1, which has a "declared_license_expression" of "apache-2.0". The pypi package pypi/pypi/-/tenacity/8.2.1 is declared as Apache-2.0.
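
The idea of looking beyond package[0] could be sketched roughly as below. The helper name `declaredExpressionsFromPackages` is hypothetical, and the first two package entries are placeholders; only the pypi entry for tenacity 8.2.1 reflects the fixture discussed here.

```javascript
// Hypothetical helper: collect distinct SPDX declared expressions across all packages.
function declaredExpressionsFromPackages(packages = []) {
  const expressions = packages.map(pkg => pkg.declared_license_expression_spdx).filter(Boolean)
  return [...new Set(expressions)]
}

// Placeholder entries at index 0 and 1; index 2 mirrors the pypi package from the fixture.
const packages = [
  { declared_license_expression_spdx: null },
  { declared_license_expression_spdx: null },
  { type: 'pypi', name: 'tenacity', version: '8.2.1', declared_license_expression_spdx: 'Apache-2.0' }
]

console.log(declaredExpressionsFromPackages(packages))
```

Whether a declared license found on an embedded package should apply to the repackaged component is a separate policy question that this sketch does not answer.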

The question then arises: is it possible for a package (in this case, pypi/pypi/-/tenacity/8.2.1) to be repackaged and published (as debsrc/debian/-/python-tenacity/8.2.1-1) under a different license? @yashkohli88 @capfei, could you please provide some feedback on this?

Is there a convention for how Debian packages are organized, with a designated folder for the license file? If so, we could look in that specific folder for the license file.

Looks like that debian/copyright is the required license info. See Debian New Maintainers' Guide

Looks like that debian/copyright is the required license info.

Unfortunately, debian/copyright appears to contain a lot of non-license text as well, so its percentage_of_license_text is just over 53% 🤔

For the rest of this logic, I should now be mirroring the logic from the existing ScanCode summarizer and the logic from the old --is-license-text option as it was originally implemented: https://github.com/nexB/scancode-toolkit/blob/v30.1.0/src/licensedcode/plugin_license_text.py

qtomlinson · 2024-04-23T17:49:37Z

There is a test case available at this link. In this test case, v30 ScanCode identified two occurrences of 'NOASSERTION' ('unknown-license-reference') for the following files:

'com/sun/jna/aix-ppc/libjnidispatch.a'
'com/sun/jna/aix-ppc64/libjnidispatch.a'

However, v3 ScanCode did not report any license findings for these two files. It is worth noting that '.a' files are typically object files or static libraries used in Unix-like operating systems. This raises the question of whether the license detection in v30 ScanCode is a regression. Additionally, it would be interesting to know how v32 ScanCode performs in this regard.

lumaxis · 2024-04-29T16:35:16Z

@qtomlinson I pushed the package you mentioned as an additional test case. Looks like e.g. com/sun/jna/aix-ppc/libjnidispatch.a still gets reported with unknown-license-reference. However, the overall detected has slightly changed: https://github.com/clearlydefined/service/pull/1056/files#diff-47fcf272ac39f44f87c22747ea0503417c484186c4e0296f84f3f8e55ab8c1f7R98

Jeffrey-Luszcz · 2024-05-29T15:28:46Z

Going to the CD link to the .a files, it points us to jna-5.6.0-sources.jar, though when I download it I don't see the .a files; I only see them in the regular jna-5.6.0.jar file.

I ran strings and grep on the .a files and don't see anything that screams out a license reference to me (though I see plenty of strings and a bunch of GNU references, they seem API related, not license related).

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

https://repo1.maven.org/maven2/net/java/dev/jna/jna/5.6.0/

I think this is potentially a false positive, though I'd like to see scancode output for sure

In other .a files we might find true positives (the FFMpeg .a files for example might show this)

qtomlinson · 2024-05-29T18:27:00Z

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

Scancode output can be found here. "matched_text": "freeware/" for com/sun/jna/aix-ppc/libjnidispatch.a

qtomlinson

Thanks for the detailed work and all the tests!

lib/utils.js

providers/summary/index.js

providers/summary/scancode.js

qtomlinson · 2024-05-30T18:29:43Z

providers/summary/scancode-new.js

+
+        const licenseExpression =
+          file.detected_license_expression_spdx || this._getClosestLicenseMatchByFileName([file], coordinates, 80)
+        setIfValue(result, 'license', licenseExpression)


hm... licenseExpression for a file can be set even if the file is not a license file for the package (e.g. discovered license). isLicenseFile in _getClosestLicenseMatchByFileName may not be necessary here on the file level. so might be better to refactor the isLicenseFile check outside _getClosestLicenseMatchByFileName. isLicenseFile check is indeed necessary when setting file nature on line 159.

Thank you @qtomlinson!

The old summarizer has this logic:

const fileLicense = asserted || file.licenses || []
let licenses = new Set(fileLicense.map(x => x.license).filter(x => x))
if (!licenses.size)
  licenses = new Set(fileLicense.filter(x => x.score >= 80).map(x => this._createExpressionFromLicense(x)))
const licenseExpression = joinExpressions(licenses)
setIfValue(result, 'license', licenseExpression)

Using file.detected_license_expression_spdx here to me is equivalent to using joinExpressions(file.licenses) which is one option that can happen in the old code, would you agree? I'm open to changing this but just wanted to call out that it would be a deviation from the old algorithm.

Can you explain why you think isLicenseFile might not be necessary?

Thank you @lumaxis for the explanation. Using file.detected_license_expression_spdx looks good to me. The this._getClosestLicenseMatchByFileName([file], coordinates, 80) branch includes an isLicenseFile check. This checks whether the file is a license file for the package, which is not present in the logic of the old summarizer. Specifically, isLicenseFile checks whether the file is named as a license file under the root or special directories as per the package management system's convention (e.g. META-INF for maven). For discovered licenses, it is preferable to include all license findings above a certain confidence (the 80 threshold) within the content of the package, so this filter may not be necessary for file-level discovered licenses. It is definitely necessary for the package-level declared license. For more information, see the difference between declared and discovered licenses.
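
The distinction drawn above can be sketched as two separate helpers. This is an illustration only: the name list, the `SPECIAL_FOLDERS` map, and the simplified file shape (`licenses` entries with `license` and `score`) are assumptions, and this `isLicenseFile` is a stand-in rather than the real utility.

```javascript
// Assumption: file names and folders that mark a package-level license file.
const LICENSE_NAMES = ['license', 'license.txt', 'license.md', 'copying']
const SPECIAL_FOLDERS = { maven: ['META-INF'] }

// Stand-in check: is this a license file at the root or in a special folder for the package type?
function isLicenseFile(path, type) {
  const segments = path.split('/')
  const name = segments.pop().toLowerCase()
  if (!LICENSE_NAMES.includes(name)) return false
  const folder = segments.join('/')
  return folder === '' || (SPECIAL_FOLDERS[type] || []).includes(folder)
}

// File level (discovered): keep every finding at or above the threshold, whatever the file name.
function fileLevelLicenses(file, threshold = 80) {
  return (file.licenses || []).filter(l => l.score >= threshold).map(l => l.license)
}

// Package level (declared): only consider files that are license files for the package.
function declaredCandidates(files, type, threshold = 80) {
  return files.filter(f => isLicenseFile(f.path, type)).flatMap(f => fileLevelLicenses(f, threshold))
}

const sampleFiles = [
  { path: 'META-INF/LICENSE', licenses: [{ license: 'Apache-2.0', score: 100 }] },
  { path: 'src/main/Foo.java', licenses: [{ license: 'MIT', score: 95 }] }
]

console.log(fileLevelLicenses(sampleFiles[1]))
console.log(declaredCandidates(sampleFiles, 'maven'))
```

Keeping the license-file check out of the file-level path lets a source file such as Foo.java still report its discovered license without influencing the declared license.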

@qtomlinson Yea, good callout 🤔 I've split the previous function into two so that the old logic is replicated correctly. Do you perhaps know of a good test case for this? I'd like to add a unit test where the previous logic would have failed, but I haven't been able to find one yet.

@lumaxis Thanks for making the change. I have tried half a dozen test cases where there are discovered licenses, but was not able to invoke _getLicenseExpressionFromFileLicenseDetections.

test/providers/summary/scancode-new.js

Jeffrey-Luszcz · 2024-06-07T18:03:02Z

Thanks for the pointer to the scancode output. The bare 'freeware' text is coming from an include filepath from the source tree used to build the .a file

/home/0/freeware/bin/../lib/gcc/powerpcibmaix7.1.0.0/4.6.3:/home/0/freeware/bin/../lib/gcc:/home/0/freeware/bin/../lib/gcc/powerpc-ibm-aix7.1.0.0/4.6.3/../../..:/usr/lib:/lib
/opt/freeware/src/packages/BUILD/gcc-build-4.6.3/./gcc/include/unwind.h

"/opt/freeware/src/packages/BUILD" is a special historical file path used on AIX to hold open source for building purposes
https://community.ibm.com/community/user/power/discussion/purpose-of-optfreewaresrcpackagesbuild

In this case "freeware" is a misnomer; they really mean "open source".
The use of freeware in this embedded string can be ignored, since it's not really telling us that jna.a is "freeware".

That said, the "gcc-build-4.6.3/./gcc/include/unwind.h" which is a dependency for this .a file is not seen or scanned by ScanCode and is out of scope for the license results but might affect the final licensing of the .a file!

The unwind.h file is likely GPL v3 w/ GCC Runtime Library Exception,

similar to one seen here:
https://github.com/far-far-away-science/hab-v2/blob/e8b63d4c9d4df487bb7d2cd0d6e10f092e20581d/software/archive/gcc/include/unwind.h#L4

Final thoughts:
So in the end "freeware" is a false positive
A human curation might add "GNU GPL v3 w/ GCC Runtime Library Exception" in a deep audit

Do we know what unknown-license-reference string its finding? If not I can try running scancode on it locally.

Scancode output can be found here. "matched_text": "freeware/" for com/sun/jna/aix-ppc/libjnidispatch.a

lumaxis · 2024-06-17T15:59:33Z

Leaving a note here with this run of the integrations test with both of my crawler and service branches deployed: https://github.com/clearlydefined/operations/actions/runs/9083092159/job/24961014518

There's a couple of failures but as far as I can tell, all are expected 🙏🏼

elrayle · 2024-06-18T12:52:30Z

There's a couple of failures but as far as I can tell, all are expected

Will clearlydefined/operations#76 impact the failing integration tests?

qtomlinson · 2024-06-18T17:51:02Z

Will clearlydefined/operations#76 impact the failing integration tests?

The tests were run with clearlydefined/operations#76; otherwise, they would fail at harvest. This is also what 'Add auto detect schema versions', which is currently in review, is trying to address.

qtomlinson · 2024-06-18T18:06:47Z

There's a couple of failures but as far as I can tell, all are expected 🙏🏼

Thanks for the log! I have looked through the logs and summarized as the following cases:

1. crate/cratesio/-/ratatui/0.26.0: missing field in production (projectWebsite: 'https://ratatui.rs'). This has been fixed by my recent PR

2. Different scoring reported for pypi/pypi/-/platformdirs/4.2.0
   - new scancode summarizer: licensed.toolScore.discovered: 0
   - production: licensed.toolScore.discovered: 1
   Need to look into the definition detail to see why the score is different.

3. File licenses are different for the following 3 cases (differences extracted from the log):
   - git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below,
     expected: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
     actual: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
   - nuget/nuget/-/NuGet.Protocol/6.7.1, 2 files are different
     expected: {"path":"NuGet.Protocol.nuspec","license":"Apache-2.0 AND NOASSERTION","attributions":["(c) Microsoft Corporation"],"hashes":{"sha1":"94b5bbfb0e08ee3586ee7cc7c503b5c4ade2e997","sha256":"2d9c3b5b9ff0d9aca7461e21b66c9f8b69e259b557ce7b34898eb71f3efda465"}}
     actual: {"path":"NuGet.Protocol.nuspec","license":"Apache-2.0","attributions":["(c) Microsoft Corporation"],"hashes":{"sha1":"94b5bbfb0e08ee3586ee7cc7c503b5c4ade2e997","sha256":"2d9c3b5b9ff0d9aca7461e21b66c9f8b69e259b557ce7b34898eb71f3efda465"}}
     expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
     actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
   - pod/cocoapods/-/SoftButton/0.1.0, 1 file license is different.
     expected: {"path":"README.md","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}
     actual: {"path":"README.md","license":"MIT","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}

   Need to confirm which one is correct.

4. Different declared license:
   - npm/npmjs/-/redis/0.1.0
     - declared license is MIT from the new scancode summarizer and causing score change.
     - declared code is empty in production
   - pypi/pypi/-/sdbus/0.12.0
     - new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
     - production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

If we have confirmed that the new declared and file licenses, and scoring are correct, we can update the comparison by uploading the fixtures.

@Jeffrey-Luszcz , could you help us verify differences in file license (point 3) and declared license (point 4)?
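
For reference, file-license differences like the ones listed in this comment can be extracted with a small comparison along these lines. This is a sketch, not the integration test's actual code. The helper name `diffFileLicenses` is hypothetical, and the sample data is trimmed to `path` and `license`, with the LICENSE entry added purely for illustration.

```javascript
// Hypothetical helper: list files whose license differs between two definitions.
function diffFileLicenses(expectedFiles, actualFiles) {
  const actualByPath = new Map(actualFiles.map(file => [file.path, file]))
  return expectedFiles
    .filter(file => actualByPath.has(file.path) && actualByPath.get(file.path).license !== file.license)
    .map(file => ({ path: file.path, expected: file.license, actual: actualByPath.get(file.path).license }))
}

// README.md mirrors the SoftButton case; the LICENSE entry is illustrative.
const expected = [{ path: 'README.md' }, { path: 'LICENSE', license: 'MIT' }]
const actual = [{ path: 'README.md', license: 'MIT' }, { path: 'LICENSE', license: 'MIT' }]

console.log(diffFileLicenses(expected, actual))
```

Files present on only one side are ignored here; a fuller comparison would report those separately.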

qtomlinson · 2024-06-23T02:39:44Z

Different scoring reported for pypi/pypi/-/platformdirs/4.2.0

This is due to the difference in copyright detection in ScanCode v32 and ScanCode v30.
30.3.0.json detects copyrights in platformdirs-4.2.0/LICENSE; while 32.3.0.json result does not.

@Jeffrey-Luszcz Is this a regression that needs to be reported to ScanCode?

qtomlinson

Looks good to me!

Jeffrey-Luszcz · 2024-06-28T14:44:51Z

3 [git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641]

The deny.toml file in this result might be something we should EXCLUDE from scans since it does not represent actual license content.

deny.toml is a config file for a testing tool https://github.com/EmbarkStudios/cargo-deny and exists to create allow/deny lists for licenses used by components in the dependency list.

git/github/ratatui-org/ratatui/bcf43688ec4a13825307aef88f3cdcd007b32641, License is different for the file below,
expected: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}
actual: {"path":"deny.toml","license":"Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause AND Unicode-DFS-2016 AND WTFPL","hashes":{"sha1":"bf384bff590477ffd1b06c7de5b14dc65466964c","sha256":"32451b183c80028c2f63a02370b25969014515acc136f669ffc743ddca587202"}}

In this case the deny.toml file contains the following license strings:
"Apache-2.0",
"BSD-2-Clause",
"BSD-3-Clause",
"ISC",
"MIT",
"Unicode-DFS-2016",
"WTFPL",

The scan results are better now, but they are still missing ISC and MIT, which appear in the same deny.toml section as the other license names.
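
If excluding such files turns out to be the preferred route, one possible shape is a simple name-based filter applied before file licenses are aggregated. The `EXCLUDED_FILE_NAMES` list and the helper `withoutExcludedFiles` are hypothetical; nothing like this is confirmed to exist in the service.

```javascript
// Assumption: a hand-maintained list of file names to skip when aggregating licenses.
const EXCLUDED_FILE_NAMES = new Set(['deny.toml'])

// Hypothetical helper: drop files whose base name is on the exclusion list.
function withoutExcludedFiles(files) {
  return files.filter(file => !EXCLUDED_FILE_NAMES.has(file.path.split('/').pop()))
}

const scannedFiles = [
  { path: 'deny.toml', license: 'Apache-2.0 AND BSD-2-Clause AND BSD-3-Clause' },
  { path: 'src/lib.rs', license: 'MIT' }
]

console.log(withoutExcludedFiles(scannedFiles).map(file => file.path))
```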

qtomlinson · 2024-07-04T22:14:19Z

3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

expected: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}
actual: {"path":"clearlydefined/downloaded/LICENSE","license":"Apache-2.0 AND (ECL-2.0 AND Apache-2.0)","hashes":{"sha1":"6215cb16583ad28f22e1a7c905ee5bbee1044635","sha256":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"},"token":"e38b0854f4b591f3297571e73c665bd4266e8c8f9c55fb580942e51a4da619a8"}

The clearlydefined/downloaded/LICENSE is the license obtained from https://licenses.nuget.org/Apache-2.0 (licenseUrl from the component manifest). The difference in licenses reported for this file lies in how CD interprets ScanCode raw results in new (actual) and legacy (expected) summarizer.

In both 30.3.0.json and 32.3.0.json ScanCode results, there is a matching of "ECL 2.0" with scores of around 48% for the file:

When using the v30 result, there is a score-based (>=80) filtering in place for file-level license findings. License findings scoring under 80 are disregarded, hence only "Apache 2.0" is reported.
When using the v32 result, the detected_license_expression_spdx from ScanCode is directly utilized, which results in the license being reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". It appears that this detected_license_expression_spdx includes license findings below a score of 80.
@Jeffrey-Luszcz @lumaxis @elrayle The question arises whether this inclusion of license findings with a lower matching score is the desired behavior.
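
To illustrate the difference between the two approaches, the sketch below applies a score filter to individual matches instead of taking detected_license_expression_spdx as-is. It assumes file-level entries carry license_detections with the same shape as the package-level excerpt quoted earlier in this thread. The helper name `expressionsAboveScore` is hypothetical, the sample scores (other than the roughly 48% ECL match) are made up, and the result contains ScanCode license keys, so SPDX mapping is left out.

```javascript
// Hypothetical helper: collect ScanCode license keys from matches at or above a score threshold.
function expressionsAboveScore(file, threshold = 80) {
  const matches = (file.license_detections || []).flatMap(detection => detection.matches || [])
  const keys = matches.filter(match => match.score >= threshold).map(match => match.license_expression)
  return [...new Set(keys)]
}

// Illustrative data for the downloaded NuGet license file.
const licenseFile = {
  path: 'clearlydefined/downloaded/LICENSE',
  detected_license_expression_spdx: 'Apache-2.0 AND (ECL-2.0 AND Apache-2.0)',
  license_detections: [
    { matches: [{ score: 100, license_expression: 'apache-2.0' }] },
    {
      matches: [
        { score: 48, license_expression: 'ecl-2.0' },
        { score: 100, license_expression: 'apache-2.0' }
      ]
    }
  ]
}

console.log(expressionsAboveScore(licenseFile))
```

With the filter in place, the low-scoring ECL match drops out and only the Apache key remains, which matches the behavior described for the v30 path.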

Jeffrey-Luszcz · 2024-07-05T21:51:06Z

3 nuget/nuget/-/NuGet.Protocol/6.7.1 in integration test

When using the v32 result, the detected_license_expression_spdx from ScanCode is directly utilized, which results in the license being reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)". It appears that this detected_license_expression_spdx includes license findings below a score of 80.
@Jeffrey-Luszcz @lumaxis @elrayle The question arises whether this inclusion of license findings with a lower matching score is the desired behavior.

I would consider the "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)" result to be incorrect. This license text is the Apache 2.0 text, possibly with some NuGet doc boilerplate like:

Notes
This license was released January 2004

SPDX web page
https://spdx.org/licenses/Apache-2.0.html
Notice
This license content is provided by the [SPDX project](https://spdx.dev/). For more information about licenses.nuget.org, see [our documentation](https://aka.ms/licenses.nuget.org).

Data pulled from [spdx/license-list-data](https://github.com/spdx/license-list-data) on February 9, 2023.

NuGet.Protocol/6.7.1 does NOT contain the ECL 2.0 as an option for the package, only AL 2.0.
The ECL 2.0 is a modified version of the AL 2.0 with ADDITIONAL text in the Patent Clause. ScanCode should be able to differentiate between two variants of a similar license.

The three "licenses" we are thinking about here are a pure Apache 2.0 text, the NuGet Apache 2.0 file with some additional info at the top and bottom, and a "pure" ECL 2.0 license text (where B represents the original patent clause, B' the modified patent clause, and D the NuGet text about SPDX from the block above):

Apache 2.0    NuGet    ECL 2.0
A             A        A
B             B        B'
C             C        C
              D

The noise of adding "OR (AL 2.0 or ECL 2.0)" is pretty bad and not user friendly, especially for a license like the ECL 2.0, which is seen in only a handful of packages. The AL 2.0 is something like the 3rd most popular license; I'd be worried if they all started getting reported as "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)".

I wonder if ScanCode is seeing a license that it thinks might be a superset of the Apache 2.0 because of the "D" text in the NuGet license file, and is thus returning a bunch of possibilities it wouldn't if it had a purer Apache 2.0 text. It still should realize that the ECL 2.0 is a special subset or modified version of the Apache 2.0, in my opinion.

It would be worth seeing whether we start getting noisy "Apache-2.0 AND (ECL-2.0 AND Apache-2.0)" results for things we expect to be pure Apache 2.0 due to a change in ScanCode, or whether this is a special case due to the NuGet SPDX noise.

qtomlinson · 2024-08-07T16:23:09Z

pod/cocoapods/-/SoftButton/0.1.0, 1 file license is different.
expected: {"path":"README.md","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}
actual: {"path":"README.md","license":"MIT","hashes":{"sha1":"f2f54c7ed2178108a86e1fa58344a5c3ccc1da1e","sha256":"c338b0c0fbf0e334e50a2e0762900fc3ab55a3efb93a5f1e81ca6a5610c91246"}}

Need to confirm which one is correct.

Different declared license:

npm/npmjs/-/redis/0.1.0

declared license is MIT from the new scancode summarizer and causing score change.

declared code is empty in production

pypi/pypi/-/sdbus/0.12.0

new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'

fixture pre-v32, declared: 'GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-or-later'

@Jeffrey-Luszcz Could you kindly verify the validity of these license differences and confirm that the new version is the correct one?

Jeffrey-Luszcz · 2024-08-14T14:41:28Z

pypi/pypi/-/sdbus/0.12.0
new scancode summarizer, declared: 'GPL-1.0-or-later AND GPL-2.0 AND LGPL-2.0-or-later AND LGPL-2.1-only AND LGPL-2.1-or-later AND Python-2.0'
production, declared: 'GPL-2.0 AND GPL-2.0-only AND GPL-3.0-or-later AND LGPL-2.1-only'

SDBus explicitly says it's LGPL 2.1 in its README.
This ScanCode license string seems overly complicated for a simple LGPL 2.1 declaration, likely due to the GPL and LGPL license texts found at the top level. I would make the case that we either do a curation for this component or discuss how to handle the LGPL declared case where a GPL file is shipped along with the LGPL file.

lumaxis force-pushed the updates-new-scancode-version branch from db1c911 to 040e444 Compare February 16, 2024 15:53

lumaxis commented Feb 19, 2024


lumaxis force-pushed the updates-new-scancode-version branch from 2392faf to dac4efb Compare April 2, 2024 16:05

qtomlinson mentioned this pull request Apr 3, 2024

Precedence is not preserved when joining Scancode license expressions #1084

Closed

lumaxis commented Apr 4, 2024


lumaxis force-pushed the updates-new-scancode-version branch from 4ac5a36 to 659a823 Compare April 17, 2024 14:51

lumaxis force-pushed the updates-new-scancode-version branch from 659a823 to e880aa0 Compare April 29, 2024 16:17

lumaxis force-pushed the updates-new-scancode-version branch 2 times, most recently from 74589a1 to acb8cbb Compare April 30, 2024 09:29

lumaxis marked this pull request as ready for review April 30, 2024 09:36

lumaxis requested review from qtomlinson and elrayle April 30, 2024 09:36

lumaxis force-pushed the updates-new-scancode-version branch 2 times, most recently from d658029 to bbc2a6c Compare May 14, 2024 09:49

lumaxis added 12 commits May 30, 2024 14:48

Update debscr-license-expression fixture

633a49e

Add fixtures for new ScanCode version

d760ab2

Add initial new ScanCode processing logic

df88ded

Conditionally call new ScanCode summarizer

4419d52

Update semver

b6458f1

Update Summarizer test

728ef6f

Use semver to compare scancodeVersion

613a985

Move shared functions to utils

971483b

Ensure version checking code path is tested as well

80c2aa4

Various small cleanups

d4f6c89

Update _getLicenseByFileName

996473d

Update and extend tests

c9ce542

lumaxis added 3 commits May 30, 2024 14:48

Rename functions

05e77a3

Update test description

cf45f89

Minor codestyle improvement in test

b67935c

lumaxis force-pushed the updates-new-scancode-version branch from 30fa633 to b67935c Compare May 30, 2024 14:48

qtomlinson reviewed May 30, 2024


qtomlinson approved these changes May 30, 2024


lumaxis added 4 commits June 10, 2024 15:27

Merge branch 'master' into updates-new-scancode-version

5f1d2ab

Restructure logic and add ScanCode delegator

eb314d0

Update debian package location logic

af03090

Remove duplicated test

b5a0d38

qtomlinson mentioned this pull request Jun 19, 2024

"Discovered" licenses from notices file showing up in "Declared" field - Google Mavens clearlydefined/crawler#583

Open

Separate function to better replicate old logic

202b854

qtomlinson approved these changes Jun 24, 2024


Merge branch 'master' into updates-new-scancode-version

dcf7963

qtomlinson merged commit ed44664 into clearlydefined:master Jun 26, 2024
4 checks passed

elrayle mentioned this pull request Jun 26, 2024

Announce scancode LicenseRef support #1149

Open

4 tasks

qtomlinson mentioned this pull request Aug 23, 2024

Investigate regression cases found in integration tests after updating ScanCode #1183

Open

qtomlinson mentioned this pull request Sep 25, 2024

Harvester regression testing clearlydefined/operations#91

Merged
