There are other cases that are potentially worse, where a crate offering a choice of 3 licenses is again declared with AND rather than OR. For example, slog and its related feature crates should be MPL-2.0 OR MIT OR Apache-2.0, but clearlydefined declares them as MPL-2.0 AND MIT AND Apache-2.0 (e.g. slog 2.7.0).
This case is worse because the declared license is more restrictive than the correct license.
Summary analysis
The clearlydefined summarizer correctly identifies the package license due to extracting the declared license from Cargo.toml, but the licensee and scancode scanners/summarizers (a) don't understand Rust crates, so do not read licensing info from Cargo.toml, and (b) detect the multiple license text files on offer and incorrectly assume that they all apply, combining their licenses using AND.
The Aggregator service combines the correct clearlydefined declaration with the other incorrect declarations to give an incorrect final result.
Investigation detail
Overview of clearlydefined license detection/declaration
clearlydefined uses multiple tools to harvest data about packages (note: looks like there's a bug where not all of these tools run on every scan):
clearlydefined
fossology
licensee
scancode
The output from all of these tools is saved as a JSON object and is viewable via the clearlydefined web interface in the Raw/Harvested Data section. Each tool searches for known license references within package files and produces a report of detected license types.
clearlydefined processes the harvested data via services:
SummaryService: for each tool, processes the raw data to derive common values, e.g. licensed.declared
  the top-level loop is in business/summarizer.js
  this calls through to per-tool summarizers in providers/summary
AggregatorService: combines the summary data from each tool to generate a single consolidated view of the package data (there is other processing, but it is not relevant to this issue)
  the top-level code is in business/aggregator.js
  this calls mergeDefinitions() in lib/utils, which in turn calls _mergeDescribed(), _mergeLicensed() and _mergeFiles()
License aggregation
The AggregatorService combines the licenses reported by the tools by ANDing their SPDX identifiers:
[lib/utils.js]
function _mergeLicensed(base, proposed, override) {
  if (!proposed) return base
  if (!base) return proposed
  const result = _mergeExcept(base, proposed, ['declared'])
  setIfValue(result, 'declared', override ? proposed.declared : SPDX.merge(proposed.declared, base.declared, 'AND'))
  return result
}
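To illustrate why ANDing a correct OR declaration with an incorrect AND declaration loses the choice, here is a toy model of license expressions. It is only a sketch: either, all, andMerge and render are hypothetical helpers written for this example, not the SPDX.merge used in lib/utils.js. An expression is modelled as a list of alternatives, where each alternative is the set of licenses that must all be satisfied.

```javascript
// Toy model (assumption, not the real SPDX library):
// an expression is a list of alternatives (OR), and each alternative
// is a sorted list of licenses that must all be satisfied (AND).
const either = (...ids) => ids.map(id => [id])
const all = (...ids) => [[...ids].sort()]

// AND of two expressions: pair every alternative of `a` with every
// alternative of `b`, then drop identical combinations.
function andMerge(a, b) {
  const seen = new Set()
  const out = []
  for (const x of a) {
    for (const y of b) {
      const combined = [...new Set([...x, ...y])].sort()
      const key = combined.join('|')
      if (!seen.has(key)) {
        seen.add(key)
        out.push(combined)
      }
    }
  }
  return out
}

const render = expr => expr.map(alt => alt.join(' AND ')).join(' OR ')

const fromCargoToml = either('MIT', 'Apache-2.0') // MIT OR Apache-2.0
const fromLicenseFiles = all('Apache-2.0', 'MIT') // Apache-2.0 AND MIT
const merged = render(andMerge(fromCargoToml, fromLicenseFiles))
console.log(merged)
```

In this model every alternative offered by Cargo.toml is forced to include both licenses once it is ANDed with the scanner result, so the choice disappears from the merged expression.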
Issue analysis
The issue is best explained by inspecting the data from an example crate crate/cratesio/-/quote/1.0.9.
Cargo.toml specifies that the crate is licensed as MIT OR Apache-2.0
The README.md also declares the license: Licensed under either of Apache License, Version 2.0 or MIT license at your option.
There are two license files containing the text of the MIT and Apache-2.0 licenses: LICENSE-MIT, LICENSE-APACHE
clearlydefined incorrectly declares the license as Apache-2.0 AND MIT, rather than the expected MIT OR Apache-2.0.
The licensed.declared field generated by the SummaryService for each of the tools is:
clearlydefined/1.20: MIT OR Apache-2.0 (correct)
licensee/9.13.0: Apache-2.0 AND MIT (incorrect)
scancode/3.2.2: Apache-2.0 AND MIT (incorrect)
The AggregatorService combines these licenses to form the final declared license: Apache-2.0 AND MIT (incorrect).
The key question is therefore why the clearlydefined tool summary is correct, but the licensee and scancode summaries are incorrect.
clearlydefined summarizer
The clearlydefined summary code (providers/summary/clearlydefined.js) understands Rust packages, and as the final step in summarizing uses the values from Cargo.toml to correctly set licensed.declared:
[providers/summary/clearlydefined.js]
addCrateData(result, data, coordinates) {
  ...
  const license = get(data, 'registryData.license')
  if (license) setIfValue(result, 'licensed.declared', SPDX.normalize(license.split('/').join(' OR ')))
  ...
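The split('/').join(' OR ') step is there because older Cargo.toml files separate alternative licenses with a slash (e.g. MIT/Apache-2.0) rather than with OR. A minimal sketch of just that string step is shown here. slashToOr is a hypothetical name for this example, and the SPDX.normalize call from the real code is deliberately left out.

```javascript
// Hypothetical helper isolating the string transformation used in
// addCrateData(); the real code also passes the result to SPDX.normalize.
function slashToOr(license) {
  return license.split('/').join(' OR ')
}

const legacy = slashToOr('MIT/Apache-2.0')
const modern = slashToOr('MIT OR Apache-2.0') // no slash, so unchanged
console.log(legacy)
console.log(modern)
```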
licensee summarizer
The licensee summarizer iterates through all the package files, extracts any license references discovered, then combines the unique license names using AND. The harvested licensee data shows that it has detected:
An Apache-2.0 reference in the LICENSE-APACHE file
An MIT reference in the LICENSE-MIT file
NOASSERTION in Cargo.toml
It combines the discovered Apache-2.0 and MIT references using AND, as described above.
This logic may be correct for source files within a package that use different licenses. In this case, however, the logic is wrong: the license references are detected only in the license definition files, and the summarizer takes no account of the fact that the user may choose either of these licenses (as described in Cargo.toml and README.md).
scancode summarizer
The scancode summarizer calculates the declared license by iterating through all the package files, keeping only those that are license texts, then combining all of those licenses with AND.
Looking at the raw harvested data, the files which match this are again LICENSE-APACHE and LICENSE-MIT, resulting in an incorrect declared license of Apache-2.0 AND MIT.
fossology summarizer
Although the fossology summarizer is not used in this example, its code is very similar to the scancode summarizer - returning the licenses of all the license text files it finds:
[providers/summary/fossology.js]
_declareLicense(coordinates, result) {
  if (!result.files) return
  // if we know this is a license file by the name of it and it has a license detected in it
  // then let's declare the license for the component
  const licenses = uniq(
    result.files.filter(file => file.license && isLicenseFile(file.path, coordinates)).map(file => file.license)
  )
  setIfValue(result, 'licensed.declared', licenses.join(' AND '))
}
This would therefore also give an incorrect license declaration in this case (if the tool had run).
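A standalone sketch of the same filter-and-join pattern, applied to a file list shaped like the quote crate, shows where the AND comes from. Everything here is illustrative: looksLikeLicenseFile is a simplified stand-in for the real isLicenseFile, and the file records are hand-written, not harvested fossology output.

```javascript
// Simplified stand-in (assumption) for isLicenseFile(): matches names
// such as LICENSE, LICENSE-MIT, LICENSE-APACHE, LICENCE.txt.
const looksLikeLicenseFile = path => /^LICEN[CS]E(-[A-Z0-9.]+)?(\.(md|txt))?$/i.test(path)

// Same shape as _declareLicense(): collect the unique licenses found
// in license files and join them with AND.
function declareFromLicenseFiles(files) {
  const licenses = [
    ...new Set(files.filter(file => file.license && looksLikeLicenseFile(file.path)).map(file => file.license))
  ]
  return licenses.join(' AND ')
}

// Hand-written example data, not real harvested output.
const quoteFiles = [
  { path: 'Cargo.toml' },
  { path: 'LICENSE-APACHE', license: 'Apache-2.0' },
  { path: 'LICENSE-MIT', license: 'MIT' },
  { path: 'src/lib.rs' }
]

const declared = declareFromLicenseFiles(quoteFiles)
console.log(declared)
```

Nothing in this pattern can ever produce OR, so any package that ships one license file per alternative ends up with an AND declaration.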
Possible fixes/workarounds
For Rust crates, only use the results from the clearlydefined summarizer
For Rust crates, modify the non-clearlydefined summarizers to ignore license text files, and only declare licenses on other files if detected
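As a rough sketch of the first workaround, the aggregation step could prefer the clearlydefined summary whenever the coordinates describe a crate. This is not the project's code: pickDeclared, the shape of the summaries object, and the use of coordinates.type === 'crate' are all assumptions made for this example, and the fallback is a simplified string join, not SPDX.merge.

```javascript
// Hypothetical aggregation helper: for crates, trust the declared license
// that the clearlydefined summarizer read from Cargo.toml.
function pickDeclared(coordinates, summaries) {
  const own = summaries.clearlydefined && summaries.clearlydefined.declared
  if (coordinates.type === 'crate' && own) return own

  // Simplified fallback for other package types: AND the unique
  // declarations, parenthesising compound expressions.
  const unique = [...new Set(Object.values(summaries).map(s => s.declared).filter(Boolean))]
  if (unique.length === 1) return unique[0]
  return unique.map(d => (d.includes(' ') ? `(${d})` : d)).join(' AND ')
}

const quoteSummaries = {
  clearlydefined: { declared: 'MIT OR Apache-2.0' },
  licensee: { declared: 'Apache-2.0 AND MIT' },
  scancode: { declared: 'Apache-2.0 AND MIT' }
}

const crateResult = pickDeclared({ type: 'crate', name: 'quote' }, quoteSummaries)
console.log(crateResult)
```

The second workaround would instead change what the other summarizers emit, which keeps the aggregation logic untouched but requires each summarizer to know about crates.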
Additional notes
I have added some tests to enable testing/investigation of this issue with the help of a debugger. I'll push these changes in a branch and add a link here.
Problem description
Many Rust crates offer a choice of licenses, typically MIT OR Apache-2.0, following the lead of Rust itself: https://www.rust-lang.org/policies/licenses
clearlydefined is often (usually) incorrectly declaring the license of these crates as MIT AND Apache-2.0, rather than MIT OR Apache-2.0. Examples:
quote 1.0.9
serde 1.0.126
clap 2.33.3
regex 1.5.3