Many (most) Rust crates with a choice of licenses are declared incorrectly #856

Closed
johnbatty opened this issue Jun 15, 2021 · 3 comments

Comments

@johnbatty
Contributor

Problem description

Many Rust crates have a choice of licenses - typically MIT OR Apache-2.0, following the lead of Rust itself:
https://www.rust-lang.org/policies/licenses

clearlydefined is often (usually) incorrectly declaring the license of these crates as MIT AND Apache-2.0, rather than MIT OR Apache-2.0.

There are other cases that are potentially worse, where a crate with a choice of three licenses is again declared with AND rather than OR. For example, slog and its related feature crates should be MPL-2.0 OR MIT OR Apache-2.0, but are declared by clearlydefined as MPL-2.0 AND MIT AND Apache-2.0.

This case is worse because the declared license is more restrictive than the correct license.

Summary analysis

The clearlydefined summarizer correctly identifies the package license by extracting the declared license from Cargo.toml. The licensee and scancode summarizers get it wrong because they (a) don't understand Rust crates, so they do not read licensing info from Cargo.toml, and (b) detect the multiple license text files on offer, incorrectly assume that they all apply, and combine their licenses using AND.

The Aggregator service combines the correct clearlydefined declaration with the other incorrect declarations to give an incorrect final result.

Investigation detail

Overview of clearlydefined license detection/declaration

clearlydefined uses multiple tools to harvest data about packages (note: looks like there's a bug where not all of these tools run on every scan):

  • clearlydefined
  • fossology
  • licensee
  • scancode

The output from all of these tools is saved as a JSON object and is viewable via the clearlydefined web interface in the Raw/Harvested Data section. Each tool searches for known license references within package files and produces a report of detected license types.

clearlydefined processes the harvested data via services:

  • SummaryService: processes each tool's raw data to derive a common set of values per tool, e.g. licensed.declared
    • the top-level loop is in business/summarizer.js
    • this calls through to per-tool summarizers in providers/summary
  • AggregatorService: combines the summary data from each tool to generate a single consolidated view of the package data
    (there is other processing, but that is not relevant to this issue)
    • the top-level code is in business/aggregator.js
    • this calls mergeDefinitions() in lib/utils, which in turn calls _mergeDescribed(), _mergeLicensed() and _mergeFiles()

License aggregation

Licenses reported by the tools are combined by the Aggregation service by ANDing their SPDX identifiers:

[lib/utils.js]
function _mergeLicensed(base, proposed, override) {
  if (!proposed) return base
  if (!base) return proposed
  const result = _mergeExcept(base, proposed, ['declared'])
  setIfValue(result, 'declared', override ? proposed.declared : SPDX.merge(proposed.declared, base.declared, 'AND'))
  return result
}

Issue analysis

The issue is best explained by inspecting the data from an example crate, crate/cratesio/-/quote/1.0.9.

  • Cargo.toml specifies that the crate is licensed as MIT OR Apache-2.0
  • The README.md also declares the license:
    Licensed under either of Apache License, Version 2.0 or MIT license at your option.
  • There are two license files containing the text of the MIT and Apache-2.0 licenses: LICENSE-MIT, LICENSE-APACHE

clearlydefined incorrectly declares the license as Apache-2.0 AND MIT, rather than the expected MIT OR Apache-2.0.

The licensed.declared field generated by the SummaryService for each of the tools is:

  • clearlydefined/1.20: MIT OR Apache-2.0 (correct)
  • licensee/9.13.0: Apache-2.0 AND MIT (incorrect)
  • scancode/3.2.2: Apache-2.0 AND MIT (incorrect)

The AggregatorService combines these licenses to form the final declared license: Apache-2.0 AND MIT (incorrect).
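
As a rough illustration of why ANDing a correct OR declaration with the incorrect AND declarations loses the choice, the sketch below models each expression as a list of alternatives and combines them logically. It is not the real SPDX.merge (its implementation is not shown in this issue); the helper names andExpressions and render are hypothetical.

```javascript
// Each expression is a list of alternatives (joined by OR); each alternative
// is a list of license ids that must all apply together (joined by AND).
const clearlydefined = [['MIT'], ['Apache-2.0']] // MIT OR Apache-2.0
const licensee = [['Apache-2.0', 'MIT']] // Apache-2.0 AND MIT
const scancode = [['Apache-2.0', 'MIT']] // Apache-2.0 AND MIT

// Hypothetical helper: AND two expressions by pairing every alternative of
// one with every alternative of the other, then dropping duplicate results.
function andExpressions(left, right) {
  const seen = new Set()
  const result = []
  for (const a of left) {
    for (const b of right) {
      const combined = [...new Set([...a, ...b])].sort()
      const key = combined.join(' AND ')
      if (!seen.has(key)) {
        seen.add(key)
        result.push(combined)
      }
    }
  }
  return result
}

// Hypothetical helper: turn the list-of-alternatives form back into text.
function render(expression) {
  return expression.map(alternative => alternative.join(' AND ')).join(' OR ')
}

const merged = [licensee, scancode].reduce(andExpressions, clearlydefined)
console.log(render(merged)) // Apache-2.0 AND MIT
```

Under this model, every alternative offered by the correct declaration is forced to include both licenses once it is ANDed with a declaration that requires both, so a single incorrect AND input is enough to erase the OR.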

The key question is therefore why the clearlydefined tool summary is correct, but the licensee and scancode summaries are incorrect.

clearlydefined summarizer

The clearlydefined summary code (providers/summary/clearlydefined.js) understands Rust packages and, as the final step in summarizing, uses the license value from Cargo.toml to correctly set licensed.declared:

[providers/summary/clearlydefined.js]
  addCrateData(result, data, coordinates) {
...
    const license = get(data, 'registryData.license')
    if (license) setIfValue(result, 'licensed.declared', SPDX.normalize(license.split('/').join(' OR ')))
    ...
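
To illustrate the conversion in the excerpt above: Cargo.toml has historically allowed a slash-separated form such as MIT/Apache-2.0 as well as an SPDX expression, and the split/join presumably exists to handle that older form. The sketch below isolates just that step. The function name is hypothetical, and SPDX.normalize (part of clearlydefined, not reproduced here) is left out, with a simple trim standing in for it.

```javascript
// Hypothetical stand-alone version of the split/join step from addCrateData.
// Assumption: the input is the raw `license` string from the crate metadata.
function cargoLicenseToExpression(license) {
  return license
    .split('/') // legacy Cargo form: "MIT/Apache-2.0"
    .map(part => part.trim()) // tidy whitespace around each part
    .join(' OR ') // a slash means the user may choose either license
}

console.log(cargoLicenseToExpression('MIT/Apache-2.0'))
console.log(cargoLicenseToExpression('MIT OR Apache-2.0'))
console.log(cargoLicenseToExpression('MPL-2.0/MIT/Apache-2.0'))
```

An expression that already uses OR contains no slash, so it passes through unchanged.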

licensee summarizer

The licensee summarizer iterates through all the package files, extracts any license references discovered, then combines the unique license names using AND:

[providers/summary/licensee.js]
  _addLicenseFromFiles(result, coordinates) {
    if (!result.files) return
    const licenses = result.files
      .map(file => (isDeclaredLicense(file.license) && isLicenseFile(file.path, coordinates) ? file.license : null))
      .filter(x => x)
    setIfValue(result, 'licensed.declared', uniq(licenses).join(' AND '))
  }

The harvested licensee data looks like this:

licensee.output.content.licenses:
[
  { spdx_id: "Apache-2.0",... },
  { spdx_id: "MIT",... },
  { spdx_id: "NOASSERTION",... },
]

licensee.output.content.matched_files:
[
  { filename: 'LICENSE-APACHE',... },
  { filename: 'LICENSE-MIT',... },
  { filename: 'Cargo.toml',... },
]

This shows that it has detected:

  • An Apache-2.0 reference in the LICENSE-APACHE file
  • An MIT reference in the LICENSE-MIT file
  • NOASSERTION in Cargo.toml

It combines the discovered Apache-2.0 and MIT references using AND (as per the code above).
This logic may be correct for source files within a package that use different licenses. However, in this case the logic is wrong: the license references are detected only in the license files, and the summarizer takes no account of the fact that the user may choose either of these licenses (as described in Cargo.toml and README.md).
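
The behaviour can be reproduced in isolation with a simplified model of _addLicenseFromFiles. In this sketch the helpers isDeclaredLicense, isLicenseFile and uniq are stand-ins written for illustration (the real ones live in the clearlydefined codebase and are not shown in this issue), and the file list is a hand-written approximation of the summarized data for quote 1.0.9.

```javascript
// Stand-ins for helpers the excerpt calls but does not show (assumptions).
const isDeclaredLicense = license => Boolean(license) && license !== 'NOASSERTION'
const isLicenseFile = path => /^LICEN[CS]E/i.test(path)
const uniq = values => [...new Set(values)]

// Simplified model of the licensee summarizer's declaration logic:
// keep the license of every license file, then join the unique ids with AND.
function declaredFromFiles(files) {
  const licenses = files
    .map(file => (isDeclaredLicense(file.license) && isLicenseFile(file.path) ? file.license : null))
    .filter(x => x)
  return uniq(licenses).join(' AND ')
}

// Hand-written approximation of the per-file results for the example crate.
const quoteFiles = [
  { path: 'LICENSE-APACHE', license: 'Apache-2.0' },
  { path: 'LICENSE-MIT', license: 'MIT' },
  { path: 'Cargo.toml', license: 'NOASSERTION' }
]

console.log(declaredFromFiles(quoteFiles)) // Apache-2.0 AND MIT
```

Nothing in this path consults Cargo.toml or the README, so the presence of two license files is always read as "both apply".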

scancode summarizer

The scancode summarizer calculates the declared license as follows:

[providers/summary/scancode.js]
    const declaredLicense =
      this._readDeclaredLicense(harvested) || this._getDeclaredLicense(scancodeVersion, harvested, coordinates)
...
  _readDeclaredLicense(harvested) {
    const declared = get(harvested, 'content.summary.packages[0].declared_license')
    return SPDX.normalize(declared)
  }
...
  _getDeclaredLicense(scancodeVersion, harvested, coordinates) {
    const rootFile = this._getRootFiles(coordinates, harvested.content.files)
    switch (scancodeVersion) {
...
      case '3.0.2':
        return this._getLicenseByIsLicenseText(rootFile)
...
    }
  }

  _getLicenseByIsLicenseText(files) {
    const fullLicenses = files
      .filter(file => file.is_license_text && file.licenses)
      .reduce((licenses, file) => {
        file.licenses.forEach(license => {
          licenses.add(this._createExpressionFromLicense(license))
        })
        return licenses
      }, new Set())
    return this._joinExpressions(fullLicenses)
  }

  _joinExpressions(expressions) {
    if (!expressions) return null
    const list = setToArray(expressions)
    if (!list) return null
    return list.join(' AND ')
  }

This iterates through the package's root files, keeps only those that are license texts, then combines all of their licenses with AND.
Looking at the raw harvested data, the files that match are again LICENSE-APACHE and LICENSE-MIT, resulting in an incorrect declared license of Apache-2.0 AND MIT.

fossology summarizer

Although the fossology summarizer is not used in this example, its code is very similar to the scancode summarizer - returning the licenses of all the license text files it finds:

[providers/summary/fossology.js]
  _declareLicense(coordinates, result) {
    if (!result.files) return
    // if we know this is a license file by the name of it and it has a license detected in it
    // then let's declare the license for the component
    const licenses = uniq(
      result.files.filter(file => file.license && isLicenseFile(file.path, coordinates)).map(file => file.license)
    )
    setIfValue(result, 'licensed.declared', licenses.join(' AND '))
  }

This would therefore also give an incorrect license declaration in this case (if the tool had run).

Possible fixes/workarounds

  • For Rust crates, only use the results from the clearlydefined summarizer
  • For Rust crates, modify the non-clearlydefined summarizers to ignore license text files, and only declare licenses on other files if detected
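
A minimal sketch of the first option, assuming the per-tool summaries are available as a plain object keyed by tool name. The function name chooseDeclared, the shape of the summaries object and the mergeWithAnd callback are all hypothetical; this is not taken from the clearlydefined codebase or from any actual fix.

```javascript
// Hypothetical: for crates, trust the clearlydefined summarizer's declaration
// (derived from the crate metadata) and skip the AND merge entirely.
// `mergeWithAnd` stands in for the existing aggregation behaviour.
function chooseDeclared(coordinates, summaries, mergeWithAnd) {
  const registryDeclared = summaries.clearlydefined && summaries.clearlydefined.declared
  if (coordinates.type === 'crate' && registryDeclared) return registryDeclared
  return mergeWithAnd(Object.values(summaries).map(summary => summary.declared))
}

// Declarations reported above for crate/cratesio/-/quote/1.0.9.
const summaries = {
  clearlydefined: { declared: 'MIT OR Apache-2.0' },
  licensee: { declared: 'Apache-2.0 AND MIT' },
  scancode: { declared: 'Apache-2.0 AND MIT' }
}

// Placeholder for the current merge, hard-coded to the result observed above.
const existingMerge = () => 'Apache-2.0 AND MIT'

console.log(chooseDeclared({ type: 'crate', name: 'quote' }, summaries, existingMerge)) // MIT OR Apache-2.0
console.log(chooseDeclared({ type: 'npm', name: 'quote' }, summaries, existingMerge)) // Apache-2.0 AND MIT
```

The trade-off is that any license detected only by the scanners is ignored for crates, which is the reason the second option keeps scanner results for files other than the license texts.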

Additional notes

I have added some tests to enable testing/investigation of this issue with the help of a debugger. I'll push these changes in a branch and add a link here.

@johnbatty
Contributor Author

johnbatty commented Jun 22, 2021

My work-in-progress test code for reproing this problem is here:
#858

I plan to clean this up and get it upstreamed, as I think the tests will be useful to include.

@pombredanne
Member

FWIW, all these are detected correctly in the latest ScanCode

@nellshamrell
Contributor

This was fixed with #858, closing.
