Skip to content

scikit-learn sensitive data leakage vulnerability

Moderate severity GitHub Reviewed Published Jun 6, 2024 to the GitHub Advisory Database • Updated Oct 25, 2024

Package

pip scikit-learn (pip)

Affected versions

< 1.5.0

Patched versions

1.5.0

Description

A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer, specifically in versions up to and including 1.4.1.post1, which was fixed in version 1.5.0. The vulnerability arises from the unexpected storage of all tokens present in the training data within the stop_words_ attribute, rather than only storing the subset of tokens required for the TF-IDF technique to function. This behavior leads to the potential leakage of sensitive information, as the stop_words_ attribute could contain tokens that were meant to be discarded and not stored, such as passwords or keys. The impact of this vulnerability varies based on the nature of the data being processed by the vectorizer.

References

Published by the National Vulnerability Database Jun 6, 2024
Published to the GitHub Advisory Database Jun 6, 2024
Reviewed Jun 17, 2024
Last updated Oct 25, 2024

Severity

Moderate

CVSS overall score

This score calculates overall vulnerability severity from 0 to 10 and is based on the Common Vulnerability Scoring System (CVSS).
/ 10

CVSS v3 base metrics

Attack vector
Network
Attack complexity
High
Privileges required
Low
User interaction
None
Scope
Unchanged
Confidentiality
High
Integrity
None
Availability
None

CVSS v3 base metrics

Attack vector: More severe the more the remote (logically and physically) an attacker can be in order to exploit the vulnerability.
Attack complexity: More severe for the least complex attacks.
Privileges required: More severe if no privileges are required.
User interaction: More severe when no user interaction is required.
Scope: More severe when a scope change occurs, e.g. one vulnerable component impacts resources in components beyond its security scope.
Confidentiality: More severe when loss of data confidentiality is highest, measuring the level of data access available to an unauthorized user.
Integrity: More severe when loss of data integrity is the highest, measuring the consequence of data modification possible by an unauthorized user.
Availability: More severe when the loss of impacted component availability is highest.
CVSS:3.0/AV:N/AC:H/PR:L/UI:N/S:U/C:H/I:N/A:N

EPSS score

0.043%
(11th percentile)

CVE ID

CVE-2024-5206

GHSA ID

GHSA-jw8x-6495-233v
Loading Checking history
See something to contribute? Suggest improvements for this vulnerability.