-
-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix license discovery: split at dashes for word-based matching #543
base: main
Are you sure you want to change the base?
Conversation
I fixed the problem in a similar way 9ba91d8 Is the tokenizer wrong (remove dash from token characters?) |
The rapidfuzz tokenizer only splits at spaces, not at other characters. This is the relevant source code. Regarding your fix: I think it confuses the following pairs of licenses:
I obtained this list using a little Python script: import urllib.request
import json
import re
if __name__ == '__main__':
with urllib.request.urlopen("https://github.com/spdx/license-list-data/raw/main/json/licenses.json") as url:
data = json.load(url)
for i, first_license in enumerate(data["licenses"]):
first_id = first_license["licenseId"]
first_bag = set(re.findall(r'\w+', first_id.lower()))
for second_license in data["licenses"][i+1:]:
second_id = second_license["licenseId"]
second_bag = set(re.findall(r'\w+', second_id.lower()))
if first_bag == second_bag:
print(first_id, second_id) Probably we should add a test like this as well. |
I now ensured in an additional tests that license IDs are always mapped to themselves |
@marcelotrevisani I think the failing tests are not related to the changes I made in this PR |
yes, they are not related indeed, gonna take a look and I will try to fix them later. |
can you rebase this branch to main please? |
Co-authored-by: Marcelo Duarte Trevisani <[email protected]>
706496c
to
a46e072
Compare
Description
I noticed the following test is currently failing on main:
license/test_discovery/test_discovery.test_short_license_id
for("3-Clause BSD License", "BSD-3-Clause")
.Reason: There is a license called
DEC-3-Clause
which is as similar asBSD-3-Clause
to3-Clause BSD License
regarding the similarity measures we currently use. Splitting at dashes is a good way of handling cases correctly in which the license name is given without dashes.It should be noted that the number of tests is very sparse so there is a possibility of breaking something else with this change. I still think it is safe to proceed since we are only modifying the behavior in situations we are unsure anyway.
Additional Note
The other tests failing on main are not related to this PR.