Possible errors in the eurlex_train.txt and eurlex_test.txt - missing labels? #53

Open
CarloNicolini opened this issue Feb 14, 2024 · 0 comments

I was trying to load eurlex_train.txt and eurlex_test.txt.
As far as I understand, they are in the LibSVM format for multilabel classification.

Loading them with sklearn.datasets.load_svmlight_file fails, though.
I've noticed that eurlex_train.txt contains 28 rows with no label, i.e. rows where the line starts with a space.

If you run the following command

grep -n "^ " eurlex_train.txt | cut -d ':' -f 1

it prints the line numbers of the 28 rows in eurlex_train.txt where the labels are missing:

95
254
511
1529
1941
1955
4031
4428
4645
4729
5233
5764
6297
6335
6705
7085
9479
9677
10001
10490
10738
10912
11676
12282
12601
13149
14169
14724
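
For anyone who wants to reproduce the check from Python rather than the shell, here is a minimal sketch. The function name find_unlabeled_lines is my own and not part of omikuji or sklearn; it applies the same test as the grep pattern above (the line starts with a space, so the label field is empty).

```python
from pathlib import Path


def find_unlabeled_lines(path):
    """Return the 1-based numbers of lines that start with a space.

    Hypothetical helper: mirrors `grep -n "^ "`, i.e. it flags lines whose
    label field (the text before the first space) is empty.
    """
    numbers = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith(" "):
                numbers.append(number)
    return numbers
```

Calling find_unlabeled_lines("eurlex_train.txt") should give the same line numbers as the grep command, assuming the file is read with the same encoding.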

Despite this, training with the Rust CLI (and with the Python wrapper too) works without problems.
I noticed that the parse_xc_repo_data_line function in omikuji/src/data.rs checks whether a line contains labels.

Since it seems I cannot rely on the very good sklearn.datasets.load_svmlight_file, what label should I assign to those rows?
As a first, simple workaround I decided to skip the rows with missing labels.
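
To make the workaround concrete, this is roughly what a hand-written loader could look like. It is only a sketch under assumptions I have not verified against the omikuji sources: that each data line is "label1,label2,... index:value index:value ...", and that the first line of the file is a header that has to be skipped (the has_header flag turns that off). The names parse_line and load_rows are hypothetical.

```python
def parse_line(line):
    """Split one data line into (labels, features); labels may be empty."""
    # Everything before the first space is the comma-separated label field.
    label_field, _, feature_field = line.rstrip("\n").partition(" ")
    labels = [int(token) for token in label_field.split(",") if token]
    features = {}
    for pair in feature_field.split():
        index, _, value = pair.partition(":")
        features[int(index)] = float(value)
    return labels, features


def load_rows(path, skip_unlabeled=True, has_header=True):
    """Read the file into a list of (labels, features) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        if has_header:
            next(handle, None)  # assumption: the first line is a header
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            labels, features = parse_line(line)
            if skip_unlabeled and not labels:
                continue  # the "skip missing-label rows" workaround
            rows.append((labels, features))
    return rows
```

With skip_unlabeled=False the 28 rows are kept with an empty label list instead of being dropped, which may be the more faithful option if an empty label set is a legitimate target for this dataset.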
