Possible errors in the eurlex_train.txt and eurlex_test.txt - missing labels? #53

CarloNicolini · 2024-02-14T10:44:38Z

I was trying to load the eurlex_train.txt and eurlex_test.txt.
As far as I understand, they are in the LibSVM format for multilabel classification.

Loading them with sklearn.datasets.load_svmlight_file fails, though.
I've observed that the eurlex_train.txt file contains 28 rows with no label, where the line starts with a space.

If you run the following command

cat eurlex_train.txt | grep -n "^ " | cut -d ':' -f 1

it prints the line numbers of the 28 rows in eurlex_train.txt where the labels are missing.

Despite this, training with the Rust CLI (and with the Python wrapper too) works without errors.
I've observed that a check for the presence of labels is performed in omikuji/src/data.rs by the parse_xc_repo_data_line function.
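To illustrate what such a check might look like, here is a minimal Python sketch of a line parser that tolerates rows without labels. It is not a port of the Rust code: the function name parse_line is hypothetical, and it assumes each row has the form "l1,l2 idx:val idx:val ..." with integer labels, where a first token containing ':' means the label list is absent.

```python
from typing import Dict, List, Tuple


def parse_line(line: str) -> Tuple[List[int], Dict[int, float]]:
    """Parse one 'l1,l2 idx:val idx:val ...' row; the labels may be absent."""
    tokens = line.split()
    labels: List[int] = []
    # A first token without ':' is taken to be the comma-separated label list.
    # If it contains ':', it is already a feature and the row has no labels.
    if tokens and ":" not in tokens[0]:
        labels = [int(tok) for tok in tokens[0].split(",") if tok]
        tokens = tokens[1:]
    features: Dict[int, float] = {}
    for tok in tokens:
        index, value = tok.split(":", 1)
        features[int(index)] = float(value)
    return labels, features


print(parse_line("3,7 0:0.5 12:1.25"))  # ([3, 7], {0: 0.5, 12: 1.25})
print(parse_line(" 0:0.5 12:1.25"))     # ([], {0: 0.5, 12: 1.25})
```

With this approach a row that starts with a space simply yields an empty label list instead of raising an error.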

Since it seems I cannot rely on the very good sklearn.datasets.load_svmlight_file, what label should I assign to those rows?
In a first simple implementation I decided to skip missing-label rows.
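For reference, skipping those rows can be sketched roughly as follows. The helper name drop_unlabeled is hypothetical, and the sketch assumes that an unlabeled row is any non-blank line that starts with whitespace (slightly broader than the grep pattern above, which matches a leading space only).

```python
from typing import Iterable, List, Tuple


def drop_unlabeled(lines: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Return (kept rows, 1-based line numbers of the skipped rows)."""
    kept: List[str] = []
    skipped: List[int] = []
    for number, line in enumerate(lines, start=1):
        # Assumption: a non-blank line with leading whitespace has no labels.
        if line[:1].isspace() and line.strip():
            skipped.append(number)
        else:
            kept.append(line)
    return kept, skipped


sample = ["1,2 0:1.0\n", " 3:0.5\n", "4 1:2.0\n"]
kept, skipped = drop_unlabeled(sample)
print(skipped)  # [2]
```

If the file begins with a header line that stores the number of rows (an assumption about the format, not something I have verified for these files), that count would need to be updated after dropping rows.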
