An error occurred when using the xgboost as a classifier for hiclass #122

RamSnoussi opened this issue May 7, 2024 · 14 comments
@RamSnoussi

RamSnoussi commented May 7, 2024

Hi,
Below is my example using the xgboost classifier with hiclass. My question is specifically about the HiClass Python package for hierarchical classification. I would like to model the problem with a hierarchical classification approach, as in the figure below:

[Figure: dataset_structure]

How can I correct this error?

from hiclass import LocalClassifierPerParentNode
from xgboost import XGBClassifier
import numpy as np

X_train = np.array([
    [40.7,  1. ,  1. ,  2. ,  5. ,  2. ,  1. ,  5. , 34.3],
    [39.2,  0. ,  2. ,  4. ,  1. ,  3. ,  1. ,  2. , 34.1],
    [40.6,  0. ,  3. ,  1. ,  4. ,  5. ,  0. ,  6. , 27.7],
    [36.5,  0. ,  3. ,  1. ,  2. ,  2. ,  0. ,  2. , 39.9],
])

Y_train = np.array([
    ['Gastrointestinal', 'Norovirus', ''],
    ['Respiratory', 'Covid', ''],
    ['Allergy', 'External', 'Bee Allergy'],
    ['Respiratory', 'Cold', ''],
])
X_test = np.array([[35.5,  0. ,  1. ,  1. ,  3. ,  3. ,  0. ,  2. , 37.5]])
classifier = LocalClassifierPerParentNode(local_classifier=XGBClassifier(objective='multi:softmax'))
classifier.fit(X_train, Y_train)
predictions = classifier.predict(X_test)
print(predictions)

The output of the program:
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,005 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Gastrointestinal'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
2024-05-07 14:19:52,024 - LCPPN - WARNING - Fitting ConstantClassifier for node 'Norovirus'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 24
     21 classifier = LocalClassifierPerParentNode(local_classifier=XGBClassifier(objective='multi:softmax'))
     23 # Train local classifier per node
---> 24 classifier.fit(X_train, Y_train)
     26 # Predict
     27 predictions = classifier.predict(X_test)

File ~/anaconda3/lib/python3.9/site-packages/hiclass/LocalClassifierPerParentNode.py:109, in LocalClassifierPerParentNode.fit(self, X, y)
    106 self._initialize_local_classifiers()
    108 # Fit local classifiers in DAG
--> 109 self._fit_digraph()
    111 # TODO: Store the classes seen during fit
    112 
    113 # TODO: Add function to allow user to change local classifier
   (...)
    120 
    121 # Delete unnecessary variables
    122 self._clean_up()

File ~/anaconda3/lib/python3.9/site-packages/hiclass/LocalClassifierPerParentNode.py:348, in LocalClassifierPerParentNode._fit_digraph(self)
    346     self.hierarchy_.nodes[node]["classifier"] = ConstantClassifier()
    347     classifier = self.hierarchy_.nodes[node]["classifier"]
--> 348 classifier.fit(X, y)

File ~/anaconda3/lib/python3.9/site-packages/xgboost/core.py:730, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    728 for k, arg in zip(sig.parameters, args):
    729     kwargs[k] = arg
--> 730 return func(**kwargs)

File ~/anaconda3/lib/python3.9/site-packages/xgboost/sklearn.py:1471, in XGBClassifier.fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1466     expected_classes = self.classes_
   1467 if (
   1468     classes.shape != expected_classes.shape
   1469     or not (classes == expected_classes).all()
   1470 ):
-> 1471     raise ValueError(
   1472         f"Invalid classes inferred from unique values of `y`.  "
   1473         f"Expected: {expected_classes}, got {classes}"
   1474     )
   1476 params = self.get_xgb_params()
   1478 if callable(self.objective):

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1], got ['Respiratory::HiClass::Separator::Cold'
 'Respiratory::HiClass::Separator::Covid']
@tcsmaster

If you want to use the softmax objective, you have to encode your label to the range [0, num_classes), which you can't do inside hiclass.
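To make the requirement concrete, here is a small NumPy-only sketch that checks whether a label vector already has the form 0, 1, ..., num_classes - 1 and re-encodes it if not. The helper name `is_zero_based` is made up for illustration; it mirrors the check visible in the traceback above but is not XGBoost's own code.

```python
import numpy as np


def is_zero_based(y):
    """Return True if the unique values of y are exactly 0..n_classes-1."""
    classes = np.unique(y)
    return classes.dtype.kind in "iu" and np.array_equal(
        classes, np.arange(len(classes))
    )


y_strings = np.array(["Cold", "Covid", "Cold"])  # string labels, as in the post
y_offset = np.array([7, 9, 7])                   # integers, but not starting at 0

# np.unique with return_inverse yields integer codes that start at 0;
# classes[y_codes] maps the codes back to the original labels.
classes, y_codes = np.unique(y_strings, return_inverse=True)
```

Neither `y_strings` nor `y_offset` passes the check, while `y_codes` does.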

@mirand863
Collaborator

mirand863 commented May 10, 2024

Hi @RamSnoussi,

Thank you for your interest in HiClass. As mentioned by @tcsmaster, you need to encode your labels. Here is an example of how to do it, but be careful to call the method fit on the labels from all levels at the same time; after that you can encode each level individually with the method transform. I remember doing this encoding previously, so if you want I can look for my old code and share it with you.

@tcsmaster

If not @RamSnoussi, then I would really appreciate the code snippet.

@mirand863
Collaborator

mirand863 commented May 15, 2024

> If not @RamSnoussi, then I would really appreciate the code snippet.

The snippet I have is not well structured, but the algorithm goes like this:

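import numpy as np
from sklearn.preprocessing import LabelEncoder

# y holds the hierarchical labels, one row per sample and one column per level
np_y = np.array(y)  # convert y to a numpy array if it is not one already
# flatten and collect all unique labels from the hierarchy, plus the root node
flat_y = np.unique(np.append(np_y.flatten(), "hiclass::root"))

# encode labels in the hierarchy
label_encoder = LabelEncoder()
label_encoder.fit(flat_y)  # fit once on the labels from all levels
y = np.array(
    [label_encoder.transform(row) for row in np_y]  # encode each sample's path
)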

Then you can train the hierarchical classifier with the encoded labels and decode the labels after prediction with the method inverse_transform. Hope this helps :)
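As a rough end-to-end sketch of the encode and decode steps, here is the same idea applied to the Y_train array from the original post. The names `encoded` and `decoded` are illustrative, and training the hierarchical classifier is left out.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

Y_train = np.array([
    ["Gastrointestinal", "Norovirus", ""],
    ["Respiratory", "Covid", ""],
    ["Allergy", "External", "Bee Allergy"],
    ["Respiratory", "Cold", ""],
])

# Fit one encoder on every label from every level, plus the root node name.
flat_y = np.unique(np.append(Y_train.flatten(), "hiclass::root"))
label_encoder = LabelEncoder().fit(flat_y)

# Encode row by row so the (n_samples, n_levels) shape is preserved.
encoded = np.array([label_encoder.transform(row) for row in Y_train])

# After prediction, integer codes are mapped back to the original strings.
decoded = np.array([label_encoder.inverse_transform(row) for row in encoded])
```

Note that the codes are assigned over the whole hierarchy, so the children of one parent are not necessarily numbered from 0. With this data, Cold and Covid are encoded as 3 and 4.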

The code is available in this branch if you want to take a further look

@tcsmaster

Thank you for the code. Does this also mean that the model needs to encounter all available labels during training?

@RamSnoussi
Author

Hi @mirand863,
Despite encoding the labels at the beginning, I think that adding the separator '::HiClass::Separator::' causes the problem. See line 214 in 'LocalClassifierPerParentNode.py'.

@mirand863
Collaborator

> Thank you for the code. Does this also mean that the model needs to encounter all available labels during training?

Hi @tcsmaster,

Yes, the model needs to see as many labels as possible during training. Just be careful to not leak data in case you have to split between training/test data. We can also discuss this in private. Please, feel free to email me at [email protected]
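One practical consequence, assuming a scikit-learn LabelEncoder is used as in the snippet earlier in the thread: a label that the encoder never saw during fit cannot be transformed later. The label lists below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["Cold", "Covid", "Norovirus"]
encoder = LabelEncoder().fit(train_labels)

try:
    encoder.transform(["Bee Allergy"])  # label never seen by fit
    unseen_ok = True
except ValueError:
    # LabelEncoder rejects labels that were not present during fit.
    unseen_ok = False
```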

@mirand863
Collaborator

mirand863 commented May 23, 2024

> Hi @mirand863, Despite encoding the labels at the beginning, I think that adding the separator '::HiClass::Separator::' causes the problem. See line 214 in 'LocalClassifierPerParentNode.py'.

Hi @RamSnoussi,

Can you please clarify what the issue with the separator is? I was able to execute this code without errors.

@RamSnoussi
Author

Hi @mirand863,
The problem is how to encode the separator '::HiClass::Separator::' that hiclass adds implicitly. For example, the LCPPN at the 2nd level has 'Respiratory::HiClass::Separator::Covid' as the target label. If I use the LabelEncoder on each label in Y_train, e.g. 'Respiratory' as 0 and 'Covid' as 1, then during training this label becomes '0::HiClass::Separator::1' and it causes an error.
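The effect described here can be reproduced in plain Python. This is only a sketch: `join_path` and `last_level` are hypothetical helpers, not HiClass functions, and the separator value is taken from the error message at the top of the thread.

```python
SEPARATOR = "::HiClass::Separator::"  # value seen in the error message above


def join_path(parent, child, separator=SEPARATOR):
    """Hypothetical helper: build a combined node name from two labels."""
    return f"{parent}{separator}{child}"


def last_level(node, separator=SEPARATOR):
    """Hypothetical helper: recover the last label from a combined node name."""
    return str(node).split(separator)[-1]


# Integer-encoded labels turn into a string once they are joined.
combined = join_path(0, 1)
child = last_level(combined)
```

Here `combined` is the string '0::HiClass::Separator::1', and splitting it gives back the string '1' rather than the integer 1, so a classifier that expects integer classes would still receive text.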

@RamSnoussi
Author

Is there a way to use xgboost with hiclass?

@mirand863
Collaborator

mirand863 commented Jun 3, 2024

> Hi @mirand863, The problem is how to encode the separator '::HiClass::Separator::' that hiclass adds implicitly. For example, the LCPPN at the 2nd level has 'Respiratory::HiClass::Separator::Covid' as the target label. If I use the LabelEncoder on each label in Y_train, e.g. 'Respiratory' as 0 and 'Covid' as 1, then during training this label becomes '0::HiClass::Separator::1' and it causes an error.

Hi,

Sorry for the delay. Yes, the separator needs to be removed in this case. In the branch I sent you it has been removed, but that was not easy to see. Here is a full diff with the changes: main...cuml. If I remember correctly, you just need to remove the multiple occurrences of .split(self.separator_)[-1], but there may be other changes that I have forgotten. I would recommend reviewing the changes to see if they apply to your use case.

@RamSnoussi
Author

Hi @mirand863,
I applied the changes to the example given above, but the error still persists, as follows:

Pass `sample_weight` as keyword args.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 27
     25 X_test = np.array([[35.5,  0. ,  1. ,  1. ,  3. ,  3. ,  0. ,  2. , 37.5]])
     26 classifier = LocalClassifierPerParentNode(local_classifier=XGBClassifier(objective='multi:softmax'))
---> 27 classifier.fit(X_train, Y_train)
     28 predictions = classifier.predict(X_test)
     29 print(predictions)

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/LocalClassifierPerParentNode.py:112, in LocalClassifierPerParentNode.fit(self, X, y, sample_weight)
    109 super()._pre_fit(X, y, sample_weight)
    111 # Fit local classifiers in DAG
--> 112 super().fit(X, y)
    114 # TODO: Store the classes seen during fit
    115 
    116 # TODO: Add function to allow user to change local classifier
   (...)
    119 
    120 # Return the classifier
    121 return self

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/HierarchicalClassifier.py:136, in HierarchicalClassifier.fit(self, X, y, sample_weight)
    113 """
    114 Fit a local hierarchical classifier.
    115 
   (...)
    133     Fitted estimator.
    134 """
    135 # Fit local classifiers in DAG
--> 136 self._fit_digraph()
    138 # Delete unnecessary variables
    139 self._clean_up()

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/LocalClassifierPerParentNode.py:248, in LocalClassifierPerParentNode._fit_digraph(self, local_mode, use_joblib)
    246 self.logger_.info("Fitting local classifiers")
    247 nodes = self._get_parents()
--> 248 self._fit_node_classifier(nodes, local_mode, use_joblib)

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/HierarchicalClassifier.py:352, in HierarchicalClassifier._fit_node_classifier(self, nodes, local_mode, use_joblib)
    347         classifiers = Parallel(n_jobs=self.n_jobs)(
    348             delayed(self._fit_classifier)(self, node) for node in nodes
    349         )
    351 else:
--> 352     classifiers = [self._fit_classifier(self, node) for node in nodes]
    353 for classifier, node in zip(classifiers, nodes):
    354     self.hierarchy_.nodes[node]["classifier"] = classifier

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/HierarchicalClassifier.py:352, in <listcomp>(.0)
    347         classifiers = Parallel(n_jobs=self.n_jobs)(
    348             delayed(self._fit_classifier)(self, node) for node in nodes
    349         )
    351 else:
--> 352     classifiers = [self._fit_classifier(self, node) for node in nodes]
    353 for classifier, node in zip(classifiers, nodes):
    354     self.hierarchy_.nodes[node]["classifier"] = classifier

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/hiclass/LocalClassifierPerParentNode.py:237, in LocalClassifierPerParentNode._fit_classifier(self, node)
    235 if not self.bert:
    236     try:
--> 237         classifier.fit(X, y, sample_weight)
    238     except TypeError:
    239         classifier.fit(X, y)

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/xgboost/core.py:730, in require_keyword_args.<locals>.throw_if.<locals>.inner_f(*args, **kwargs)
    728 for k, arg in zip(sig.parameters, args):
    729     kwargs[k] = arg
--> 730 return func(**kwargs)

File ~/anaconda3/envs/hiclass/lib/python3.8/site-packages/xgboost/sklearn.py:1471, in XGBClassifier.fit(self, X, y, sample_weight, base_margin, eval_set, eval_metric, early_stopping_rounds, verbose, xgb_model, sample_weight_eval_set, base_margin_eval_set, feature_weights, callbacks)
   1466     expected_classes = self.classes_
   1467 if (
   1468     classes.shape != expected_classes.shape
   1469     or not (classes == expected_classes).all()
   1470 ):
-> 1471     raise ValueError(
   1472         f"Invalid classes inferred from unique values of `y`.  "
   1473         f"Expected: {expected_classes}, got {classes}"
   1474     )
   1476 params = self.get_xgb_params()
   1478 if callable(self.objective):

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1], got [7 9]

What is the problem here?

@RamSnoussi
Author

Hi @mirand863, when I use an older version of xgboost, such as 0.90, it works successfully.

@mirand863
Collaborator

mirand863 commented Jun 26, 2024

Hi @RamSnoussi ,

It seems to me that your xgboost classifier expects the classes to start from 0 for each classifier. I guess you would need to use a label encoder for each local classifier, separately. Please see https://stackoverflow.com/questions/71996617/invalid-classes-inferred-from-unique-values-of-y-expected-0-1-2-3-4-5-got for reference.
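One way to give each local classifier its own label encoder is to wrap the estimator. The class below is a hypothetical sketch, not part of HiClass or xgboost, and it has not been tested inside LocalClassifierPerParentNode. A DecisionTreeClassifier is used as a stand-in so the example runs without xgboost; the assumption is that XGBClassifier could be passed in the same way.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier  # stand-in for XGBClassifier


class LabelEncodedClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: re-encodes y to 0..n_classes-1 before fitting."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y, sample_weight=None):
        # A fresh encoder per fit, so every local classifier sees 0..n-1.
        self.encoder_ = LabelEncoder()
        y_encoded = self.encoder_.fit_transform(y)
        self.classes_ = self.encoder_.classes_  # original labels
        self.estimator_ = clone(self.estimator)
        if sample_weight is None:
            self.estimator_.fit(X, y_encoded)
        else:
            self.estimator_.fit(X, y_encoded, sample_weight=sample_weight)
        return self

    def predict(self, X):
        # Map the integer predictions back to the original labels.
        encoded = np.asarray(self.estimator_.predict(X)).astype(int)
        return self.encoder_.inverse_transform(encoded)

    def predict_proba(self, X):
        # Columns follow the encoded order, which matches classes_.
        return self.estimator_.predict_proba(X)


# Labels 7 and 9 mimic the "got [7 9]" case from the traceback above.
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([7, 7, 9, 9])
model = LabelEncodedClassifier(DecisionTreeClassifier(random_state=0)).fit(X, y)
pred = model.predict(np.array([[0.05], [1.05]]))
```

The wrapped estimator only ever sees the codes 0 and 1, while `pred` and `model.classes_` stay in the original label space.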

Good to know it works in an older version.

Best regards,
Fabio
