[TabularNLPAutoML] Add the ability to pass text features directly to CatBoost #141

Open
EmotionEngineer opened this issue Nov 6, 2023 · 1 comment
Labels
bug Something isn't working

EmotionEngineer commented Nov 6, 2023

🐛 Bug

Comparing two notebooks that use text features, one with LightAutoML (LAMA) and one with plain CatBoost, I get a significantly higher test RMSE with LAMA.
I have tried everything I could think of: leaving only CatBoost in LAMA and adjusting the CatBoost parameters manually. Maybe something is wrong with my LAMA setup?

To Reproduce

CatBoost Notebook
LAMA Notebook

Expected behavior

Comparable accuracy to CatBoost when using LightAutoML

EmotionEngineer added the bug label on Nov 6, 2023
EmotionEngineer changed the title from "Low accuracy when using text features" to "[TabularNLPAutoML] Add the ability to pass text features directly to CatBoost" on Nov 9, 2023

EmotionEngineer commented Nov 10, 2023

I've traced the issue to CatBoost receiving embedding-encoded numeric values from LightAutoML instead of the raw text features. In my case, passing the columns to CatBoost directly via 'text_features' yields better results than using embeddings or TF-IDF from LightAutoML.

I suggest extending the handling of the 'text_features' parameter for CatBoost with a 'direct' option, so that users can leverage CatBoost's built-in text processing for improved performance.
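
For reference, here is a minimal sketch of what "direct" text handling looks like on the plain CatBoost side, i.e. the baseline that the LAMA pipeline is being compared against. It is not LightAutoML code and does not show how the proposed option would be implemented. The helper names (`text_feature_indices`, `fit_catboost_on_raw_text`), the column names, and the iteration count are made up for illustration; only `Pool`, `CatBoostRegressor`, and the `text_features` argument come from CatBoost itself, and text-feature support for regression is assumed to be available in the installed CatBoost version.

```python
def text_feature_indices(columns, text_columns):
    """Return the positional indices of the text columns, in column order."""
    missing = [c for c in text_columns if c not in columns]
    if missing:
        raise ValueError(f"unknown text columns: {missing}")
    wanted = set(text_columns)
    return [i for i, c in enumerate(columns) if c in wanted]


def fit_catboost_on_raw_text(rows, targets, columns, text_columns, iterations=200):
    """Fit a CatBoost regressor that tokenizes the raw text columns itself.

    `rows` is assumed to be a list of lists (or a DataFrame) whose text
    columns still hold plain strings, not embeddings or TF-IDF vectors.
    """
    # Imported lazily so the helper above works without catboost installed.
    from catboost import CatBoostRegressor, Pool

    pool = Pool(
        data=rows,
        label=targets,
        # Tell CatBoost which columns to process with its built-in text pipeline.
        text_features=text_feature_indices(columns, text_columns),
        feature_names=list(columns),
    )
    model = CatBoostRegressor(
        iterations=iterations,  # illustrative value, not a tuned setting
        loss_function="RMSE",
        verbose=False,
    )
    model.fit(pool)
    return model
```

With hypothetical columns `["price", "title", "description"]`, calling `fit_catboost_on_raw_text(rows, y, columns, ["title", "description"])` would hand the two string columns to CatBoost untouched, which is the behaviour the requested 'direct' option would expose from LightAutoML.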
