
Enable the OneHotEncoder to be able to drop categories #339

Merged
merged 1 commit into from
Dec 26, 2024

Conversation

27pchrisl
Contributor

Hi,

I've been working with a sparse dataset in which the '?' category should really be encoded with none of the generated features hot when using the OneHotEncoder.

This contribution adds this as a backwards-compatible option to the encoder.
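For readers unfamiliar with the idea, here is a minimal sketch of what "none of the generated features being hot" means for a single categorical column. The function name `oneHotWithDrop` and its signature are hypothetical and written only for illustration; this is not the code merged in this pull request.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not the Rubix ML implementation): one-hot encode a
 * single categorical column, leaving dropped categories as all zeros.
 *
 * @param list<string> $column
 * @param list<string> $drop
 * @return list<list<int>>
 */
function oneHotWithDrop(array $column, array $drop = []) : array
{
    // Give each distinct, non-dropped category an offset in order of
    // first appearance.
    $categories = array_values(array_unique(array_diff($column, $drop)));
    $offsets = array_flip($categories);

    $encoded = [];

    foreach ($column as $value) {
        $row = array_fill(0, count($offsets), 0);

        // Dropped categories have no offset, so their row stays all zeros.
        if (isset($offsets[$value])) {
            $row[$offsets[$value]] = 1;
        }

        $encoded[] = $row;
    }

    return $encoded;
}

$encoded = oneHotWithDrop(['email', '?', 'phone', 'email'], ['?']);
// $encoded is [[1, 0], [0, 0], [0, 1], [1, 0]]
```

With an empty `$drop` list the sketch behaves like an ordinary one-hot encoding, which is the backwards-compatible case described above.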

Member

@andrewdalpino andrewdalpino left a comment


It's a nice little change, @27pchrisl, thanks! I'll have to give some thought to whether it's the best way to accomplish the end goal.

/**
 * @param string|array $drop The categories to drop during encoding
 */
public function __construct($drop = [])
Member


Although I think it's nice to be liberal with the type of this argument, our convention is to prefer strict types unless a looser type is necessary, and in this case it isn't.
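As a rough sketch of what a strictly typed version of this argument could look like, assuming the argument is narrowed to an array of strings: the class name `DroppingEncoderSketch` and the `drop()` accessor are hypothetical stand-ins, not the merged code.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in class for illustration only.
final class DroppingEncoderSketch
{
    /**
     * The categories to drop during encoding.
     *
     * @var list<string>
     */
    private array $drop;

    /**
     * @param string[] $drop The categories to drop during encoding
     * @throws InvalidArgumentException
     */
    public function __construct(array $drop = [])
    {
        // The native array type cannot constrain element types, so the
        // elements are checked by hand.
        foreach ($drop as $category) {
            if (!is_string($category)) {
                throw new InvalidArgumentException(
                    'Category must be a string, ' . gettype($category) . ' given.'
                );
            }
        }

        $this->drop = array_values($drop);
    }

    /**
     * @return list<string>
     */
    public function drop() : array
    {
        return $this->drop;
    }
}

$encoder = new DroppingEncoderSketch(['?']);
```

A caller who wants to drop a single category passes `['?']` rather than `'?'`; passing a bare string raises a `TypeError` instead of being silently accepted.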

@27pchrisl
Contributor Author

Thanks Andrew!

@andrewdalpino
Member

andrewdalpino commented Jul 9, 2024

Hey @27pchrisl, I'm interested to know if you've thought of other approaches, for example filtering specific categories from the dataset before one-hot encoding it. Would a "CategoryDropper" Transformer allow for the same outcome when paired with OneHotEncoder but also serve other useful purposes? I get that you'd have to replace the category with something (perhaps a missing data placeholder, e.g. '?'), so it's not really "dropping" the category, but maybe this could be handled by making OneHotEncoder "missing data aware" so that it ignores those data.
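To make the alternative concrete, here is a minimal sketch of the replacement step such a "CategoryDropper" might perform before encoding. No such Transformer exists in this thread; the function name `dropCategories`, its parameters, and the default placeholder are assumptions for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical "CategoryDropper" step: replace the listed categories with a
 * missing data placeholder in every column of the dataset.
 *
 * @param list<list<string>> $samples
 * @param list<string> $categories
 * @return list<list<string>>
 */
function dropCategories(array $samples, array $categories, string $placeholder = '?') : array
{
    // Flip the list into a lookup table so each membership check is O(1).
    $lookup = array_flip($categories);

    return array_map(
        fn (array $sample) : array => array_map(
            fn (string $value) : string => isset($lookup[$value]) ? $placeholder : $value,
            $sample
        ),
        $samples
    );
}

$samples = [
    ['unknown', 'gold'],
    ['email', 'n/a'],
];

$cleaned = dropCategories($samples, ['unknown', 'n/a']);
// $cleaned is [['?', 'gold'], ['email', '?']]
```

A "missing data aware" encoder would then only need to leave the placeholder un-hot, rather than accept an arbitrary list of categories itself.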

I think if we can rule out there being any better alternative to handling the "dropping" of categories in the OneHotEncoder, then this is a go.

Also, I'm just a tiny bit concerned about there being no discrimination between feature columns here, for instance if the same set of categories were used to describe different features. You wouldn't have control over which columns to operate on; it would always be all of them. This is not a deal-breaker for me though - just something we would want to make special note of in the documentation.
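The column concern can be illustrated with a small sketch in which the categories to drop are keyed by column offset, so the same label can be treated as missing in one column and kept in another. The function `dropPerColumn` and its parameter shape are hypothetical and are not part of this pull request, which applies the drop list to all columns.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical per-column variant: replace a category with a placeholder
 * only in the columns it is listed for.
 *
 * @param list<list<string>> $samples
 * @param array<int, list<string>> $dropByColumn Column offset => categories to drop
 * @return list<list<string>>
 */
function dropPerColumn(array $samples, array $dropByColumn, string $placeholder = '?') : array
{
    foreach ($samples as $i => $sample) {
        foreach ($sample as $column => $value) {
            // Columns without an entry have nothing dropped.
            if (in_array($value, $dropByColumn[$column] ?? [], true)) {
                $samples[$i][$column] = $placeholder;
            }
        }
    }

    return $samples;
}

$samples = [
    ['none', 'none'],
    ['email', 'gold'],
];

// Treat 'none' as missing in column 0 only; in column 1 it stays a category.
$cleaned = dropPerColumn($samples, [0 => ['none']]);
// $cleaned is [['?', 'none'], ['email', 'gold']]
```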

@27pchrisl
Contributor Author

27pchrisl commented Jul 9, 2024

Hi @andrewdalpino, yep, I agree that if you have a feature where many categories should not be hot, the author should transform that outside of the OneHotEncoder so it can just do its own job, similar to preparing the data with the MissingDataImputer. Then the OHE only needs to be told which single category should be dropped, probably defaulting to '?'.

I took inspiration for the signature from scikit-learn, which probably isn't the best source since Python libraries tend to really overload their parameters ☺️

I'm using a very sparse dataset (CRM data), so I definitely need the capability for a none-hot category to prevent the model from thinking that the absence of a category is a category in itself. Absence represents poor quality data rather than a deliberate choice. My goal was to have the model ignore the feature in that case.

@andrewdalpino andrewdalpino changed the base branch from master to 2.6 December 26, 2024 18:31
@andrewdalpino andrewdalpino merged commit 90e8122 into RubixML:2.6 Dec 26, 2024
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 26, 2024