Proposal: Generalized Numericalizer interface #222

Open
ivansmokovic opened this issue Nov 3, 2020 · 3 comments

@ivansmokovic
Collaborator

As mentioned on Slack, I propose adding a generalized numericalizer interface that would let users easily apply more advanced numericalization methods such as word2vec embeddings, TF-IDF, and so on. The existing Vocab class already fits this interface, so no big changes would be required. The interface would look like this:

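from abc import ABC

class SmartNumericalizer(ABC):

    def update(self, tokens):
        pass  # record tokens seen in the dataset

    def finalize(self):
        pass  # finish setup once all tokens have been seen

    def numericalize(self, tokens):
        pass  # map tokens to their numerical representation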

The name is just a placeholder for now.

One implementation could be a Word2Vec numericalizer that remembers which tokens appeared in the dataset (through the update method, similarly to what Vocab does now) and loads the vectors for those tokens when finalize is called. I assume TF-IDF could be implemented in a similar fashion.
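
A minimal sketch of that idea, under stated assumptions: `VectorNumericalizer` is a hypothetical name, `vector_source` is a plain dict standing in for a pretrained word2vec store, and unseen or unknown tokens fall back to a zero vector. None of this is existing library code; it only illustrates the update / finalize / numericalize flow.

```python
from abc import ABC


class SmartNumericalizer(ABC):
    """Placeholder interface from the proposal."""

    def update(self, tokens):
        pass

    def finalize(self):
        pass

    def numericalize(self, tokens):
        pass


class VectorNumericalizer(SmartNumericalizer):
    """Hypothetical word-vector numericalizer (illustration only)."""

    def __init__(self, vector_source, dim):
        # vector_source: assumed mapping token -> vector (stand-in for word2vec)
        self._source = vector_source
        self._dim = dim
        self._seen = set()
        self._vectors = None  # filled in by finalize()

    def update(self, tokens):
        # Remember which tokens appear in the dataset.
        if self._vectors is not None:
            raise RuntimeError("numericalizer is already finalized")
        self._seen.update(tokens)

    def finalize(self):
        # Load vectors only for tokens that were actually seen.
        self._vectors = {
            t: self._source[t] for t in self._seen if t in self._source
        }

    def numericalize(self, tokens):
        if self._vectors is None:
            raise RuntimeError("call finalize() before numericalize()")
        zero = [0.0] * self._dim
        # Tokens without a loaded vector map to the zero vector.
        return [self._vectors.get(t, zero) for t in tokens]


source = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "emu": [0.5, 0.5]}
num = VectorNumericalizer(source, dim=2)
num.update(["cat", "dog", "yak"])
num.finalize()
result = num.numericalize(["dog", "emu", "cat"])
```

Here "emu" exists in the source but was never passed to update, so it is not loaded and maps to the zero vector, which is the memory-saving behaviour the proposal describes.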

The main reason for implementing this would be to make more advanced numericalization straightforward and avoid user intervention after batching (as is required now).

@ivansmokovic ivansmokovic added the feature New feature or request label Nov 3, 2020
@ivansmokovic ivansmokovic self-assigned this Nov 3, 2020
@mariosasko
Collaborator

Before your rework of the Field class, my idea was to define the interface for numericalization/batching, so I partially like this proposal. The existing Vocab fits nicely into the interface and can easily subclass it. IMO, it makes sense to make the update and finalize methods optional. That way, overriding the numericalize method alone is enough to define a simple numericalizer.
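
One way this could look, as a sketch rather than a settled design: `update` and `finalize` get no-op defaults and only `numericalize` is abstract. The class names `Numericalizer` and `LengthNumericalizer` are hypothetical and chosen just for illustration.

```python
from abc import ABC, abstractmethod


class Numericalizer(ABC):
    def update(self, tokens):
        """Optional hook; does nothing by default."""

    def finalize(self):
        """Optional hook; does nothing by default."""

    @abstractmethod
    def numericalize(self, tokens):
        """Required: convert tokens to numbers."""


class LengthNumericalizer(Numericalizer):
    # Only numericalize is overridden; the hooks keep their no-op defaults.
    def numericalize(self, tokens):
        return [len(t) for t in tokens]


lengths = LengthNumericalizer()
lengths.update(["ignored"])  # safe to call, has no effect
lengths.finalize()
out = lengths.numericalize(["word", "to", "vec"])
```

With this shape, stateful numericalizers such as Vocab override all three methods, while stateless ones need only one.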

@ivansmokovic
Collaborator Author

I agree. But take into account that the simple numericalizer use-case is already covered by passing a Callable as a numericalizer.

@mariosasko
Collaborator

I know, I'm just saying it would be nice to support both approaches.
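
Supporting both could be as small as the following sketch. `resolve_numericalizer` and `UpperCaseFlag` are hypothetical names, not existing API; the assumption is that the calling code only needs a function from tokens to numbers.

```python
def resolve_numericalizer(numericalizer):
    """Return a tokens -> numbers function (hypothetical helper)."""
    method = getattr(numericalizer, "numericalize", None)
    if callable(method):
        # Interface-style object: use its numericalize method.
        return method
    if callable(numericalizer):
        # Plain callable: use it directly.
        return numericalizer
    raise TypeError("expected a callable or an object with numericalize()")


class UpperCaseFlag:
    # Interface-style numericalizer: 1 for all-uppercase tokens, else 0.
    def numericalize(self, tokens):
        return [int(t.isupper()) for t in tokens]


from_callable = resolve_numericalizer(lambda tokens: [len(t) for t in tokens])
from_object = resolve_numericalizer(UpperCaseFlag())
a = from_callable(["ab", "c"])
b = from_object(["AB", "c"])
```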
