As mentioned on Slack, I propose adding a generalized `numericalizer` interface that would let users trivially plug in more advanced numericalization methods such as word2vec embeddings, TF-IDF and so on. The existing `Vocab` class fits perfectly into this interface, so no big changes would be required. The name is just a placeholder for now. The interface would look like this:
One implementation could be a Word2Vec numericalizer that remembers which tokens appeared in the dataset (through the `update` method, similar to what `Vocab` does now) and loads their vectors when `finalize` is called. I assume TF-IDF could be implemented in a similar fashion.
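As a rough illustration of the update/finalize flow described above, here is a minimal sketch. The method names `update`, `finalize` and `numericalize` come from this discussion; the class names, constructor arguments and the in-memory `pretrained` dict (standing in for a real word2vec file) are assumptions made only for this example, not the library's actual API.

```python
from abc import ABC, abstractmethod


class Numericalizer(ABC):
    """Hypothetical base interface; the name is a placeholder."""

    def update(self, tokens):
        """Observe the tokens of one example while the dataset is being built."""

    def finalize(self):
        """Called once after all examples have been seen."""

    @abstractmethod
    def numericalize(self, tokens):
        """Convert a list of tokens into their numerical representation."""


class EmbeddingNumericalizer(Numericalizer):
    """Toy stand-in for a Word2Vec numericalizer (assumed design)."""

    def __init__(self, pretrained, dim):
        # `pretrained` maps token -> list of floats; a real implementation
        # would read the vectors from disk instead.
        self._pretrained = pretrained
        self._dim = dim
        self._seen = set()
        self._vectors = None

    def update(self, tokens):
        # Only remember which tokens occur; nothing is loaded yet.
        self._seen.update(tokens)

    def finalize(self):
        # Load vectors only for tokens that actually appeared in the dataset.
        self._vectors = {
            token: self._pretrained[token]
            for token in self._seen
            if token in self._pretrained
        }

    def numericalize(self, tokens):
        if self._vectors is None:
            raise RuntimeError("finalize() must be called before numericalize()")
        zero = [0.0] * self._dim  # fallback for tokens without a loaded vector
        return [self._vectors.get(token, zero) for token in tokens]


pretrained = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "fish": [0.5, 0.5]}
numericalizer = EmbeddingNumericalizer(pretrained, dim=2)
numericalizer.update(["cat", "dog", "bird"])
numericalizer.finalize()
vectors = numericalizer.numericalize(["cat", "bird"])
```

Because `fish` never appeared in an `update` call, its vector is not loaded in `finalize`, which is the memory-saving behavior the proposal describes.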
The main reason for implementing this would be to make more advanced numericalization straightforward and avoid user intervention after batching (as is required now).
Before your rework of the `Field` class, my idea was to define the interface for numericalization/batching, so I partially like this proposal. The existing `Vocab` fits nicely into the interface and can easily subclass it. IMO, it makes sense to make the `update` and `finalize` methods optional. That way, overriding the `numericalize` method alone is enough to define a simple numericalizer.
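To make the "optional `update` and `finalize`" idea concrete, here is a small sketch under the same assumptions as the proposal: the base class provides no-op defaults, so a stateless numericalizer overrides only `numericalize`, while a vocabulary-style one overrides all three. `HashingNumericalizer` and `SimpleVocab` are hypothetical names used for illustration and are not the existing `Vocab` class.

```python
import zlib
from collections import Counter


class Numericalizer:
    """Hypothetical base class with optional update/finalize hooks."""

    def update(self, tokens):
        pass  # no-op by default, so stateless subclasses can ignore it

    def finalize(self):
        pass  # no-op by default

    def numericalize(self, tokens):
        raise NotImplementedError


class HashingNumericalizer(Numericalizer):
    """Stateless example: only numericalize is overridden."""

    def __init__(self, num_buckets):
        self._num_buckets = num_buckets

    def numericalize(self, tokens):
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        return [zlib.crc32(t.encode("utf-8")) % self._num_buckets for t in tokens]


class SimpleVocab(Numericalizer):
    """Vocabulary-style example that uses all three methods."""

    def __init__(self, unk_token="<unk>"):
        self._counter = Counter()
        self._unk_token = unk_token
        self.itos = None
        self.stoi = None

    def update(self, tokens):
        self._counter.update(tokens)

    def finalize(self):
        # Index 0 is reserved for unknown tokens; the rest are ordered by frequency.
        self.itos = [self._unk_token] + [t for t, _ in self._counter.most_common()]
        self.stoi = {t: i for i, t in enumerate(self.itos)}

    def numericalize(self, tokens):
        return [self.stoi.get(t, 0) for t in tokens]


hasher = HashingNumericalizer(num_buckets=16)
hashed = hasher.numericalize(["hello", "world"])

vocab = SimpleVocab()
vocab.update(["a", "b", "a"])
vocab.finalize()
indices = vocab.numericalize(["a", "c", "b"])
```

Calling code can invoke `update` and `finalize` on any numericalizer without checking its type, since the defaults do nothing.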