
Support combiner packing for tft.vocabulary #259

Open
cyc opened this issue Jan 21, 2022 · 3 comments

Comments


cyc commented Jan 21, 2022

I'm sure this is likely on the roadmap already, but for data transformations that require many simultaneous vocabulary computations, it would be very beneficial if tft.vocabulary supported combiner packing the same way tft.experimental.approximate_vocabulary does.

This may be a duplicate of #180 (comment), but I think it is worth re-raising given that this has been implemented for approximate_vocabulary.
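
For concreteness, a minimal sketch of the kind of preprocessing_fn this is about (the feature names and feature count are hypothetical):

```python
import tensorflow_transform as tft

# Hypothetical: many categorical features, each needing its own vocabulary.
CATEGORICAL_KEYS = ['cat_%d' % i for i in range(100)]

def preprocessing_fn(inputs):
    outputs = {}
    for key in CATEGORICAL_KEYS:
        # Each call creates a separate vocabulary analyzer. Without combiner
        # packing, each analyzer is planned as its own set of transforms in
        # the analysis graph, which is where the overhead of many
        # simultaneous vocabulary computations comes from.
        vocab_path = tft.vocabulary(inputs[key], vocab_filename=key)
        outputs[key + '_id'] = tft.apply_vocabulary(inputs[key], vocab_path)
    return outputs
```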

zoyahav (Member) commented Jan 26, 2022

Unfortunately, due to implementation differences (i.e., vocabulary is not combiner based), applying packing to vocabulary without causing regressions is a much more complex task than it was for approximate_vocabulary.
@iindyk has looked into this option in the past, he can comment further about feasibility here.

@cyc, could you please describe what it is you're hoping to accomplish for your pipeline through vocabulary analyzer packing?

cyc (Author) commented Jan 27, 2022

@zoyahav the application is similar to what is described in #260, and the issue is essentially the same as the previous one, #180: having large numbers of unpacked analyzers has some negative effects on performance. I noticed that the analyzer implementation differs between tft.vocabulary and tft.experimental.approximate_vocabulary, so I was wondering whether the implementation of tft.vocabulary would eventually change in the same way.

Feel free to close this as "wontfix" if this isn't on the roadmap. I was more just inquiring about whether this has been considered or not.

iindyk (Contributor) commented Mar 7, 2022

The main reason tft.experimental.approximate_vocabulary can be packed is that it has a limit on the number of unique tokens in the resulting vocabulary (top_k is required), which can be pre-allocated; if the actual number of unique input values is larger than top_k, approximation happens, allowing it to still keep only the top_k most frequent elements.
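
A hedged sketch of that contrast (feature names and the top_k value are assumptions for illustration):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Bounded state: top_k is required, so the combiner's state can be
    # pre-allocated, which is what makes packing many of these analyzers
    # together feasible.
    approx_vocab = tft.experimental.approximate_vocabulary(
        inputs['feature_a'], top_k=10_000, vocab_filename='feature_a_vocab')

    # Unbounded state: no assumptions about the number of unique values, and
    # the result must be exact, so a different (non-combiner) implementation
    # is used and the same packing cannot be applied.
    exact_vocab = tft.vocabulary(
        inputs['feature_b'], vocab_filename='feature_b_vocab')

    return {
        'feature_a_id': tft.apply_vocabulary(inputs['feature_a'], approx_vocab),
        'feature_b_id': tft.apply_vocabulary(inputs['feature_b'], exact_vocab),
    }
```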

The reason we can't use this logic in tft.vocabulary is that we can't make any assumptions about the number of unique input values, and we need exact computation. It also needs to work for very large vocabularies (O(10^8) tokens), for which the pre-allocation approach is suboptimal.

I looked into some other ways of packing the vocabulary computation (different from what is done for approximate_vocabulary), but they did not provide an improvement, particularly with large vocabularies.

Also worth noting that if top_k in approximate_vocabulary is >= the actual number of unique input values, then the computation will be exact and coincide with the result of tft.vocabulary, which may allow some users to switch to it without accuracy degradation.
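
Under that assumption, the switch might look like this sketch (the feature name and the top_k bound are illustrative values, not from this thread):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Before: exact, but the analyzer cannot be combiner-packed.
    # vocab = tft.vocabulary(inputs['feature'], vocab_filename='feature_vocab')

    # After: still exact, provided the true number of unique values never
    # exceeds top_k (1_000_000 here is an assumed upper bound).
    vocab = tft.experimental.approximate_vocabulary(
        inputs['feature'], top_k=1_000_000, vocab_filename='feature_vocab')
    return {'feature_id': tft.apply_vocabulary(inputs['feature'], vocab)}
```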

Sorry for the late response, I just came across this.
