-
Notifications
You must be signed in to change notification settings - Fork 816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Derive Clone
on Tokenizer
, add Encoding.into_tokens()
method
#1381
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me, but I am probably not the best person to review. Adding other reviewers.
In particular for me, I'm not sure where Clone is being called, so I can't picture offhand how it's being used. Any chance this is missing another file or two that has Clone in it?
There isn't any use of let tokenizer = Tokenizer.from_pretrained(name, None)?;
for path in paths {
let tokenizer = tokenizer.clone();
thread_pool::execute(move || {
// tokenize file at `path` ...
})
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi sorry for the delay.
I don't think we need into_tokens()
.
get_tokens().to_vec()
should be preferred IMHO.
If you're not using the Encoding
further then the compiler should be able to infer the move semantics (To be proven I admit).
The problem with into_
is that actually into_ids()
would also be very valid (it's actually much superior to tokens, tokens being really poor informationwise compared to ids).
I'll go ahead and accept the Clone
thing asap. There's indeed no need to not have it clone.
For the into_
I think we not include this particular one. Maybe a single method that split all the pieces if needed but I think the get_...to_vec()
is a bit more natural IMO).
To motivate the split_into()
thing, I think it would be nice to have a simple example showcasing how it's superior because of performance (or code simplicity).
Hey team,
For the What's In My Big Data? (WIMBD) project I found that we needed a couple additional features in the Rust library here. Currently we're depending on my fork of tokenizers but I was hoping we could get this merged and released so I can publish our
wimbd
crate.The changes proposed are:
Clone
trait for theTokenizer
struct. This proved super useful so we don't need to re-instantiate the same tokenizer over and over when we run worker jobs in threads..into_tokens() -> Vec<String>
method on theEncoding
class so we can easily get an ownedVec
of the raw tokens without having to create a new one.