Added ability to inspect a 'Sequence' pre-tokenizer. #1341
Conversation
The documentation is not available anymore as the PR was closed or merged.
Relatedly, what is the release policy for this project? Is there a possibility of making a patch release after this PR is merged (e.g., …)?
```rust
    &self.pre_tokenizers
}

pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
```
These two new functions just mimic the interface you already have for the normalizer Sequence.

cc @ArthurZucker @Narsil, as I'm not sure what your process is for PR reviews and notifications.
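To illustrate the accessor pattern being added, here is a minimal, self-contained sketch. The `PreTokenizerWrapper` variants below are placeholders for illustration only; the real `tokenizers` crate defines its own wrapper enum.

```rust
// Placeholder standing in for `tokenizers::pre_tokenizers::PreTokenizerWrapper`.
#[derive(Debug, Clone, PartialEq)]
enum PreTokenizerWrapper {
    Whitespace,
    Metaspace { add_prefix_space: bool },
}

struct Sequence {
    pretokenizers: Vec<PreTokenizerWrapper>,
}

impl Sequence {
    // Immutable view of the chain, mirroring the normalizer `Sequence` API.
    fn get_pre_tokenizers(&self) -> &[PreTokenizerWrapper] {
        &self.pretokenizers
    }

    // Mutable view, allowing in-place edits of individual entries.
    fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
        &mut self.pretokenizers
    }
}

fn main() {
    let mut seq = Sequence {
        pretokenizers: vec![
            PreTokenizerWrapper::Whitespace,
            PreTokenizerWrapper::Metaspace { add_prefix_space: true },
        ],
    };
    // Inspect the chain, then mutate one entry through the mutable slice.
    assert_eq!(seq.get_pre_tokenizers().len(), 2);
    if let PreTokenizerWrapper::Metaspace { add_prefix_space } =
        &mut seq.get_pre_tokenizers_mut()[1]
    {
        *add_prefix_space = false;
    }
    assert_eq!(
        seq.get_pre_tokenizers()[1],
        PreTokenizerWrapper::Metaspace { add_prefix_space: false }
    );
    println!("ok");
}
```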
Hey, this is a feature addition, so it will not be in a …
Apart from the renaming, which does not seem necessary, I have no issue with this.
```diff
@@ -6,19 +6,27 @@ use serde::{Deserialize, Serialize};
 #[derive(Clone, Debug, PartialEq)]
 #[macro_rules_attribute(impl_serde_type!)]
 pub struct Sequence {
-    pretokenizers: Vec<PreTokenizerWrapper>,
+    pre_tokenizers: Vec<PreTokenizerWrapper>,
```
This seems like a breaking change to me.
Yeah, that's a good point. I did it as a fly-by to match PreTokenizer, but you're right. I'll revert that one.
Though, given that this field is private, I don't think it's a breaking change.
I reverted it in any case.
Sounds good, thanks @ArthurZucker!
Does this enable changing anything from Python itself? I don't think you can give mutable references across FFI safely, so I'm not sure how this works.
@Narsil my use case is from a Rust library. I'm not using the Python API.
To describe what I need a bit more: basically, I take a tokenizer that's already been built (e.g., for an existing model) and apply a transformation to it to remove the prefix space options (either via the …).
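The transformation described above can be sketched as follows. This is a hypothetical, self-contained illustration: the `PreTokenizerWrapper` variants and the `strip_prefix_space` helper are placeholders, not the real `tokenizers` crate types, which differ in structure.

```rust
// Placeholder variants; the real wrapper enum in `tokenizers` is larger
// and its prefix-space options live inside dedicated structs.
#[derive(Debug, PartialEq)]
enum PreTokenizerWrapper {
    ByteLevel { add_prefix_space: bool },
    Metaspace { add_prefix_space: bool },
    Whitespace,
}

// Walk an already-built pre-tokenizer chain (obtained, e.g., through a
// mutable accessor like the one this PR adds) and disable every
// prefix-space option in place.
fn strip_prefix_space(chain: &mut [PreTokenizerWrapper]) {
    for pt in chain.iter_mut() {
        match pt {
            PreTokenizerWrapper::ByteLevel { add_prefix_space }
            | PreTokenizerWrapper::Metaspace { add_prefix_space } => {
                *add_prefix_space = false;
            }
            PreTokenizerWrapper::Whitespace => {}
        }
    }
}

fn main() {
    let mut chain = vec![
        PreTokenizerWrapper::ByteLevel { add_prefix_space: true },
        PreTokenizerWrapper::Whitespace,
    ];
    strip_prefix_space(&mut chain);
    assert_eq!(
        chain[0],
        PreTokenizerWrapper::ByteLevel { add_prefix_space: false }
    );
    println!("prefix space disabled");
}
```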
I see! It's obvious now. I'm not a huge fan of the …
LGTM
Yeah, likewise. I did it this way to keep things uniform with what was already there.
Would it be possible to make a release right after this PR is merged so I can use it directly? Or, if not, when is the next release scheduled?
We don't have a strictly scheduled release plan. Couldn't you use the git branch directly in the meantime?
Oh nice, I didn't even know Cargo supported that. Very nice, thanks!
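For reference, a git dependency in `Cargo.toml` looks roughly like this (the `branch` value is hypothetical; Cargo locates the `tokenizers` package inside the repository's workspace by name):

```toml
[dependencies]
# Pull the crate straight from the repository instead of crates.io.
tokenizers = { git = "https://github.com/huggingface/tokenizers", branch = "main" }
```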
I ran into this while attempting to create a modified copy of a Tokenizer instance after it has been created.