Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ability to inspect a 'Sequence' pre-tokenizer. #1341

Merged
merged 6 commits into from
Sep 21, 2023

Conversation

eaplatanios
Copy link
Contributor

I ran into this while attempting to create a modified copy of a Tokenizer instance after it has been created.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Sep 19, 2023

The documentation is not available anymore as the PR was closed or merged.

@eaplatanios
Copy link
Contributor Author

Relatedly, what is the release policy for this project? Is there a possibility of making a patch release after this PR is merged (e.g., 0.14.1)?

&self.pre_tokenizers
}

pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
Copy link
Contributor Author

@eaplatanios eaplatanios Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two new functions are just mimicking the interface you already have for the normalizer Sequence.

@eaplatanios
Copy link
Contributor Author

cc @ArthurZucker @Narsil as I'm not sure what your process is for PR reviews and notifications.

@ArthurZucker
Copy link
Collaborator

Hey, this is a feature addition so will not be in a Patch release but rather a minor release.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appart from the renaming that does not seem necessary, have no issue with this

@@ -6,19 +6,27 @@ use serde::{Deserialize, Serialize};
#[derive(Clone, Debug, PartialEq)]
#[macro_rules_attribute(impl_serde_type!)]
pub struct Sequence {
pretokenizers: Vec<PreTokenizerWrapper>,
pre_tokenizers: Vec<PreTokenizerWrapper>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a breaking change to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point. I did it as a fly-by to match PreTokenizer but you're right. I'll revert that one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though given this field is private I don't think it's a breaking change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted it in either case.

@eaplatanios
Copy link
Contributor Author

Hey, this is a feature addition so will not be in a Patch release but rather a minor release.

Sounds good, thanks @ArthurZucker!

@Narsil
Copy link
Collaborator

Narsil commented Sep 20, 2023

Does this enable changing anything from python itself?

I don't think you can give mutable references across FFI safely, so I'm not sure how this works.
Do you have a simple script to understand what you're trying to do?

@eaplatanios
Copy link
Contributor Author

@Narsil my use case is from a Rust library. I'm not using the Python API.

@eaplatanios
Copy link
Contributor Author

To describe what I need a bit more: basically I take a tokenizer that's already been built (e.g., for an existing model) and perform a transformation to it to remove the prefix space options (either via the Prepend normalizer which is fine because it has a public API for that, or via the ByteLevel pre-tokenizer which I can't access if held within a Sequence right now).

@Narsil
Copy link
Collaborator

Narsil commented Sep 20, 2023

I see! It's obvious now.

I'm not a huge fan of the get_ prefixes of such methods in libs. But if there's précédent, uniformity is better!

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@eaplatanios
Copy link
Contributor Author

Yeah likewise. I did it this way to keep things uniform with what was there already.

@eaplatanios
Copy link
Contributor Author

eaplatanios commented Sep 20, 2023

Would it be possible to make a release right after this PR is merged so I can directly use it? Or, if not, when is the next release scheduled for?

@Narsil
Copy link
Collaborator

Narsil commented Sep 21, 2023

We don't have a super scheduled release plan.

Couldn´t you use directly the git branch in the meantime ?

@Narsil Narsil merged commit 18bd5e8 into huggingface:main Sep 21, 2023
12 checks passed
@eaplatanios
Copy link
Contributor Author

Oh nice I didn't even know Cargo supported that. Very nice, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants