Added ability to inspect a 'Sequence' pre-tokenizer. #1341

eaplatanios · 2023-09-19T00:35:36Z

I ran into this while attempting to create a modified copy of a Tokenizer instance after it has been created.

HuggingFaceDocBuilderDev · 2023-09-19T11:22:34Z

The documentation is not available anymore as the PR was closed or merged.

eaplatanios · 2023-09-19T18:18:09Z

Relatedly, what is the release policy for this project? Is there a possibility of making a patch release after this PR is merged (e.g., 0.14.1)?

eaplatanios · 2023-09-19T20:41:26Z

tokenizers/src/pre_tokenizers/sequence.rs

+        &self.pre_tokenizers
+    }
+
+    pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {


These two new functions are just mimicking the interface you already have for the normalizer Sequence.
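For readers without the full diff, here is a minimal sketch of the accessor pair being discussed. It is not the library's source: `PreTokenizerWrapper` is replaced by a two-variant stand-in, and the name of the immutable accessor (`get_pre_tokenizers`) is assumed from the `_mut` signature shown in the diff above.

```rust
// Stand-in for the library's `PreTokenizerWrapper`; these variants are
// illustrative only and do not mirror the real enum.
#[derive(Clone, Debug, PartialEq)]
pub enum PreTokenizerWrapper {
    Whitespace,
    ByteLevel { add_prefix_space: bool },
}

#[derive(Clone, Debug, PartialEq)]
pub struct Sequence {
    // Private field: callers can only reach it through the accessors below.
    pretokenizers: Vec<PreTokenizerWrapper>,
}

impl Sequence {
    pub fn new(pretokenizers: Vec<PreTokenizerWrapper>) -> Self {
        Self { pretokenizers }
    }

    /// Read-only view of the wrapped pre-tokenizers (name assumed).
    pub fn get_pre_tokenizers(&self) -> &[PreTokenizerWrapper] {
        &self.pretokenizers
    }

    /// Mutable view: elements can be changed in place, but a slice does not
    /// allow entries to be added or removed.
    pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
        &mut self.pretokenizers
    }
}
```

Returning `&mut [PreTokenizerWrapper]` rather than `&mut Vec<PreTokenizerWrapper>` keeps the length of the sequence fixed while still allowing individual steps to be edited.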

eaplatanios · 2023-09-20T00:54:15Z

cc @ArthurZucker @Narsil as I'm not sure what your process is for PR reviews and notifications.

ArthurZucker · 2023-09-20T16:10:58Z

Hey, this is a feature addition so will not be in a Patch release but rather a minor release.

ArthurZucker

Apart from the renaming, which does not seem necessary, I have no issue with this.

ArthurZucker · 2023-09-20T16:11:25Z

tokenizers/src/pre_tokenizers/sequence.rs

@@ -6,19 +6,27 @@ use serde::{Deserialize, Serialize};
 #[derive(Clone, Debug, PartialEq)]
 #[macro_rules_attribute(impl_serde_type!)]
 pub struct Sequence {
-    pretokenizers: Vec<PreTokenizerWrapper>,
+    pre_tokenizers: Vec<PreTokenizerWrapper>,


This seems like a breaking change to me.

Yeah that's a good point. I did it as a fly-by to match PreTokenizer but you're right. I'll revert that one.

Though given this field is private I don't think it's a breaking change.

I reverted it in either case.

eaplatanios · 2023-09-20T16:33:43Z

> Hey, this is a feature addition so will not be in a Patch release but rather a minor release.

Sounds good, thanks @ArthurZucker!

Narsil · 2023-09-20T16:36:16Z

Does this enable changing anything from Python itself?

I don't think you can give mutable references across FFI safely, so I'm not sure how this works.
Do you have a simple script to understand what you're trying to do?

eaplatanios · 2023-09-20T16:56:07Z

@Narsil my use case is from a Rust library. I'm not using the Python API.

eaplatanios · 2023-09-20T16:57:22Z

To describe what I need in a bit more detail: I take a tokenizer that has already been built (e.g., for an existing model) and transform it to remove the prefix space options. These are set either via the Prepend normalizer, which is fine because it has a public API for that, or via the ByteLevel pre-tokenizer, which I currently can't access when it is held within a Sequence.
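A rough sketch of that transformation, assuming the mutable accessor from this PR. Everything here is a simplified stand-in rather than the library's real definitions: `ByteLevel` is reduced to a single public `add_prefix_space` field, `PreTokenizerWrapper` to two variants, and `disable_prefix_space` is a hypothetical helper name.

```rust
// Hypothetical stand-ins for the library types, reduced to what this
// sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct ByteLevel {
    pub add_prefix_space: bool,
}

#[derive(Clone, Debug, PartialEq)]
pub enum PreTokenizerWrapper {
    ByteLevel(ByteLevel),
    Whitespace,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Sequence {
    pretokenizers: Vec<PreTokenizerWrapper>,
}

impl Sequence {
    pub fn new(pretokenizers: Vec<PreTokenizerWrapper>) -> Self {
        Self { pretokenizers }
    }

    pub fn get_pre_tokenizers(&self) -> &[PreTokenizerWrapper] {
        &self.pretokenizers
    }

    pub fn get_pre_tokenizers_mut(&mut self) -> &mut [PreTokenizerWrapper] {
        &mut self.pretokenizers
    }
}

/// Turns off `add_prefix_space` on every ByteLevel step in the sequence and
/// returns how many steps were changed.
pub fn disable_prefix_space(seq: &mut Sequence) -> usize {
    let mut changed = 0;
    for pre_tokenizer in seq.get_pre_tokenizers_mut() {
        // Only ByteLevel steps carry the option; other steps are left as-is.
        if let PreTokenizerWrapper::ByteLevel(byte_level) = pre_tokenizer {
            if byte_level.add_prefix_space {
                byte_level.add_prefix_space = false;
                changed += 1;
            }
        }
    }
    changed
}
```

Without a mutable accessor, the same change would require rebuilding the whole `Sequence` from scratch, which is not possible from outside the crate when the field is private and there is no way to read its contents.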

Narsil · 2023-09-20T17:32:42Z

I see! It's obvious now.

I'm not a huge fan of the get_ prefix on such methods in libs. But if there's precedent, uniformity is better!

Narsil

LGTM

eaplatanios · 2023-09-20T17:43:14Z

Yeah likewise. I did it this way to keep things uniform with what was there already.

eaplatanios · 2023-09-20T21:04:24Z

Would it be possible to make a release right after this PR is merged so I can directly use it? Or, if not, when is the next release scheduled for?

Narsil · 2023-09-21T06:10:12Z

We don't have a super scheduled release plan.

Couldn't you use the git branch directly in the meantime?

eaplatanios · 2023-09-21T17:49:02Z

Oh nice I didn't even know Cargo supported that. Very nice, thanks!

eaplatanios added 3 commits September 18, 2023 17:33

Added ability to inspect a 'Sequence' pre-tokenizer.

07b6ab3

Added ability to inspect a 'Sequence' pre-tokenizer.

f4ea1ed

Added ability to inspect a 'Sequence' pre-tokenizer.

e492f9d

Linting error.

a9ce58f

eaplatanios commented Sep 19, 2023

Fix.

28c18dc

ArthurZucker reviewed Sep 20, 2023

Revert rename,

7b3f7e2

Narsil approved these changes Sep 20, 2023

Narsil merged commit 18bd5e8 into huggingface:main Sep 21, 2023
12 checks passed

eaplatanios mentioned this pull request Jan 22, 2024

Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. #1443

Merged
