Add display capabilities to tokenizers objects #1542

Closed
wants to merge 104 commits into main from add-display
Changes from all commits
104 commits
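This PR wires Python display support (`__str__`, `__repr__`, and in places `__dir__`/`__dict__`) into the bindings via the `pyo3_special_method_derive` crate, so tokenizer objects print a readable description of their components instead of a bare object address. A minimal sketch of the intended usage follows; the checkpoint name and the exact rendered strings are illustrative assumptions, not output taken from this PR:

```python
from tokenizers import Tokenizer

# Any tokenizer works here; "bert-base-uncased" is just an example identifier.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Before this change these calls fall back to the default object repr,
# e.g. "<tokenizers.Tokenizer object at 0x7f...>". With the Str/Repr derives
# added below they are expected to describe the nested components instead.
print(tokenizer)                   # __str__ from the `Str` derive
print(repr(tokenizer.normalizer))  # __repr__ from the `Repr` derive
print(tokenizer.pre_tokenizer)     # nested components shown by name
```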
61804d9
initial commit
ArthurZucker Jun 3, 2024
a56da5f
will this work?
ArthurZucker Jun 3, 2024
f1a6a97
make it work for the model for now
ArthurZucker Jun 3, 2024
4a49530
updates
ArthurZucker Jun 3, 2024
f4af616
update
ArthurZucker Jun 3, 2024
88630dc
add metaspace
ArthurZucker Jun 3, 2024
b9d44da
update
ArthurZucker Jun 3, 2024
a90ec22
does not work
ArthurZucker Jun 3, 2024
2224275
current modifications
ArthurZucker Jun 4, 2024
4d9204e
current status
ArthurZucker Jun 4, 2024
4c2aca1
working shit
ArthurZucker Jun 4, 2024
904ce70
this kinda works
ArthurZucker Jun 4, 2024
6413810
finallllly!
ArthurZucker Jun 4, 2024
fda66f5
nits
ArthurZucker Jun 4, 2024
20c9fc4
updates
ArthurZucker Jun 4, 2024
86c77b6
almost there
ArthurZucker Jun 4, 2024
a429642
update
ArthurZucker Jun 4, 2024
3cec010
more nits
ArthurZucker Jun 4, 2024
8d77286
nit
ArthurZucker Jun 4, 2024
e48cd3a
Update bindings/python/src/pre_tokenizers.rs
ArthurZucker Jun 6, 2024
27576e5
ips
ArthurZucker Jun 4, 2024
0d9a452
Merge branch 'add-display' of github.com:huggingface/tokenizers into …
ArthurZucker Jun 6, 2024
35373de
Merge branch 'add-display' of github.com:huggingface/tokenizers into …
ArthurZucker Jun 6, 2024
1c6d272
update
ArthurZucker Jun 8, 2024
df51116
update and fix
ArthurZucker Jun 8, 2024
4b4b833
Merge branch 'add-display' of github.com:huggingface/tokenizers into …
ArthurZucker Jun 8, 2024
59a89c9
only commit one line
ArthurZucker Jun 8, 2024
ac9b849
update
ArthurZucker Jun 8, 2024
a3f7439
update the added vocab string
ArthurZucker Jun 8, 2024
cf5b6f3
nit
ArthurZucker Jun 10, 2024
5d33243
fix sequence's display
ArthurZucker Jun 10, 2024
b73c43d
update display for normalizer sequence
ArthurZucker Jun 10, 2024
3e16df7
Merge branch 'main' of github.com:huggingface/tokenizers into add-dis…
ArthurZucker Jun 10, 2024
b214d77
style
ArthurZucker Jun 10, 2024
0654831
small nit
ArthurZucker Jun 10, 2024
a15e3cc
updates to cleanup
ArthurZucker Jun 10, 2024
6023192
update
ArthurZucker Jun 10, 2024
ebf1258
update
ArthurZucker Jun 10, 2024
477a9b5
nits
ArthurZucker Jun 10, 2024
93a1e63
fix some stuff
ArthurZucker Jun 10, 2024
7591f2b
update sequence for pre_tokenizers using fold
ArthurZucker Jun 10, 2024
f50e4e0
update
ArthurZucker Jun 10, 2024
4f15052
proper padding derive
ArthurZucker Jun 10, 2024
85c7b69
update trunctation for consistency
ArthurZucker Jun 10, 2024
0a16ca0
clean
ArthurZucker Jun 10, 2024
35d442d
styling
ArthurZucker Jun 10, 2024
a3cc764
update added tokens decoder as getter
ArthurZucker Jun 10, 2024
5b20fa7
update init property
ArthurZucker Jun 10, 2024
15f877e
nit
ArthurZucker Jun 10, 2024
9c45e8f
update sequences and basic enums to show xxxx.Sequence
ArthurZucker Jun 10, 2024
4a34870
update
ArthurZucker Jun 10, 2024
e0d35e0
update
ArthurZucker Jun 10, 2024
fe95add
some finishing touch
ArthurZucker Jun 10, 2024
2770099
Update bindings/python/Cargo.toml
ArthurZucker Jun 10, 2024
3d0eb0a
nit
ArthurZucker Jun 10, 2024
11a3601
gracefully handle errors for the proc macro
ArthurZucker Jun 11, 2024
f6fa136
Merge branch 'add-display' of github.com:huggingface/tokenizers into …
ArthurZucker Jun 11, 2024
2a54482
remove derive_more
ArthurZucker Jun 11, 2024
998b2a3
update my custom macro
ArthurZucker Jun 11, 2024
4df6cc2
replace derive more
ArthurZucker Jun 11, 2024
a9c6c61
Merge branch 'main' into add-display
ArthurZucker Jun 11, 2024
aefdc91
stash
ArthurZucker Jun 11, 2024
f67af9c
updates
ArthurZucker Jun 12, 2024
4c3f37a
update display derive
ArthurZucker Jun 12, 2024
292475f
blindly fix stuff
ArthurZucker Jun 12, 2024
99cb054
maybe work
ArthurZucker Jun 12, 2024
5c930e9
remove tests from vendored parsing
ArthurZucker Jun 12, 2024
f87bb97
update
ArthurZucker Jun 12, 2024
c4b4f3c
simplify some stuff
ArthurZucker Jun 12, 2024
e712079
current status, not bad but not soooooo good
ArthurZucker Jun 13, 2024
5540136
is this a good start?
ArthurZucker Jun 14, 2024
ba03c16
small changes
ArthurZucker Jun 14, 2024
d0e741b
format does not work yet
ArthurZucker Jun 14, 2024
19afb66
some cleanup of unnecessary things
ArthurZucker Jun 16, 2024
9559dea
nit
ArthurZucker Jun 16, 2024
e53f4ca
current status
ArthurZucker Jun 16, 2024
18238dd
let's just go with this
ArthurZucker Jun 16, 2024
269ff21
update
ArthurZucker Jun 17, 2024
e799602
Merge branch 'main' into add-display
ArthurZucker Jul 15, 2024
93ad593
update
ArthurZucker Jul 15, 2024
3aa0138
derive auto display
ArthurZucker Jul 19, 2024
011340b
nit
ArthurZucker Jul 19, 2024
3fc31d0
nice
ArthurZucker Jul 19, 2024
51d3f61
updates
ArthurZucker Jul 19, 2024
acb8196
Merge branch 'main' into add-display
ArthurZucker Jul 19, 2024
951b6e6
deos
ArthurZucker Jul 19, 2024
c2a320c
Merge branch 'add-display' of github.com:huggingface/tokenizers into …
ArthurZucker Jul 19, 2024
2048c02
fix build
ArthurZucker Jul 19, 2024
104fe0c
Use pyo3 smd v0.21 (#1574)
EricLBuehler Jul 20, 2024
7db6109
stash commit, wanna make sure this is recorded
ArthurZucker Jul 21, 2024
c7cd927
what works a bit ?
ArthurZucker Jul 25, 2024
e4cf65a
update
ArthurZucker Jul 25, 2024
39ffc28
fix tokenizer's wrapping
ArthurZucker Jul 27, 2024
0a3bb18
fix normalizer display
ArthurZucker Jul 27, 2024
c436b23
fix!
ArthurZucker Jul 27, 2024
e5b059f
final touch?
ArthurZucker Jul 27, 2024
ff825a7
full autodebug
ArthurZucker Jul 28, 2024
c30df0c
remove dict and dir as it's gonna be a bit more involved
ArthurZucker Jul 28, 2024
b78e11c
remove pub where it is not necessary
ArthurZucker Jul 30, 2024
a99c645
fmt =
ArthurZucker Jul 30, 2024
9022470
formating
ArthurZucker Aug 2, 2024
64b8df0
remove non needed fm
ArthurZucker Aug 2, 2024
27cad45
so we only need format when the visibility is not pub but pub(crate)
ArthurZucker Aug 2, 2024
ceabef3
Merge branch 'main' into add-display
ArthurZucker Aug 2, 2024
4 changes: 3 additions & 1 deletion bindings/python/Cargo.toml
@@ -14,10 +14,12 @@ serde = { version = "1.0", features = [ "rc", "derive" ]}
serde_json = "1.0"
libc = "0.2"
env_logger = "0.11"
pyo3 = { version = "0.21" }
numpy = "0.21"
ndarray = "0.15"
itertools = "0.12"
derive_more = "0.99.17"
pyo3 = { version = "0.21", features = ["multiple-pymethods"] }
pyo3_special_method_derive_0_21 = {path = "../../../pyo3-special-method-derive/pyo3_special_method_derive_0_21"}

[dependencies.tokenizers]
path = "../../tokenizers"
Empty file added bindings/python/grep
Empty file.
13 changes: 9 additions & 4 deletions bindings/python/src/decoders.rs
@@ -5,6 +5,7 @@ use crate::utils::PyPattern;
use pyo3::exceptions;
use pyo3::prelude::*;
use pyo3::types::*;
use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay, Repr, Str};
use serde::de::Error;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use tk::decoders::bpe::BPEDecoder;
@@ -28,9 +29,11 @@ use super::error::ToPyResult;
/// This class is not supposed to be instantiated directly. Instead, any implementation of
/// a Decoder will return an instance of this class when instantiated.
#[pyclass(dict, module = "tokenizers.decoders", name = "Decoder", subclass)]
#[derive(Clone, Deserialize, Serialize)]
#[derive(Clone, Deserialize, Serialize, Str, Repr)]
#[format(fmt = "{}")]
pub struct PyDecoder {
#[serde(flatten)]
#[format]
pub(crate) decoder: PyDecoderWrapper,
Comment on lines +36 to 37 (Collaborator Author): visibility here forces us to add format

}

@@ -478,9 +481,10 @@ impl PySequenceDecoder {
}
}

#[derive(Clone)]
#[derive(Clone, AutoDisplay, AutoDebug)]
pub(crate) struct CustomDecoder {
inner: PyObject,
#[format(skip)]
pub inner: PyObject,
Comment on lines +486 to +487 (Collaborator Author): Not implemented yet so skipping for now

}

impl CustomDecoder {
@@ -531,8 +535,9 @@ impl<'de> Deserialize<'de> for CustomDecoder {
}
}

#[derive(Clone, Deserialize, Serialize)]
#[derive(Clone, Deserialize, Serialize, AutoDisplay, AutoDebug)]
#[serde(untagged)]
#[format(fmt = "{}")]
pub(crate) enum PyDecoderWrapper {
Custom(Arc<RwLock<CustomDecoder>>),
Wrapped(Arc<RwLock<DecoderWrapper>>),
Comment on lines +540 to 543 (Collaborator Author): this will directly display Arc<RwLock<CustomDecoder>>

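Per the comments above, the Python-implemented CustomDecoder is excluded from the derived output with `#[format(skip)]` for now, and the wrapper enum displays its `Arc<RwLock<...>>` contents directly. For built-in decoders the new `Str`/`Repr` derives should already produce something readable; a small sketch, where the printed wording is an assumption:

```python
from tokenizers import decoders

# Sequence composes several built-in decoders.
dec = decoders.Sequence([decoders.ByteFallback(), decoders.Fuse()])

# With the derives added in decoders.rs these are expected to list the wrapped
# decoder variants rather than "<tokenizers.decoders.Sequence object at 0x...>".
print(str(dec))
print(repr(dec))
```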
7 changes: 4 additions & 3 deletions bindings/python/src/models.rs
@@ -2,11 +2,13 @@ use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};

use super::error::{deprecation_warning, ToPyResult};
use crate::token::PyToken;
use crate::trainers::PyTrainer;
use pyo3::exceptions;
use pyo3::prelude::*;
use pyo3::types::*;
use pyo3_special_method_derive_0_21::{Repr, Str};
use serde::{Deserialize, Serialize};
use tk::models::bpe::{BpeBuilder, Merges, Vocab, BPE};
use tk::models::unigram::Unigram;
@@ -16,16 +18,15 @@ use tk::models::ModelWrapper;
use tk::{Model, Token};
use tokenizers as tk;

use super::error::{deprecation_warning, ToPyResult};

/// Base class for all models
///
/// The model represents the actual tokenization algorithm. This is the part that
/// will contain and manage the learned vocabulary.
///
/// This class cannot be constructed directly. Please use one of the concrete models.
#[pyclass(module = "tokenizers.models", name = "Model", subclass)]
#[derive(Clone, Serialize, Deserialize)]
#[derive(Clone, Serialize, Deserialize, Str, Repr)]
#[format(fmt = "{}")]
pub struct PyModel {
#[serde(flatten)]
pub model: Arc<RwLock<ModelWrapper>>,
22 changes: 15 additions & 7 deletions bindings/python/src/normalizers.rs
@@ -1,11 +1,11 @@
use std::sync::{Arc, RwLock};

use crate::error::ToPyResult;
use crate::utils::{PyNormalizedString, PyNormalizedStringRefMut, PyPattern};
use pyo3::exceptions;
use pyo3::prelude::*;
use pyo3::types::*;

use crate::error::ToPyResult;
use crate::utils::{PyNormalizedString, PyNormalizedStringRefMut, PyPattern};
use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay, Dict, Dir, Repr, Str};
use serde::ser::SerializeStruct;
use serde::{Deserialize, Deserializer, Serialize, Serializer};
use tk::normalizers::{
@@ -43,9 +43,11 @@ impl PyNormalizedStringMut<'_> {
/// This class is not supposed to be instantiated directly. Instead, any implementation of a
/// Normalizer will return an instance of this class when instantiated.
#[pyclass(dict, module = "tokenizers.normalizers", name = "Normalizer", subclass)]
#[derive(Clone, Serialize, Deserialize)]
#[derive(Clone, Serialize, Deserialize, Str, Repr, Dir)]
#[format(fmt = "{}")]
pub struct PyNormalizer {
#[serde(flatten)]
#[format]
pub(crate) normalizer: PyNormalizerTypeWrapper,
}

@@ -477,7 +479,10 @@ impl PyNmt {
/// Precompiled normalizer
/// Don't use manually it is used for compatiblity for SentencePiece.
#[pyclass(extends=PyNormalizer, module = "tokenizers.normalizers", name = "Precompiled")]
#[derive(Str)]
#[format(fmt = "PreCompiled")]
pub struct PyPrecompiled {}

#[pymethods]
impl PyPrecompiled {
#[new]
@@ -513,8 +518,9 @@ impl PyReplace {
}
}

#[derive(Debug, Clone)]
#[derive(AutoDebug, Clone, AutoDisplay)]
pub(crate) struct CustomNormalizer {
#[format(fmt = "Custom Normalizer")]
inner: PyObject,
}
impl CustomNormalizer {
@@ -556,8 +562,9 @@ impl<'de> Deserialize<'de> for CustomNormalizer {
}
}

#[derive(Debug, Clone, Deserialize)]
#[derive(AutoDebug, Clone, Deserialize, AutoDisplay)]
#[serde(untagged)]
#[format(fmt = "{}")]
pub(crate) enum PyNormalizerWrapper {
Custom(CustomNormalizer),
Wrapped(NormalizerWrapper),
@@ -575,8 +582,9 @@ impl Serialize for PyNormalizerWrapper {
}
}

#[derive(Debug, Clone, Deserialize)]
#[derive(Clone, Deserialize, AutoDisplay, AutoDebug)]
#[serde(untagged)]
#[format(fmt = "{}")]
pub(crate) enum PyNormalizerTypeWrapper {
Sequence(Vec<Arc<RwLock<PyNormalizerWrapper>>>),
Single(Arc<RwLock<PyNormalizerWrapper>>),
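Several commits above ("fix sequence's display", "update display for normalizer sequence") target how a composed normalizer renders its members through `PyNormalizerTypeWrapper::Sequence`. A short sketch of the behaviour being aimed for; the exact layout of the output is an assumption:

```python
from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.NFC(), normalizers.Lowercase()])

# Expected to show the Sequence together with its member normalizers
# (NFC, Lowercase) instead of an opaque object repr.
print(norm)
print(repr(norm))
```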
18 changes: 12 additions & 6 deletions bindings/python/src/pre_tokenizers.rs
@@ -23,7 +23,7 @@ use tokenizers as tk;

use super::error::ToPyResult;
use super::utils::*;

use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay, Dict, Dir, Repr, Str};
/// Base class for all pre-tokenizers
///
/// This class is not supposed to be instantiated directly. Instead, any implementation of a
@@ -34,10 +34,12 @@ use super::utils::*;
name = "PreTokenizer",
subclass
)]
#[derive(Clone, Serialize, Deserialize)]
#[derive(Clone, Serialize, Deserialize, Str, Repr, Dir, Dict)]
#[format(fmt = "{}")] // don't format the Py wrapper
pub struct PyPreTokenizer {
#[serde(flatten)]
pub(crate) pretok: PyPreTokenizerTypeWrapper,
#[format]
pretok: PyPreTokenizerTypeWrapper,
}

impl PyPreTokenizer {
@@ -425,6 +427,8 @@ impl PyPunctuation {

/// This pre-tokenizer composes other pre_tokenizers and applies them in sequence
#[pyclass(extends=PyPreTokenizer, module = "tokenizers.pre_tokenizers", name = "Sequence")]
#[derive(AutoDisplay)]
#[format(fmt = "Sequence.{}")]
pub struct PySequence {}
#[pymethods]
impl PySequence {
@@ -587,7 +591,7 @@ impl PyUnicodeScripts {
}
}

#[derive(Clone)]
#[derive(Clone, AutoDisplay, AutoDebug)]
pub(crate) struct CustomPreTokenizer {
inner: PyObject,
}
@@ -631,8 +635,9 @@ impl<'de> Deserialize<'de> for CustomPreTokenizer {
}
}

#[derive(Clone, Deserialize)]
#[derive(Clone, Deserialize, AutoDisplay, AutoDebug)]
#[serde(untagged)]
#[format(fmt = "{}")]
pub(crate) enum PyPreTokenizerWrapper {
Custom(CustomPreTokenizer),
Wrapped(PreTokenizerWrapper),
@@ -650,8 +655,9 @@ impl Serialize for PyPreTokenizerWrapper {
}
}

#[derive(Clone, Deserialize)]
#[derive(Clone, Deserialize, AutoDisplay, AutoDebug)]
#[serde(untagged)]
#[format(fmt = "{}")]
pub(crate) enum PyPreTokenizerTypeWrapper {
Sequence(Vec<Arc<RwLock<PyPreTokenizerWrapper>>>),
Single(Arc<RwLock<PyPreTokenizerWrapper>>),
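The `Sequence` pre-tokenizer above composes other pre-tokenizers and applies them in order, and its `#[format(fmt = "Sequence.{}")]` attribute suggests it will render with a `Sequence.` prefix. A sketch of such a composition; the printed form is an assumption, while the `pre_tokenize_str` result is ordinary bindings behaviour:

```python
from tokenizers import pre_tokenizers

pt = pre_tokenizers.Sequence(
    [pre_tokenizers.Whitespace(), pre_tokenizers.Digits(individual_digits=True)]
)

print(pt)  # expected to start with "Sequence." per the #[format] attribute above

# The composition applies each pre-tokenizer in order: Whitespace splits the
# sentence into words, then Digits splits "911" into single digits.
print(pt.pre_tokenize_str("Call 911"))
# [('Call', (0, 4)), ('9', (5, 6)), ('1', (6, 7)), ('1', (7, 8))]
```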
9 changes: 5 additions & 4 deletions bindings/python/src/processors.rs
@@ -1,12 +1,12 @@
use std::convert::TryInto;
use std::sync::Arc;

use crate::encoding::PyEncoding;
use crate::error::ToPyResult;
use pyo3::exceptions;
use pyo3::prelude::*;
use pyo3::types::*;

use crate::encoding::PyEncoding;
use crate::error::ToPyResult;
use pyo3_special_method_derive_0_21::{Repr, Str};
use serde::{Deserialize, Serialize};
use tk::processors::bert::BertProcessing;
use tk::processors::byte_level::ByteLevel;
@@ -27,7 +27,8 @@ use tokenizers as tk;
name = "PostProcessor",
subclass
)]
#[derive(Clone, Deserialize, Serialize)]
#[derive(Clone, Deserialize, Serialize, Str, Repr)]
#[format(fmt = "{}")]
pub struct PyPostProcessor {
#[serde(flatten)]
pub processor: Arc<PostProcessorWrapper>,
27 changes: 14 additions & 13 deletions bindings/python/src/tokenizer.rs
@@ -1,12 +1,23 @@
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

use super::decoders::PyDecoder;
use super::encoding::PyEncoding;
use super::error::{PyError, ToPyResult};
use super::models::PyModel;
use super::normalizers::PyNormalizer;
use super::pre_tokenizers::PyPreTokenizer;
use super::trainers::PyTrainer;
use crate::processors::PyPostProcessor;
use crate::utils::{MaybeSizedIterator, PyBufferedIterator};
use numpy::{npyffi, PyArray1};
use pyo3::class::basic::CompareOp;
use pyo3::exceptions;
use pyo3::intern;
use pyo3::prelude::*;
use pyo3::types::*;
use pyo3_special_method_derive_0_21::{Repr, Str};
use std::collections::BTreeMap;
use tk::models::bpe::BPE;
use tk::tokenizer::{
Model, PaddingDirection, PaddingParams, PaddingStrategy, PostProcessor, TokenizerImpl,
@@ -15,17 +26,6 @@ use tk::tokenizer::{
use tk::utils::iter::ResultShunt;
use tokenizers as tk;

use super::decoders::PyDecoder;
use super::encoding::PyEncoding;
use super::error::{PyError, ToPyResult};
use super::models::PyModel;
use super::normalizers::PyNormalizer;
use super::pre_tokenizers::PyPreTokenizer;
use super::trainers::PyTrainer;
use crate::processors::PyPostProcessor;
use crate::utils::{MaybeSizedIterator, PyBufferedIterator};
use std::collections::BTreeMap;

/// Represents a token that can be be added to a :class:`~tokenizers.Tokenizer`.
/// It can have special options that defines the way it should behave.
///
@@ -462,9 +462,10 @@ type Tokenizer = TokenizerImpl<PyModel, PyNormalizer, PyPreTokenizer, PyPostProc
/// The core algorithm that this :obj:`Tokenizer` should be using.
///
#[pyclass(dict, module = "tokenizers", name = "Tokenizer")]
#[derive(Clone)]
#[derive(Clone, Str, Repr)]
#[format(fmt = "{}")]
pub struct PyTokenizer {
tokenizer: Tokenizer,
pub tokenizer: Tokenizer,
Comment (Collaborator Author): Not a requirement

}

impl PyTokenizer {
1 change: 0 additions & 1 deletion bindings/python/src/utils/normalization.rs
@@ -6,7 +6,6 @@ use pyo3::prelude::*;
use pyo3::types::*;
use tk::normalizer::{char_to_bytes, NormalizedString, Range, SplitDelimiterBehavior};
use tk::pattern::Pattern;

/// Represents a Pattern as used by `NormalizedString`
#[derive(Clone, FromPyObject)]
pub enum PyPattern {
1 change: 1 addition & 0 deletions tokenizers/Cargo.toml
@@ -63,6 +63,7 @@ fancy-regex = { version = "0.13", optional = true}
getrandom = { version = "0.2.10" }
esaxx-rs = { version = "0.1.10", default-features = false, features=[]}
monostate = "0.1.12"
pyo3_special_method_derive_0_21 = {path = "../../pyo3-special-method-derive/pyo3_special_method_derive_0_21"}
Comment (Collaborator Author): Do not forget to remove


[features]
default = ["progressbar", "onig", "esaxx_fast"]
5 changes: 2 additions & 3 deletions tokenizers/src/decoders/bpe.rs
@@ -1,8 +1,7 @@
use crate::tokenizer::{Decoder, Result};

use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay};
use serde::{Deserialize, Serialize};

#[derive(Deserialize, Clone, Debug, Serialize)]
#[derive(Deserialize, Clone, AutoDebug, Serialize, AutoDisplay)]
/// Allows decoding Original BPE by joining all the tokens and then replacing
/// the suffix used to identify end-of-words by whitespaces
#[serde(tag = "type")]
6 changes: 3 additions & 3 deletions tokenizers/src/decoders/byte_fallback.rs
@@ -1,13 +1,13 @@
use crate::tokenizer::{Decoder, Result};
use monostate::MustBe;

use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay};
use serde::{Deserialize, Serialize};

#[derive(Deserialize, Clone, Debug, Serialize, Default)]
#[derive(Deserialize, Clone, AutoDebug, Serialize, Default, AutoDisplay)]
/// ByteFallback is a simple trick which converts tokens looking like `<0x61>`
/// to pure bytes, and attempts to make them into a string. If the tokens
/// cannot be decoded you will get � instead for each inconvertable byte token
#[non_exhaustive]
#[format(fmt = "ByteFallback")]
pub struct ByteFallback {
#[serde(rename = "type")]
type_: MustBe!("ByteFallback"),
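The doc comment above describes `ByteFallback` converting `<0xNN>` tokens back into raw bytes. A minimal sketch of that behaviour through the Python bindings; treat the decoded string as the expected value implied by the doc comment, not a test taken from this PR:

```python
from tokenizers import decoders

dec = decoders.ByteFallback()

# <0x61> and <0x62> are the byte tokens for "a" and "b"; other tokens pass through.
print(dec.decode(["<0x61>", "<0x62>", "c"]))  # expected: "abc"
```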
4 changes: 2 additions & 2 deletions tokenizers/src/decoders/ctc.rs
@@ -1,10 +1,10 @@
use crate::decoders::wordpiece;
use crate::tokenizer::{Decoder, Result};

use itertools::Itertools;
use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay};
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
#[derive(AutoDebug, Clone, Serialize, Deserialize, AutoDisplay)]
/// The CTC (Connectionist Temporal Classification) decoder takes care
/// of sanitizing a list of inputs token.
/// Due to some alignement problem the output of some models can come
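The CTC decoder's doc comment above is about sanitizing raw model outputs. A small sketch, assuming the usual CTC cleanup of collapsing consecutive repeats and dropping the pad token (default `pad_token="<pad>"`):

```python
from tokenizers import decoders

dec = decoders.CTC()  # defaults: pad_token="<pad>", word_delimiter_token="|", cleanup=True

# Consecutive duplicates are collapsed and pad tokens are removed.
tokens = ["<pad>", "h", "h", "e", "e", "l", "l", "<pad>", "l", "o", "o"]
print(dec.decode(tokens))  # expected: "hello"
```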
5 changes: 3 additions & 2 deletions tokenizers/src/decoders/fuse.rs
@@ -1,13 +1,14 @@
use crate::tokenizer::{Decoder, Result};
use monostate::MustBe;
use pyo3_special_method_derive_0_21::{AutoDebug, AutoDisplay};
use serde::{Deserialize, Serialize};

#[derive(Clone, Debug, Serialize, Deserialize, Default)]
#[derive(Clone, AutoDebug, Serialize, Deserialize, Default, AutoDisplay)]
/// Fuse simply fuses all tokens into one big string.
/// It's usually the last decoding step anyway, but this
/// decoder exists incase some decoders need to happen after that
/// step
#[non_exhaustive]
#[format(fmt = "Fuse")]
pub struct Fuse {
#[serde(rename = "type")]
type_: MustBe!("Fuse"),
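`Fuse`, as its doc comment says, simply concatenates all remaining tokens into one string, typically as the last decoding step. A one-line sketch of that behaviour:

```python
from tokenizers import decoders

print(decoders.Fuse().decode(["Hel", "lo", " there"]))  # expected: "Hello there"
```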