Add display capabilities to tokenizers objects #1542
Closed
Changes from 98 commits
Commits
104 commits
61804d9 initial commit (ArthurZucker)
a56da5f will this work? (ArthurZucker)
f1a6a97 make it work for the model for now (ArthurZucker)
4a49530 updates (ArthurZucker)
f4af616 update (ArthurZucker)
88630dc add metaspace (ArthurZucker)
b9d44da update (ArthurZucker)
a90ec22 does not work (ArthurZucker)
2224275 current modifications (ArthurZucker)
4d9204e current status (ArthurZucker)
4c2aca1 working shit (ArthurZucker)
904ce70 this kinda works (ArthurZucker)
6413810 finallllly! (ArthurZucker)
fda66f5 nits (ArthurZucker)
20c9fc4 updates (ArthurZucker)
86c77b6 almost there (ArthurZucker)
a429642 update (ArthurZucker)
3cec010 more nits (ArthurZucker)
8d77286 nit (ArthurZucker)
e48cd3a Update bindings/python/src/pre_tokenizers.rs (ArthurZucker)
27576e5 ips (ArthurZucker)
0d9a452 Merge branch 'add-display' of github.com:huggingface/tokenizers into … (ArthurZucker)
35373de Merge branch 'add-display' of github.com:huggingface/tokenizers into … (ArthurZucker)
1c6d272 update (ArthurZucker)
df51116 update and fix (ArthurZucker)
4b4b833 Merge branch 'add-display' of github.com:huggingface/tokenizers into … (ArthurZucker)
59a89c9 only commit one line (ArthurZucker)
ac9b849 update (ArthurZucker)
a3f7439 update the added vocab string (ArthurZucker)
cf5b6f3 nit (ArthurZucker)
5d33243 fix sequence's display (ArthurZucker)
b73c43d update display for normalizer sequence (ArthurZucker)
3e16df7 Merge branch 'main' of github.com:huggingface/tokenizers into add-dis… (ArthurZucker)
b214d77 style (ArthurZucker)
0654831 small nit (ArthurZucker)
a15e3cc updates to cleanup (ArthurZucker)
6023192 update (ArthurZucker)
ebf1258 update (ArthurZucker)
477a9b5 nits (ArthurZucker)
93a1e63 fix some stuff (ArthurZucker)
7591f2b update sequence for pre_tokenizers using fold (ArthurZucker)
f50e4e0 update (ArthurZucker)
4f15052 proper padding derive (ArthurZucker)
85c7b69 update trunctation for consistency (ArthurZucker)
0a16ca0 clean (ArthurZucker)
35d442d styling (ArthurZucker)
a3cc764 update added tokens decoder as getter (ArthurZucker)
5b20fa7 update init property (ArthurZucker)
15f877e nit (ArthurZucker)
9c45e8f update sequences and basic enums to show xxxx.Sequence (ArthurZucker)
4a34870 update (ArthurZucker)
e0d35e0 update (ArthurZucker)
fe95add some finishing touch (ArthurZucker)
2770099 Update bindings/python/Cargo.toml (ArthurZucker)
3d0eb0a nit (ArthurZucker)
11a3601 gracefully handle errors for the proc macro (ArthurZucker)
f6fa136 Merge branch 'add-display' of github.com:huggingface/tokenizers into … (ArthurZucker)
2a54482 remove derive_more (ArthurZucker)
998b2a3 update my custom macro (ArthurZucker)
4df6cc2 replace derive more (ArthurZucker)
a9c6c61 Merge branch 'main' into add-display (ArthurZucker)
aefdc91 stash (ArthurZucker)
f67af9c updates (ArthurZucker)
4c3f37a update display derive (ArthurZucker)
292475f blindly fix stuff (ArthurZucker)
99cb054 maybe work (ArthurZucker)
5c930e9 remove tests from vendored parsing (ArthurZucker)
f87bb97 update (ArthurZucker)
c4b4f3c simplify some stuff (ArthurZucker)
e712079 current status, not bad but not soooooo good (ArthurZucker)
5540136 is this a good start? (ArthurZucker)
ba03c16 small changes (ArthurZucker)
d0e741b format does not work yet (ArthurZucker)
19afb66 some cleanup of unnecessary things (ArthurZucker)
9559dea nit (ArthurZucker)
e53f4ca current status (ArthurZucker)
18238dd let's just go with this (ArthurZucker)
269ff21 update (ArthurZucker)
e799602 Merge branch 'main' into add-display (ArthurZucker)
93ad593 update (ArthurZucker)
3aa0138 derive auto display (ArthurZucker)
011340b nit (ArthurZucker)
3fc31d0 nice (ArthurZucker)
51d3f61 updates (ArthurZucker)
acb8196 Merge branch 'main' into add-display (ArthurZucker)
951b6e6 deos (ArthurZucker)
c2a320c Merge branch 'add-display' of github.com:huggingface/tokenizers into … (ArthurZucker)
2048c02 fix build (ArthurZucker)
104fe0c Use pyo3 smd v0.21 (#1574) (EricLBuehler)
7db6109 stash commit, wanna make sure this is recorded (ArthurZucker)
c7cd927 what works a bit ? (ArthurZucker)
e4cf65a update (ArthurZucker)
39ffc28 fix tokenizer's wrapping (ArthurZucker)
0a3bb18 fix normalizer display (ArthurZucker)
c436b23 fix! (ArthurZucker)
e5b059f final touch? (ArthurZucker)
ff825a7 full autodebug (ArthurZucker)
c30df0c remove dict and dir as it's gonna be a bit more involved (ArthurZucker)
b78e11c remove pub where it is not necessary (ArthurZucker)
a99c645 fmt = (ArthurZucker)
9022470 formating (ArthurZucker)
64b8df0 remove non needed fm (ArthurZucker)
27cad45 so we only need format when the visibility is not pub but pub(crate) (ArthurZucker)
ceabef3 Merge branch 'main' into add-display (ArthurZucker)
@@ -1,12 +1,23 @@
 use std::collections::{hash_map::DefaultHasher, HashMap};
 use std::hash::{Hash, Hasher};

+use super::decoders::PyDecoder;
+use super::encoding::PyEncoding;
+use super::error::{PyError, ToPyResult};
+use super::models::PyModel;
+use super::normalizers::PyNormalizer;
+use super::pre_tokenizers::PyPreTokenizer;
+use super::trainers::PyTrainer;
+use crate::processors::PyPostProcessor;
+use crate::utils::{MaybeSizedIterator, PyBufferedIterator};
 use numpy::{npyffi, PyArray1};
 use pyo3::class::basic::CompareOp;
 use pyo3::exceptions;
 use pyo3::intern;
 use pyo3::prelude::*;
 use pyo3::types::*;
+use pyo3_special_method_derive_0_21::{Repr, Str};
+use std::collections::BTreeMap;
 use tk::models::bpe::BPE;
 use tk::tokenizer::{
     Model, PaddingDirection, PaddingParams, PaddingStrategy, PostProcessor, TokenizerImpl,

@@ -15,17 +26,6 @@ use tk::tokenizer::{
 use tk::utils::iter::ResultShunt;
 use tokenizers as tk;

-use super::decoders::PyDecoder;
-use super::encoding::PyEncoding;
-use super::error::{PyError, ToPyResult};
-use super::models::PyModel;
-use super::normalizers::PyNormalizer;
-use super::pre_tokenizers::PyPreTokenizer;
-use super::trainers::PyTrainer;
-use crate::processors::PyPostProcessor;
-use crate::utils::{MaybeSizedIterator, PyBufferedIterator};
-use std::collections::BTreeMap;
-
 /// Represents a token that can be be added to a :class:`~tokenizers.Tokenizer`.
 /// It can have special options that defines the way it should behave.
 ///

@@ -462,9 +462,11 @@ type Tokenizer = TokenizerImpl<PyModel, PyNormalizer, PyPreTokenizer, PyPostProc
 /// The core algorithm that this :obj:`Tokenizer` should be using.
 ///
 #[pyclass(dict, module = "tokenizers", name = "Tokenizer")]
-#[derive(Clone)]
+#[derive(Clone, Str, Repr)]
+#[format(fmt = "{}")]
 pub struct PyTokenizer {
-    tokenizer: Tokenizer,
+    #[format(fmt = "{}")]
+    pub tokenizer: Tokenizer,
[Review comment] Not a requirement
 }

 impl PyTokenizer {
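To illustrate what the `Str` and `Repr` derives with `#[format(fmt = "{}")]` appear intended to achieve, here is a minimal plain-Rust sketch in which a wrapper struct prints as its single inner field. It is an assumption inferred from the diff, not the macro's actual expansion: it uses neither pyo3 nor the derive crate, and `TokenizerSketch`, `PyTokenizerSketch`, and `describe` are hypothetical names invented for the example.

```rust
use std::fmt;

// Hypothetical stand-in for the wrapped `Tokenizer` type (not the real one).
struct TokenizerSketch {
    model: String,
    vocab_size: usize,
}

impl fmt::Display for TokenizerSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Tokenizer(model={}, vocab_size={})", self.model, self.vocab_size)
    }
}

// Hypothetical stand-in for `PyTokenizer`: a thin wrapper around one field.
struct PyTokenizerSketch {
    tokenizer: TokenizerSketch,
}

// Hand-written delegation: the wrapper is formatted as its inner field via "{}".
// This mirrors the *assumed* effect of `#[format(fmt = "{}")]` in the diff.
impl fmt::Display for PyTokenizerSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.tokenizer)
    }
}

// In this sketch the repr-like output reuses the same delegation.
impl fmt::Debug for PyTokenizerSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.tokenizer)
    }
}

// Small helper so the behaviour can be checked without a Python interpreter.
fn describe(model: &str, vocab_size: usize) -> String {
    let wrapper = PyTokenizerSketch {
        tokenizer: TokenizerSketch {
            model: model.to_string(),
            vocab_size,
        },
    };
    wrapper.to_string()
}
```

Calling `describe("BPE", 3)` yields the inner type's text with no extra wrapper name around it. In the real bindings the derived methods would presumably be exposed to Python as `__str__` and `__repr__`.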
@@ -63,6 +63,7 @@ fancy-regex = { version = "0.13", optional = true}
 getrandom = { version = "0.2.10" }
 esaxx-rs = { version = "0.1.10", default-features = false, features=[]}
 monostate = "0.1.12"
+pyo3_special_method_derive_0_21 = {path = "../../pyo3-special-method-derive/pyo3_special_method_derive_0_21"}
[Review comment] Do not forget to remove

 [features]
 default = ["progressbar", "onig", "esaxx_fast"]
[Review comment] Not implemented yet so skipping for now