
Request for a pre-tokenizer that creates words based on length alone #1697

Open
filbeofITK opened this issue Dec 10, 2024 · 0 comments

Comments

@filbeofITK

Hello! I would like to request a fast pre-tokenizer that simply splits the input into contiguous segments of a pre-defined length. I know this is not a common need in NLP, but it is necessary for my use case: I'm processing DNA data, which has no spaces or separators of any kind, so I want to use fixed-length tokens.

For someone who actually knows Rust and the backend, implementing this would probably take less than half an hour, but I don't want to learn a new language just for this.

Biggest thanks!
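Until such a pre-tokenizer exists in the Rust backend, here is a minimal sketch of a possible workaround using the library's existing Python API, assuming the `Split` pre-tokenizer accepts a `tokenizers.Regex` pattern and that the `isolated` behavior keeps each match as its own segment; the segment length `K` below is a hypothetical parameter:

```python
from tokenizers import Regex, pre_tokenizers

K = 6  # hypothetical fixed segment length (e.g. 6-mers for DNA)

# Match runs of up to K characters; "isolated" keeps each match as its own piece,
# so the input is cut into contiguous fixed-length chunks (the last may be shorter).
fixed_length = pre_tokenizers.Split(Regex(f".{{1,{K}}}"), behavior="isolated")

print(fixed_length.pre_tokenize_str("ACGTACGTACGTAC"))
# Expected: [('ACGTAC', (0, 6)), ('GTACGT', (6, 12)), ('AC', (12, 14))]
```

If this behaves as expected, the splitter can be attached to a tokenizer via `tokenizer.pre_tokenizer = fixed_length` before training or encoding.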
