
Inquiry on Extending Algorithm to Other Languages #30

Open
dsdanielpark opened this issue Dec 12, 2023 · 2 comments
Comments

@dsdanielpark

Impressed by Your Project

Dear alasdairforsythe,

I am genuinely impressed by your wonderful project and appreciate your sharing it. Thank you sincerely.

Inquiry on Documentation and Algorithm

I'm curious to know if there is any simple explanation or documentation about the entire development process of your project.

If not, could you please provide a brief description of the overall algorithm, even if it's very approximate? I am familiar with concepts like BPE, BBPE, unigram, n-gram, and WordPiece, as well as packages such as SentencePiece, tiktoken, tokenizers, and transformers. So feel free to skip the basics and share directly what improvements you've made, the overall development process, your objectives, and the approaches you took to solve specific problems.

Inquiry on Extending Algorithm to Other Languages

I read on Reddit that your focus was on speed improvements, but I noticed you also reduced the vocab size. Could you elaborate on your overall approach to this?

Additionally, I am curious about where to start with your package to develop an efficient tokenizer for Korean. While I'm considering the BBPE method for creating an efficient Korean vocab, your advanced work in this area has prompted me to reach out for guidance.

Thank you for your time and insights.

Sincerely,
Daniel

@alasdairforsythe
Owner

I'll answer briefly:
The training algorithm uses brute force to find the optimal set of tokens to represent your chosen dataset, given any specific tokenization algorithm. You can see that it works, because if you run it multiple times you get the same tokens out (give or take a couple that are roughly equal). All the cleverness in my code is about getting it to do this quickly enough. So basically, the training process doesn't have an opinion about information theory or compression - I didn't even bother to research that. It just tries everything, and my specialty of micro-optimization means I programmed it to be fast enough to be usable.
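
To make that idea concrete, here is a minimal, hypothetical Python sketch of that kind of exhaustive search. It is not TokenMonster's actual implementation, and the helper names (`greedy_tokenize`, `best_vocab`) are invented purely for illustration: every candidate vocabulary is scored by how few tokens it needs to cover the text, and the best-scoring one wins, which is why repeated runs land on the same tokens.

```python
# Hypothetical sketch of the brute-force idea described above.
# Not TokenMonster's code: it just scores every candidate vocabulary by how
# few tokens it needs to cover the text and keeps the best one.
from itertools import combinations


def greedy_tokenize(text, vocab):
    """Count tokens used when greedily matching the longest token in `vocab`
    at each position, falling back to single characters."""
    max_len = max(len(t) for t in vocab)
    i, count = 0, 0
    while i < len(text):
        for length in range(min(len(text) - i, max_len), 0, -1):
            if text[i:i + length] in vocab:
                i += length
                break
        else:
            i += 1  # no token matched: emit a single-character fallback
        count += 1
    return count


def best_vocab(text, candidates, vocab_size):
    """Brute force: try every `vocab_size`-token subset of `candidates` and
    return the one that tokenizes `text` into the fewest tokens."""
    best, best_count = None, float("inf")
    for subset in combinations(candidates, vocab_size):
        count = greedy_tokenize(text, set(subset))
        if count < best_count:
            best, best_count = set(subset), count
    return best, best_count


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    candidates = ["the", " ", "cat", "sat", "on", "mat", "at", "he", "t"]
    vocab, count = best_vocab(sample, candidates, vocab_size=6)
    print(count, "tokens using vocab:", sorted(vocab))
```

On any realistic dataset the number of candidate subsets explodes combinatorially, which is why the real training code depends on the heavy optimization mentioned above rather than a naive loop like this one.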

@dsdanielpark
Author

@alasdairforsythe

Thank you for the kind response. I will also try optimizing BBPE and a few other algorithms and share feedback on TokenMonster. If I have any questions while building the tokenizer, I will be sure to ask. Thank you for this wonderful project.
