Add utf8 character support #12
base: master
Conversation
Wow, this is just what I was going to do, thank you.
@VitoVan You can also try the original code on byte input without any modification. In my experiment, the trained LSTM model can actually learn the UTF-8 encoding of Chinese characters. I didn't see any broken code points in the generated text.
------ADDED 2015-6-12 16:43:56------ @5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?
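For reference, the distinction being discussed can be seen by splitting the same string both ways. This is a minimal Python sketch with a made-up sample string (the repository itself is Lua/Torch, so this is only an illustration of the idea, not the PR's code):

```python
# Byte-level vocabulary vs. UTF-8 character-level vocabulary for the same text.
text = "你好，世界"  # hypothetical sample input

byte_tokens = text.encode("utf-8")  # raw bytes: what a byte-level loader sees
char_tokens = list(text)            # Unicode code points: what character-level splitting sees

print(len(byte_tokens), len(set(byte_tokens)))  # 15 bytes; multi-byte characters are split apart
print(len(char_tokens), len(set(char_tokens)))  # 5 characters; each CJK character is one vocab entry
```

With byte input the model has to learn that certain byte sequences form valid UTF-8 characters; with character input each Chinese character is a single symbol, at the cost of a much larger vocabulary.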
I assume this code is backwards compatible with previous datasets?
I think so; I haven't tested it yet.
This patch increases the vocab size a lot. I have a 16MB dataset. The original code generates a vocab of size 230, but this code generates a vocab of 180,128, which needs 241GB of memory to load.
I just realized that my dataset is not UTF-8. Still, this may break support for input streams other than text. And the vocab generated from a UTF-8 dataset is also bigger than the original size.
Presumably the advantage is that the model doesn't have to spend effort learning how to construct Unicode code points, and won't ever emit invalid ones. But the increase in vocab size will vastly increase the number of parameters in the fully connected Linear layers, as far as I can see. Based on my calcs at https://www.reddit.com/r/MachineLearning/comments/3ejizl/karpathy_charrnn_doubt/ctfndk6, the number of weights grows with vocab_size: e.g., if rnn_size is 128 and vocab_size is 96, the number of weights is about 128K, which takes 512KB of memory (4 bytes per float); but if vocab_size is 180,128, the number of weights is about 115M, which takes 460MB of memory.
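A back-of-the-envelope version of that calculation, assuming a single-layer LSTM whose dominant terms are the four gate matrices of shape rnn_size × (vocab_size + rnn_size) plus the rnn_size × vocab_size output projection (the exact formula in the linked thread may differ slightly, and biases are ignored):

```python
# Rough parameter-count sketch for a single-layer LSTM plus output projection.
def approx_weights(rnn_size, vocab_size):
    lstm = 4 * rnn_size * (vocab_size + rnn_size)  # 4 gate matrices over [input, hidden]
    output = rnn_size * vocab_size                 # hidden -> vocab logits
    return lstm + output

for vocab_size in (96, 180128):
    n = approx_weights(128, vocab_size)
    print(f"vocab_size={vocab_size}: ~{n:,} weights, ~{n * 4 / 1e6:.1f} MB at 4 bytes/float")
```

With rnn_size=128 this gives roughly 127K weights (~0.5 MB) for a vocab of 96 and roughly 115M weights (~460 MB) for a vocab of 180,128, matching the figures quoted above; the layers touching the vocabulary dominate once the vocab gets large.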
Hmmm, but actually, I don't remember there being that many Chinese characters. I think only 10 to 20 thousand are in normal usage?
What is the status on this?
I tested it by feeding the model a Chinese novel, and it produced some interesting results.