
Add utf8 character support #12

Open · wants to merge 1 commit into master

Conversation

@5kg commented May 27, 2015

I tested this by feeding the model a Chinese novel, and it produced some interesting results.

@VitoVan commented Jun 12, 2015

Wow, this is just what I'm going to do, thank you.

@5kg (Author) commented Jun 12, 2015

@VitoVan You can also try the original code on byte input without any modification.

In my experiment, the trained LSTM model actually learned the UTF-8 encoding of Chinese characters; I didn't see any broken codepoints in the generated text.
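
For anyone curious about the difference being discussed, here is a minimal standalone Lua sketch (an illustration only, not code from this PR): with byte input the model sees each Chinese character as several bytes and has to learn to emit them in a valid order, which is what @5kg is observing, whereas a character-level vocab treats each UTF-8 character as a single symbol.

```lua
-- Illustration only: byte-level vs UTF-8 character-level input.
-- "深" is one character but three bytes in UTF-8, so a byte-level model
-- must learn to emit all three bytes in order to produce a valid code point.
local s = "深度学习"            -- 4 Chinese characters, 12 bytes in UTF-8

print("#bytes:", #s)            -- 12

-- split on UTF-8 sequence boundaries: one lead byte + continuation bytes
local chars = {}
for ch in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
  chars[#chars + 1] = ch
end
print("#characters:", #chars)   -- 4
```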

@VitoVan commented Jun 12, 2015

------ADDED 2015-6-12 16:43:56------
Sorry, I'm new to Lua, so this may be a stupid question:
------ADDED 2015-6-12 16:43:56------

@5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?

@inDream mentioned this pull request Jun 17, 2015
@karpathy (Owner) commented Jun 17, 2015

I assume this code is backwards compatible with previous datasets?

@VitoVan commented Jun 17, 2015

I think so, but I haven't tested it.


@wb14123 commented Jun 17, 2015

This patch increases the vocabulary size a lot. I have a 16 MB dataset; the original code generates a vocab of size 230, but this code generates a vocab of 180,128, which would need 241 GB of memory to load.

@wb14123 commented Jun 17, 2015

I just realized that my dataset is not UTF-8. But this may break support for input streams other than text. And the vocab generated from a UTF-8 dataset is also bigger than the original size.
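
For what it's worth, a quick way to check this before training is to compare the byte-level vocab with the UTF-8 character-level vocab. The following is a standalone Lua sketch, not code from this PR, and `data/input.txt` is just a placeholder path; on genuine UTF-8 text the character vocab should be at most a few thousand entries, and a count like 180,128 usually means the file isn't valid UTF-8, so the pattern is grouping bytes into bogus "characters".

```lua
-- Illustration only: compare byte-level and UTF-8 character-level
-- vocabulary sizes for a dataset.
local path = "data/input.txt"              -- placeholder path
local f = assert(io.open(path, "rb"))
local text = f:read("*a")
f:close()

local function count_distinct(iter)
  local seen, n = {}, 0
  for token in iter do
    if not seen[token] then seen[token] = true; n = n + 1 end
  end
  return n
end

-- byte-level vocab: at most 256 entries, regardless of the data
print("bytes:", count_distinct(text:gmatch(".")))
-- UTF-8 character-level vocab: a few thousand at most for real text;
-- a huge number here suggests the file is not valid UTF-8
print("utf-8 chars:", count_distinct(text:gmatch("[%z\1-\127\194-\244][\128-\191]*")))
```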

@hughperkins (Contributor) commented:

> @5kg I haven't tried it yet. Well then, if the original code works well, what's the point of this pull request? To make learning faster on Chinese?

Presumably the advantage is that the model doesn't have to spend effort learning how to construct Unicode code points, and won't ever emit invalid ones.

But the increase in vocab size will vastly increase the number of parameters in the fully-connected Linear layers, as far as I can see. Based on my calculations at https://www.reddit.com/r/MachineLearning/comments/3ejizl/karpathy_charrnn_doubt/ctfndk6, the number of weights is:

4 * rnn_size * ( vocab_size + rnn_size + 2 ) + (rnn_size + 1) * vocab_size

e.g. if rnn_size is 128 and vocab_size is, say, 96, then the number of weights is about 128K, which takes about 512 KB of memory (4 bytes per float),

but if vocab_size is 180,128, then the number of weights is about 115M, which takes about 460 MB of memory.
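
For reference, here is that arithmetic as a standalone Lua sketch (it uses the formula quoted above with rnn_size = 128 and the two vocab sizes mentioned in this thread; it is not code from the repo):

```lua
-- Weight count for a single-layer LSTM char-rnn, per the formula above:
--   4 * rnn_size * (vocab_size + rnn_size + 2) + (rnn_size + 1) * vocab_size
local function num_weights(rnn_size, vocab_size)
  return 4 * rnn_size * (vocab_size + rnn_size + 2)
       + (rnn_size + 1) * vocab_size
end

local rnn_size = 128
for _, vocab_size in ipairs({96, 180128}) do
  local n = num_weights(rnn_size, vocab_size)
  print(string.format("vocab_size = %6d -> %9d weights, ~%.1f MB as float32",
                      vocab_size, n, n * 4 / 1e6))
end
-- prints roughly 128K weights (~0.5 MB) and ~115.5M weights (~462 MB)
```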

@hughperkins (Contributor) commented:

Hmmm, but actually, I don't remember there being that many Chinese characters. I think there are only 10 to 20 thousand in normal usage?

@InnovativeInventor commented:

What is the status on this?
