Change internal representation to UTF-8 #9
Comments
UTF-8 is a good choice for the internal representation. Nevertheless, the modules responsible for finding similar words should, in my opinion, use UTF-32 internally:
Pros of using UTF-32 internally in the above classes
Cons of using UTF-32 internally in the above classes
I think that the pros far outweigh the cons.
From my point of view:
Currently, we use UCS-2 as the internal encoding, which prevents us from using Unicode characters outside the BMP.
We should change the internal representation; the current plan is to use UTF-8:
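To make the UCS-2 limitation concrete, here is a minimal sketch. The helper names `FitsInUcs2` and `EncodeUtf16` are hypothetical and not part of the project. It shows that a code point above U+FFFF does not fit into a single 16-bit unit and needs a UTF-16 surrogate pair, which UCS-2 does not have.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the project): true if the code point lies in
// the Basic Multilingual Plane and therefore fits in one UCS-2 code unit.
bool FitsInUcs2(char32_t cp) {
    return cp <= 0xFFFF;
}

// Hypothetical helper: encode one code point as UTF-16 code units.
// BMP code points take one unit (identical to UCS-2); everything above
// U+FFFF takes a surrogate pair. Assumes cp is a valid scalar value
// (no check for the surrogate range or for values above U+10FFFF).
std::vector<std::uint16_t> EncodeUtf16(char32_t cp) {
    if (cp < 0x10000) {
        return { static_cast<std::uint16_t>(cp) };
    }
    cp -= 0x10000;  // 20 bits remain, split into two 10-bit halves
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };
}
```

For example, U+0159 (ř) fits in one unit, while U+1F600 becomes the two units 0xD83D 0xDE00, which a pure UCS-2 representation cannot store as one character.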
- char and string datatypes
- SimWordsFinder::Find will have to interpret the UTF-8 encoding and understand that one Unicode character can be represented as multiple code units. Maybe the input word will be converted to UTF-32, but I do not think so, because both the lexicon and the error model will be in UTF-8.

The alternative to UTF-8 is to use UTF-32, but
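As an illustration of what "interpreting the UTF-8 encoding" involves, the following is a minimal sketch of decoding a UTF-8 `std::string` into UTF-32 code points. The function name `DecodeUtf8` is hypothetical and this is not the project's actual code. It only demonstrates that one character may span one to four code units.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the project): decode a UTF-8 byte string into
// UTF-32 code points. Throws std::invalid_argument on a bad lead byte, a bad
// continuation byte, or a truncated sequence. For brevity it does NOT reject
// overlong encodings, surrogates, or values above U+10FFFF.
std::u32string DecodeUtf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)              { cp = lead;        len = 1; }  // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }  // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c >> 6) != 0x02) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With such a helper, the two-byte input `"\xC5\x99"` decodes to the single code point U+0159, and the four-byte input `"\xF0\x9F\x98\x80"` decodes to U+1F600. Whether the conversion happens once on the input word or the search walks the UTF-8 bytes directly is the design question raised above.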