Automatise handling of diacritics #26
Comments
More about the second case: The reason we want such a base letter + combining diacritic as a multichar symbol in all other cases is that it makes life easier to treat things that look like single letters as actually single symbols, even when the underlying Unicode is not a single code point. It is only |
Excellent work in commit 573af75. Only problem is: it fails on macOS, probably due to a different version of |
Another comment: would it be possible to filter out all non-diacritic characters, to avoid both noise and extra CPU time when compiling and composing the generated regex?
I changed awk to gawk, but not the sed command yet; I think GNU sed is also used. We have a configure script in langs for checking this, which could be useful, but I'm not sure everyone even runs configure in core.
One has to run |
Mmh, I now have some checks in core for GNU sed and gawk, and the Unicode filter scripts use the configured programs.
This covers two distinct cases:
In the first case, the pseudocode could go something like this:
In the second case, the pseudocode could be something like the following:
extract all multichar symbols from the fst
get rid of everything that looks like tags, flag diacritics and internal symbols
make a regex to mandatorily turn a multichar base letter + (one or more) combining \
    diacritics into a sequence of single symbols
apply that regex to tokeniser FST's on the __surface__ side
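The second-case pseudocode could be sketched in Python roughly as follows. This is only an illustration, not the build-system implementation: it assumes the multichar symbols have already been extracted from the FST into a list of strings, the function names (`looks_internal`, `is_base_plus_combining`, `decomposable_symbols`, `build_regex`) are hypothetical, the prefixes used to spot tags, flag diacritics and internal symbols are a guessed heuristic, and the emitted rule syntax (`"sym" -> {sym}` joined with `,,`) is an assumption that should be checked against the regex compiler actually in use.

```python
import unicodedata


def looks_internal(symbol: str) -> bool:
    """Heuristic (assumption): tags, flag diacritics and internal symbols
    start or end with one of these punctuation characters."""
    return symbol.startswith(("+", "@", "{", "^", "%", "<", "[")) or symbol.endswith(
        ("+", "@")
    )


def is_base_plus_combining(symbol: str) -> bool:
    """True if the symbol is one letter followed by one or more combining marks."""
    if len(symbol) < 2:
        return False
    base, marks = symbol[0], symbol[1:]
    # Unicode general categories starting with "M" are the mark categories.
    return base.isalpha() and all(
        unicodedata.category(ch).startswith("M") for ch in marks
    )


def decomposable_symbols(symbols):
    """Keep only base letter + combining diacritic(s), in input order, deduplicated."""
    kept = (
        s for s in symbols if not looks_internal(s) and is_base_plus_combining(s)
    )
    return list(dict.fromkeys(kept))


def build_regex(symbols) -> str:
    """Emit one mandatory rewrite rule per symbol (assumed xfst-style syntax):
    the quoted multichar symbol becomes the sequence of its single characters."""
    return " ,, ".join(f'"{s}" -> {{{s}}}' for s in symbols)
```

Filtering down to base letter + combining mark symbols before building the regex also addresses the concern about noise and compile time, since everything else in the symbol list is dropped up front. The resulting rule string would then be compiled and composed on the surface side of the tokeniser FSTs by whatever tool the build system uses.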
With routines like the above integrated into the build system, no-one should ever have to worry about these issues anymore 🙂