Confirm that there are blank control character cmap entries in fonts #3205

Open
chrissimpkins opened this issue Mar 18, 2021 · 3 comments

@chrissimpkins
Member

chrissimpkins commented Mar 18, 2021

We recognized that some Noto fonts (e.g., Noto Sans Hebrew) include bidi control characters.

According to @raphlinus:

I believe that BiDi control characters (that includes 200E and 200F, along with 202A-202E and 2066-2069) are handled entirely in the text shaping and layout engine, and do not need cmap entries in the font.

and @simoncozens:

you don't normally have explicit glyphs in the font for control characters, since they are handled higher up the text-processing stack and won't appear in runs to be shaped.

It would be useful to add a check for control characters in this range and flag at WARN level. It does not violate spec, but appears to be unnecessary and (likely very slightly) increases file size.
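As a rough illustration of the proposed check, here is a minimal sketch. It assumes the cmap is available as a plain mapping from code point to glyph name (fontTools' `TTFont.getBestCmap()` returns one in that shape). The names `BIDI_CONTROLS`, `bidi_controls_in_cmap`, and `warn_message` are hypothetical and not part of any existing check.

```python
# Bidi control code points named in the quote above:
# U+200E, U+200F, U+202A..U+202E, U+2066..U+2069.
BIDI_CONTROLS = (
    {0x200E, 0x200F}
    | set(range(0x202A, 0x202E + 1))
    | set(range(0x2066, 0x2069 + 1))
)


def bidi_controls_in_cmap(cmap):
    """Return the sorted bidi control code points that have a cmap entry.

    `cmap` is assumed to map integer code points to glyph names.
    """
    return sorted(cp for cp in cmap if cp in BIDI_CONTROLS)


def warn_message(cmap):
    """Build a WARN-level message, or return None if nothing was found."""
    found = bidi_controls_in_cmap(cmap)
    if not found:
        return None
    listing = ", ".join(f"U+{cp:04X}" for cp in found)
    return f"WARN: cmap maps bidi control characters: {listing}"
```

A real check would report through the checker's own status mechanism rather than returning a string; the string here only keeps the sketch self-contained.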

Related https://github.com/googlefonts/noto-fonts/issues/2036

@davelab6 davelab6 added this to the 0.7.35 milestone Mar 18, 2021
@davelab6 davelab6 changed the title from "Confirm that there are no bidi control character cmap entries in fonts" to "Confirm that there are blank control character cmap entries in fonts" Mar 18, 2021
@davelab6
Contributor

As I posted on googlefonts/noto-fonts#2036, Behdad proposed it's safer to have empty control chars for scripts that need them, so let's check for that.

@chrissimpkins
Member Author

This is the list of code points based on the current state of the conversation:

0x00AD                            // SOFT HYPHEN
0x034F                            // COMBINING GRAPHEME JOINER
0x061C                            // ARABIC LETTER MARK
(0x200C <= c && c <= 0x200F)      // ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
(0x202A <= c && c <= 0x202E)      // LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
(0x2066 <= c && c <= 0x2069)      // LEFT-TO-RIGHT ISOLATE..POP DIRECTIONAL ISOLATE
0xFEFF                            // BYTE ORDER MARK
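For use in a Python check, the list above could be expressed as a predicate. This is only a sketch that transcribes the ranges as listed; the names `is_listed_control` and `LISTED_CONTROLS` are made up for illustration.

```python
def is_listed_control(c: int) -> bool:
    """True if code point `c` is in the list of control characters above."""
    return (
        c in (0x00AD, 0x034F, 0x061C, 0xFEFF)
        or 0x200C <= c <= 0x200F  # ZERO WIDTH NON-JOINER..RIGHT-TO-LEFT MARK
        or 0x202A <= c <= 0x202E  # LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE
        or 0x2066 <= c <= 0x2069  # LEFT-TO-RIGHT ISOLATE..POP DIRECTIONAL ISOLATE
    )


# All listed code points are in the BMP, so scanning U+0000..U+FFFF is enough.
LISTED_CONTROLS = [c for c in range(0x10000) if is_listed_control(c)]
```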

@davelab6
Contributor

Yes, great :)

So, it seems to me we ought to map which scripts should include which (empty) control characters, then detect the font's script and check, per script, that those characters exist in cmap with no ink data.
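A sketch of how that per-script check might look, under some loud assumptions: the script-to-control-characters mapping is passed in by the caller (no authoritative mapping exists yet), `cmap` maps code points to glyph names, and `has_ink` is a caller-supplied function that reports whether a glyph has any outline data. The function name `check_empty_controls` is hypothetical.

```python
def check_empty_controls(script, cmap, has_ink, expected_by_script):
    """Return (missing, inked) code point lists for the given script.

    script:             script tag for the font, e.g. "Arab" (assumed known)
    cmap:               mapping of code point -> glyph name
    has_ink:            callable(glyph_name) -> bool, True if the glyph draws anything
    expected_by_script: mapping of script tag -> set of expected control code points
                        (placeholder data; to be filled in once a mapping is agreed)
    """
    expected = expected_by_script.get(script, set())
    # Control characters the script should have but the cmap lacks.
    missing = sorted(cp for cp in expected if cp not in cmap)
    # Control characters that are mapped but point at a glyph with ink.
    inked = sorted(cp for cp in expected if cp in cmap and has_ink(cmap[cp]))
    return missing, inked
```

For example, with a placeholder mapping of `{"Arab": {0x200C, 0x200D}}` and a cmap that only contains U+200C, the function reports U+200D as missing. With fontTools, `cmap` could come from `TTFont.getBestCmap()`; how `has_ink` is implemented is left open here.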

Behdad said he isn't aware of such a mapping. A harfbuzz community member contributed some interesting information to the hb wiki last month that might help create such a mapping: harfbuzz/harfbuzz#2862

He also offered this tip:

ZWJ/ZWNJ is useful in all Arabic-joining-like and all Brahmi-like scripts. That's everything going to Arabic, Indic, Khmer, Myanmar, and USE shapers in HarfBuzz. The mapping is in:
