Support embedded directions #21

Hywan · 2015-01-26T12:56:21Z

A string can contain both left-to-right and right-to-left text. We need a better algorithm to guess the current direction of a text :-).

boast · 2015-01-26T14:49:44Z

Hey there, coming from reddit :) Some suggestions for an algorithm to solve this issue:

1. Check whether the string contains LRM (U+200E) or RLM (U+200F), and treat ALM (U+061C, Arabic letter mark) as an alias for RLM, since these marks exist specifically to state the direction in which the string should be interpreted.
   - If it contains both directions, return BIDI (should be added as a constant)
   - else if it only contains LRM, return LTR
   - else if it only contains RLM and/or ALM, return RTL
2. Set the default assumption from the first character.
3. Check whether there are any markers (LRM, LRE, LRO (maybe LRI) and RLM, RLE, RLO (maybe RLI), ALM) that imply a direction change compared to the first character; if so, return BIDI.
4. Check whether the string contains a character from the opposing direction; if so, return BIDI; if not, return the direction assumed from the first character.
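
A minimal sketch of the steps above, written in Python rather than the library's PHP so that it can lean on the standard `unicodedata` module. The names `guess_direction` and `strong_direction` and the string constants are hypothetical, and two choices are assumptions not stated in the proposal: the "first character" is taken to be the first character with a strong direction, and a string with no strong character falls back to LTR.

```python
import unicodedata

LTR, RTL, BIDI = "ltr", "rtl", "bidi"  # hypothetical constants

# Explicit marks named in the proposal.
LTR_MARKS = {"\u200e", "\u202a", "\u202d", "\u2066"}  # LRM, LRE, LRO, LRI
RTL_MARKS = {"\u200f", "\u202b", "\u202e", "\u2067", "\u061c"}  # RLM, RLE, RLO, RLI, Arabic letter mark


def strong_direction(ch):
    """Return LTR, RTL, or None when the character implies no direction."""
    if ch in LTR_MARKS:
        return LTR
    if ch in RTL_MARKS:
        return RTL
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return LTR
    if bidi_class in ("R", "AL"):
        return RTL
    return None  # digits, punctuation, whitespace, ...


def guess_direction(text):
    # Step 1: LRM / RLM / Arabic letter mark decide on their own.
    has_lrm = "\u200e" in text
    has_rlm = "\u200f" in text or "\u061c" in text
    if has_lrm and has_rlm:
        return BIDI
    if has_lrm:
        return LTR
    if has_rlm:
        return RTL

    # Steps 2 to 4: assume the direction of the first strong character,
    # then report BIDI as soon as anything points the other way.
    directions = [d for d in map(strong_direction, text) if d is not None]
    if not directions:
        return LTR  # assumption: default when nothing is strong
    first = directions[0]
    return first if all(d == first for d in directions) else BIDI
```

With this sketch, `guess_direction("hello")` gives `"ltr"`, a purely Hebrew word gives `"rtl"`, and a string mixing both gives `"bidi"`.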

Does this sound reasonable? As I cannot think of any sane way to detect that "私 - is a japanese letter" "should" be LTR, the user has to decide for themselves what to do with BIDI text.

Hywan · 2015-01-26T15:11:22Z

@boast It sounds reasonable, yes. I didn't check how other implementations deal with it. Any PR :-)?

boast · 2015-01-26T15:39:41Z

As for reference implementations: https://github.com/waiting-for-dev/string-direction

Or http://en.wikipedia.org/wiki/Bi-directional_text on that topic (notice the table with the classifications). I'll work on it tonight 👍 However, I will probably need to refactor some methods into protected helper methods to make the checks more granular.

Hywan · 2015-01-27T08:17:23Z

@boast Thank you! :-)

boast · 2015-01-29T13:31:28Z

I tried my best to adapt to the coding style. No tests broken (or let's say: some tests already failed on my Ubuntu dev machine before I changed anything; it seems those collator and normalizer tests are broken, especially when they are not available?), and I added a new one following more or less the spec described above.

Hywan · 2015-03-26T19:17:24Z

ping?

boast · 2015-08-03T12:20:23Z

Hey there, thank you for the ping. I was occupied this half year with my bachelor's degree in CS. ;) We should define our definitive approach for this problem together, and then I / we can work out the implementation. My knowledge about the problem comes specifically from these sources:

http://unicode.org/reports/tr9/
https://en.wikipedia.org/wiki/Bi-directional_text (extensive list)

IMHO, we should first decide on the actual "goal" and "use case" of this method. Why and when is the information "which direction is this text going" needed? Because one can go crazy on the "strong", "weak" and "neutral" characters and contexts...

Hywan · 2015-08-03T12:48:28Z

So far, we use getCharDirection to decide the behavior of append, prepend and other methods. This method only checks the first character. First, we must check the last character. Second, it would be great to have a method to know whether we have bi-directional text. I don't really know why it would be useful yet, but I am sure it will be. We can also add methods to force a change in the direction of the text (maybe we would like to write French in reverse order 😉). And the most useful usages are:

- Iterate over direction portions. It can be particularly useful when transforming the text into HTML, for instance (or PDF, text, etc.).
- Also, with the append and prepend methods for instance, we can say: $str->append('text', $str::RTL); to force appending something in the opposite direction (thus producing bi-directional text).
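
To make "iterate over direction portions" concrete, here is a rough Python sketch (not the library's PHP, and not an existing API). It deliberately does not implement the Unicode Bidirectional Algorithm: as a simplifying assumption, every character without a strong direction just joins the run of the nearest preceding strong character. `direction_runs` and `to_html` are hypothetical names.

```python
import html
import unicodedata


def char_direction(ch):
    """Return "ltr" or "rtl" for strong characters, None otherwise."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "ltr"
    if bidi_class in ("R", "AL"):
        return "rtl"
    return None


def direction_runs(text, default="ltr"):
    """Split text into (direction, substring) portions, in logical order."""
    # Leading neutral characters take the direction of the first strong one.
    current = next((d for d in map(char_direction, text) if d), default)
    runs = []
    for ch in text:
        current = char_direction(ch) or current
        if runs and runs[-1][0] == current:
            runs[-1][1].append(ch)
        else:
            runs.append((current, [ch]))
    return [(direction, "".join(chars)) for direction, chars in runs]


def to_html(text):
    """Wrap each portion in a span carrying an explicit dir attribute."""
    return "".join(
        f'<span dir="{direction}">{html.escape(part)}</span>'
        for direction, part in direction_runs(text)
    )
```

The same list of portions could also feed a portion-by-portion comparison or a forced-direction `append`; only the consumer changes.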

Hywan · 2015-08-03T12:48:48Z

PS: How is your bachelor's going 😉?

Hywan · 2015-08-03T12:59:39Z

Another use case:

When comparing strings, we would compare direction portions, not the whole string at once. These are some of the usages I can think of.

boast · 2015-09-08T10:51:08Z

Hey Ivan,

thanks for asking - my bachelor's is done now, so I think I will find some time to contribute.

I will try to implement the algorithm according to the UNICODE BIDIRECTIONAL ALGORITHM. Especially the table Bidirectional Character Types looks very interesting and is exactly what is lacking as of now ("weak" characters such as numbers and punctuation are not handled correctly by our algorithm).
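
For readers unfamiliar with that table, the following Python sketch groups the bidi classes into the strong / weak / neutral / explicit-formatting categories used by UAX #9. It relies on `unicodedata.bidirectional` from the Python standard library; the function name `bidi_category` is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(ch):
    """Return (bidi class, category name) for a single character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return bidi_class, "strong"
    if bidi_class in WEAK:
        return bidi_class, "weak"
    if bidi_class in NEUTRAL:
        return bidi_class, "neutral"
    if bidi_class in EXPLICIT:
        return bidi_class, "explicit"
    # Python returns "" for code points it has no bidi class for.
    return bidi_class, "unknown"


if __name__ == "__main__":
    for ch in ["a", "\u05d0", "\u0627", "1", "\u0663", " ", "!", "\u202e"]:
        print(f"U+{ord(ch):04X}", *bidi_category(ch))
```

A digit such as `1` comes out as weak (`EN`) and `!` as neutral (`ON`), which is exactly the distinction a first-character check cannot see.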

Hywan · 2015-09-08T11:01:55Z

Excellent news!

boast · 2015-10-14T09:08:39Z

Just a short update: I wrote a small script that parses the official bidi classes from the Unicode Consortium (http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt). It generates an optimized regex (not working at the moment, I am missing something XD). The regex gets quite large, though maybe some more optimizations are possible. The script is a small console app (Bin folder) that allows easy regeneration if the spec changes.

After my regex works, I will implement the unicode bidi algorithm from http://www.unicode.org/reports/tr9/.
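
The script itself is not shown in this thread, so the following is only a guess at its general shape, in Python: parse lines in the `code-or-range ; class # comment` layout used by DerivedBidiClass.txt and turn the ranges of one class into a regex character class. The sample lines are illustrative entries written in that layout, not a verbatim copy of the file, and `parse_bidi_classes` and `character_class` are hypothetical names. Comment lines (which, in the real file, also carry the default values for unlisted code points) are simply skipped here.

```python
import re
from collections import defaultdict

# Illustrative lines in the file's layout (assumption: not copied verbatim).
SAMPLE = """\
# comment lines are ignored
0041..005A    ; L # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; L # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
05D0..05EA    ; R # HEBREW LETTER ALEF..HEBREW LETTER TAV
0030..0039    ; EN # DIGIT ZERO..DIGIT NINE
061C          ; AL # ARABIC LETTER MARK
"""

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_bidi_classes(lines):
    """Map each bidi class to a list of inclusive (start, end) code points."""
    ranges = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        match = LINE.match(data)
        if not match:
            continue  # blank or comment-only line
        start = int(match.group(1), 16)
        end = int(match.group(2) or match.group(1), 16)
        ranges[match.group(3)].append((start, end))
    return dict(ranges)


def character_class(ranges):
    """Build a regex character class such as [\\U000005D0-\\U000005EA]."""
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append(f"\\U{start:08X}")
        else:
            parts.append(f"\\U{start:08X}-\\U{end:08X}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    classes = parse_bidi_classes(SAMPLE.splitlines())
    print(character_class(classes["R"]))
```

Merging adjacent ranges before building the character class would be one obvious way to shrink the generated regex.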

Hywan · 2015-10-14T11:39:12Z

Why do we need such regular expressions?

boast · 2015-10-14T11:43:50Z

We need to distinguish between the different types of bidirectional
characters. Especially as some characters "change" their directions
depending on context (read: surrounding characters). It's quite complex at
the start, but as soon as you have the groups and get the hang of it, you
can exclude a lot of cases very fast.
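
As an illustration of a character "changing" direction with its surroundings, here is a heavily simplified Python sketch in the spirit of the neutral-resolution rules N1 and N2 of UAX #9. It is an assumption-laden toy, not the algorithm: every character that is not strong (including digits, which the real rules treat separately) is handled as a neutral, and the start and end of the text count as the base direction. `resolve` is a hypothetical name.

```python
import unicodedata


def strong(ch):
    """Return "L" or "R" for strong characters, None for everything else."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "AL"):
        return "R"
    return None


def resolve(text, base="L"):
    """Return one resolved direction letter per character of text."""
    dirs = [strong(ch) for ch in text]
    resolved = list(dirs)
    i, n = 0, len(text)
    while i < n:
        if dirs[i] is not None:
            i += 1
            continue
        # Find the end of this run of non-strong characters.
        j = i
        while j < n and dirs[j] is None:
            j += 1
        before = dirs[i - 1] if i > 0 else base
        after = dirs[j] if j < n else base
        # Same direction on both sides: adopt it. Otherwise: base direction.
        fill = before if before == after else base
        for k in range(i, j):
            resolved[k] = fill
        i = j
    return "".join(resolved)
```

The same `!` resolves to `L` between two Latin letters, to `R` between two Hebrew letters, and to the base direction when it sits between one of each.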

Hywan · 2015-10-14T11:44:25Z

Ok :-).

Hywan added the enhancement label Jan 26, 2015

Hywan self-assigned this Jan 27, 2015

boast linked a pull request Jan 29, 2015 that will close this issue

#21 Added support for embedded directions #23

Open

Hywan added difficulty: hard in progress labels Aug 3, 2015
