From 5d861be15698d9c9f5293a16ab3993ec6e54a4b1 Mon Sep 17 00:00:00 2001 From: Henri Sivonen Date: Wed, 18 Dec 2024 21:09:33 +0200 Subject: [PATCH] Focus normalizer crate doc on functionality and usage (#5917) Fixes #5164 --- components/normalizer/README.md | 74 ++++++++++++++------------------ components/normalizer/src/lib.rs | 58 +++++++++++-------------- 2 files changed, 58 insertions(+), 74 deletions(-) diff --git a/components/normalizer/README.md b/components/normalizer/README.md index 527e12626b6..d1417d3af64 100644 --- a/components/normalizer/README.md +++ b/components/normalizer/README.md @@ -7,47 +7,39 @@ Normalizing text into Unicode Normalization Forms. This module is published as its own crate ([`icu_normalizer`](https://docs.rs/icu_normalizer/latest/icu_normalizer/)) and as part of the [`icu`](https://docs.rs/icu/latest/icu/) crate. See the latter for more details on the ICU4X project. -## Implementation notes - -The normalizer operates on a lazy iterator over Unicode scalar values (Rust `char`) internally -and iterating over guaranteed-valid UTF-8, potentially-invalid UTF-8, and potentially-invalid -UTF-16 is a step that doesn’t leak into the normalizer internals. Ill-formed byte sequences are -treated as U+FFFD. - -The normalizer data layout is not based on the ICU4C design at all. Instead, the normalization -data layout is a clean-slate design optimized for the concept of fusing the NFD decomposition -into the collator. That is, the decomposing normalizer is a by-product of the collator-motivated -data layout. - -Notably, the decomposition data structure is optimized for a starter decomposing to itself, -which is the most common case, and for a starter decomposing to a starter and a non-starter -on the Basic Multilingual Plane. Notably, in this case, the collator makes use of the -knowledge that the second character of such a decomposition is a non-starter. Therefore, -decomposition into two starters is handled by generic fallback path that looks the -decomposition from an array by offset and length instead of baking a BMP starter pair directly -into a trie value. - -The decompositions into non-starters are hard-coded. At present in Unicode, these appear -to be special cases falling into three categories: - -1. Deprecated combining marks. -2. Particular Tibetan vowel sings. -3. NFKD only: half-width kana voicing marks. - -Hopefully Unicode never adds more decompositions into non-starters (other than a character -decomposing to itself), but if it does, a code update is needed instead of a mere data update. - -The composing normalizer builds on the decomposing normalizer by performing the canonical -composition post-processing per spec. As an optimization, though, the composing normalizer -attempts to pass through already-normalized text consisting of starters that never combine -backwards and that map to themselves if followed by a character whose decomposition starts -with a starter that never combines backwards. - -As a difference with ICU4C, the composing normalizer has only the simplest possible -passthrough (only one inversion list lookup per character in the best case) and the full -decompose-then-canonically-compose behavior, whereas ICU4C has other paths between these -extremes. The ICU4X collator doesn't make use of the FCD concept at all in order to avoid -doing the work of checking whether the FCD condition holds. +## Functionality + +The top level of the crate provides normalization of input into the four normalization forms defined in [UAX #15: Unicode +Normalization Forms](https://www.unicode.org/reports/tr15/): NFC, NFD, NFKC, and NFKD. + +Three kinds of contiguous inputs are supported: known-well-formed UTF-8 (`&str`), potentially-not-well-formed UTF-8, +and potentially-not-well-formed UTF-8. Additionally, an iterator over `char` can be wrapped in a normalizing iterator. + +The `uts46` module provides the combination of mapping and normalization operations for [UTS #46: Unicode IDNA +Compatibility Processing](https://www.unicode.org/reports/tr46/). This functionality is not meant to be used by +applications directly. Instead, it is meant as a building block for a full implementation of UTS #46, such as the +[`idna`](https://docs.rs/idna/latest/idna/) crate. + +The `properties` module provides the non-recursive canonical decomposition operation on a per `char` basis and +the canonical compositon operation given two `char`s. It also provides access to the Canonical Combining Class +property. These operations are primarily meant for [HarfBuzz](https://harfbuzz.github.io/) via the +[`icu_harfbuzz`](https://docs.rs/icu_harfbuzz/latest/icu_harfbuzz/) crate. + +Notably, this normalizer does _not_ provide the normalization “quick check” that can result in “maybe” in +addition to “yes” and “no”. The normalization checks provided by this crate always give a definitive +non-“maybe” answer. + +## Examples + +```rust +let nfc = icu_normalizer::ComposingNormalizerBorrowed::new_nfc(); +assert_eq!(nfc.normalize("a\u{0308}"), "ä"); +assert!(nfc.is_normalized("ä")); + +let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd(); +assert_eq!(nfd.normalize("ä"), "a\u{0308}"); +assert!(!nfd.is_normalized("ä")); +``` diff --git a/components/normalizer/src/lib.rs b/components/normalizer/src/lib.rs index d93efbaceec..e86af3c2eca 100644 --- a/components/normalizer/src/lib.rs +++ b/components/normalizer/src/lib.rs @@ -23,47 +23,39 @@ //! This module is published as its own crate ([`icu_normalizer`](https://docs.rs/icu_normalizer/latest/icu_normalizer/)) //! and as part of the [`icu`](https://docs.rs/icu/latest/icu/) crate. See the latter for more details on the ICU4X project. //! -//! # Implementation notes +//! # Functionality //! -//! The normalizer operates on a lazy iterator over Unicode scalar values (Rust `char`) internally -//! and iterating over guaranteed-valid UTF-8, potentially-invalid UTF-8, and potentially-invalid -//! UTF-16 is a step that doesn’t leak into the normalizer internals. Ill-formed byte sequences are -//! treated as U+FFFD. +//! The top level of the crate provides normalization of input into the four normalization forms defined in [UAX #15: Unicode +//! Normalization Forms](https://www.unicode.org/reports/tr15/): NFC, NFD, NFKC, and NFKD. //! -//! The normalizer data layout is not based on the ICU4C design at all. Instead, the normalization -//! data layout is a clean-slate design optimized for the concept of fusing the NFD decomposition -//! into the collator. That is, the decomposing normalizer is a by-product of the collator-motivated -//! data layout. +//! Three kinds of contiguous inputs are supported: known-well-formed UTF-8 (`&str`), potentially-not-well-formed UTF-8, +//! and potentially-not-well-formed UTF-8. Additionally, an iterator over `char` can be wrapped in a normalizing iterator. //! -//! Notably, the decomposition data structure is optimized for a starter decomposing to itself, -//! which is the most common case, and for a starter decomposing to a starter and a non-starter -//! on the Basic Multilingual Plane. Notably, in this case, the collator makes use of the -//! knowledge that the second character of such a decomposition is a non-starter. Therefore, -//! decomposition into two starters is handled by generic fallback path that looks the -//! decomposition from an array by offset and length instead of baking a BMP starter pair directly -//! into a trie value. +//! The `uts46` module provides the combination of mapping and normalization operations for [UTS #46: Unicode IDNA +//! Compatibility Processing](https://www.unicode.org/reports/tr46/). This functionality is not meant to be used by +//! applications directly. Instead, it is meant as a building block for a full implementation of UTS #46, such as the +//! [`idna`](https://docs.rs/idna/latest/idna/) crate. //! -//! The decompositions into non-starters are hard-coded. At present in Unicode, these appear -//! to be special cases falling into three categories: +//! The `properties` module provides the non-recursive canonical decomposition operation on a per `char` basis and +//! the canonical compositon operation given two `char`s. It also provides access to the Canonical Combining Class +//! property. These operations are primarily meant for [HarfBuzz](https://harfbuzz.github.io/) via the +//! [`icu_harfbuzz`](https://docs.rs/icu_harfbuzz/latest/icu_harfbuzz/) crate. //! -//! 1. Deprecated combining marks. -//! 2. Particular Tibetan vowel sings. -//! 3. NFKD only: half-width kana voicing marks. +//! Notably, this normalizer does _not_ provide the normalization “quick check” that can result in “maybe” in +//! addition to “yes” and “no”. The normalization checks provided by this crate always give a definitive +//! non-“maybe” answer. //! -//! Hopefully Unicode never adds more decompositions into non-starters (other than a character -//! decomposing to itself), but if it does, a code update is needed instead of a mere data update. +//! # Examples //! -//! The composing normalizer builds on the decomposing normalizer by performing the canonical -//! composition post-processing per spec. As an optimization, though, the composing normalizer -//! attempts to pass through already-normalized text consisting of starters that never combine -//! backwards and that map to themselves if followed by a character whose decomposition starts -//! with a starter that never combines backwards. +//! ``` +//! let nfc = icu_normalizer::ComposingNormalizerBorrowed::new_nfc(); +//! assert_eq!(nfc.normalize("a\u{0308}"), "ä"); +//! assert!(nfc.is_normalized("ä")); //! -//! As a difference with ICU4C, the composing normalizer has only the simplest possible -//! passthrough (only one inversion list lookup per character in the best case) and the full -//! decompose-then-canonically-compose behavior, whereas ICU4C has other paths between these -//! extremes. The ICU4X collator doesn't make use of the FCD concept at all in order to avoid -//! doing the work of checking whether the FCD condition holds. +//! let nfd = icu_normalizer::DecomposingNormalizerBorrowed::new_nfd(); +//! assert_eq!(nfd.normalize("ä"), "a\u{0308}"); +//! assert!(!nfd.is_normalized("ä")); +//! ``` extern crate alloc;