Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Datagen API LocaleExpansionMode #4629

Closed
sffc opened this issue Feb 24, 2024 · 26 comments
Closed

Implement Datagen API LocaleExpansionMode #4629

sffc opened this issue Feb 24, 2024 · 26 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement)

Comments

@sffc
Copy link
Member

sffc commented Feb 24, 2024

As discussed in #4606 (comment).

If this is done before 1.5, we can remove pub fn with_base_language_handling added in #4606.

CC @robertbastian

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 24, 2024
@sffc sffc added this to the ICU4X 2.0 milestone Feb 24, 2024
@sffc sffc changed the title Implement another new Datagen API in 2.0 Implement another new Datagen API Feb 24, 2024
@sffc sffc changed the title Implement another new Datagen API Implement Datagen API LocaleExpansionMode Feb 24, 2024
@sffc
Copy link
Member Author

sffc commented Feb 24, 2024

@robertbastian Another API proposal.

pub enum LocaleExpansionMode {
    /// Locales always removed if identical to their parent
    Deduplicated,
    /// Locales always retained if explicitly requested or supported
    Exhaustive,
    /// Locales removed if identical to their parent, but not if the parent is `und`
    Hybrid,
    /// Only explicitly requested locales are retained
    Preresolved,
}

pub enum RuntimeFallbackLocation {
    Internal,
    External,
}

impl DatagenDriver {
    pub fn with_locale_expansion_mode(...)
    pub fn with_runtime_fallback_location(...)
}

Note: "hybrid" is repurposed.

These are two separate flags with an interdependent matrix of defaults.

Explicit LEM Exporter SBIF Resolved LEM Resolved RFL
None True Hybrid Internal
None False Exhaustive External
Deduplicated True Deduplicated Internal
Deduplicated False Deduplicated External
Exhaustive True Exhaustive Internal
Exhaustive False Exhaustive External
Hybrid True Hybrid Internal
Hybrid False Hybrid External
Preresolved True Preresolved External
Preresolved False Preresolved External

LEM = LocaleExpansionMode
RFL = RuntimeFallbackLocation
SBIF = Supports Built-In Fallback?

I highlighted cells that might be interesting.

Obviously if there is an explicit RFL, use it (or err if not allowed by SBIF).

A nice part about this design is that the flat enums lend themselves well to CLI.

@robertbastian
Copy link
Member

I think this is also ok, however it looks harder to reach in an incremental change in 1.5. I also think the defaults table adds a lot of complexity that clients will have to understand.

@sffc
Copy link
Member Author

sffc commented Feb 26, 2024

it looks harder to reach in an incremental change in 1.5

We can add the two new functions to DatagenDriver and deprecate the old ones. If both the old functions and new functions are present, we return an error.

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Feb 26, 2024
@sffc sffc added this to icu4x 2.0 Feb 26, 2024
@sffc sffc moved this to Needs discussion to unblock in icu4x 2.0 Feb 26, 2024
@sffc
Copy link
Member Author

sffc commented Feb 27, 2024

@jedel1043 Thoughts on #4629 (comment)?

@sffc
Copy link
Member Author

sffc commented Feb 27, 2024

Also note @robertbastian that this design permits baked data to be built with internal fallback in Preresolved and Exhaustive (formerly Hybrid) modes, which is not currently possible.

@robertbastian
Copy link
Member

robertbastian commented Feb 27, 2024

I'd phrase this all more constructively:

pub enum LocaleInclusionMode {
    /// All requested locales, their ancestors, and their descendants are included, unless
    /// they fall back to a locale with identical data.
    ///
    /// Use this for minimal data size.
    Deduplicated,

    /// All requested locales, their ancestors, and their descendants are included, unless
    /// they fall back to a *non-root* locale with identical data.
    ///
    /// Use this if you need to determine at runtime if a base language has been
    /// included.
    DeduplicatedVariants,

    /// All requested locales, their ancestors, and their descendants are included.
    /// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
    ///
    /// Use this mode if you want your data to work both without and with runtime fallback.
    Preresolved,

    /// Only explicitly requested locales, as well as `und`, are included. 
    /// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
    ///
    /// Use this mode if you are not doing runtime fallback and only need these exact locales
    /// to work.
    ExplicitPreresolved,
}

@jedel1043
Copy link
Contributor

jedel1043 commented Feb 27, 2024

@jedel1043 Thoughts on #4629 (comment)?

That API is also fine. Documentation wise, we could just tell users to not use LocaleExpansionMode::Deduplicated if they need to make requested locale queries, which should be easy to document.

@sffc
Copy link
Member Author

sffc commented Feb 27, 2024

Naming bikeshed, with some help from Gemini:

  • The mode currently known as Runtime
    • Deduplicated
    • Compact
    • CompactInheritance (Emphasizes the space-saving aspect and the hierarchical nature of locale inclusion)
    • CompactForFallback
    • Compressed (Leans into the data size optimization aspect. Implies intelligent handling of locales for efficiency.)
    • CompressedForFallback
    • AdaptiveMinimization (Highlights both the space-saving aspect and the ability to intelligently adapt to supported locales.)
    • Cascading (Metaphor: Flow of data from broader locales to specific ones, much like water in a river system.)
    • MinimalRedundancy (Emphasizes both the inheritance structure and the focus on avoiding duplicate data.)
    • SmartInheritance
    • CompactRedundancy
  • The new mode proposed in Add BaseLanguageHandling option to Datagen #4606
    • DeduplicatedWithBaseLanguages
    • RootAwareInheritance
    • CompactRootAware
    • BaselineInheritance (Emphasizes the root locale as a guaranteed foundation. "Baseline" hints at a fundamental starting point.)
    • RootSensitiveInheritance (Very similar to "Root-Aware Inheritance", but emphasizes sensitivity to the root relationship for accuracy.)
    • Ancestral (Metaphor: Family tree representing locale relationships, with the root as the common origin.)
    • RootAware (Prioritizes a non-root base locale while emphasizing the ability to identify it.)
    • BaseAwareInheritance
  • The mode currently known as Hybrid
    • Comprehensive (Communicates the ability to seamlessly adjust to different locale support scenarios.)
    • ComprehensiveFallback (Focuses on the robust fallback mechanism to ensure data is always available, even with runtime decisions)
    • ComprehensiveInheritance
    • ComprehensiveWithFallback
    • PreresolvedWithFallback
    • Universal (Puts the focus on the fail-safe mechanism. Suggests broad compatibility)
    • ResilientFallback (Puts the emphasis on the fallback mechanism's dependability for consistent data availability.)
    • Adaptive (Metaphor: Network where data is accessible through various connected nodes (locales).)
    • Robust (Plainly states a focus on reliable access to a wide range of locales.)
    • RobustRedundancy
  • The mode currently known as Preresolved
    • Preresolved
    • Explicit
    • ExplicitSelection (Clearly communicates that only specifically requested locales are included, ideal for scenarios with fixed requirements)
    • TargetedSelection (Similar to "Explicit Selection" but slightly more active in its language.)
    • PreciseSelection (Similar to previous options but emphasizes the exactness of the included locales for controlled scenarios.)
    • Focused (Metaphor: Spotlighting only the necessary data points (locales).)
    • Selective

How about:

  1. Compact
  2. Baseline
  3. Adaptive
  4. Selective

@sffc
Copy link
Member Author

sffc commented Feb 27, 2024

As far as functionality goes, I can still see the desire to add modes that include ancestors but not descendants, although @robertbastian pointed out that they can be modeled by locale variants instead. This is not relevant to the mode currently called Preresolved but it applies to the other three.

The situation: currently en-001 implies all of the following: en, en-001, en-GB, and all other variants that inherit directly or indirectly from en-001. It may be desirable to trade coverage for data size and only include the ancestors, like en, but not the descendants, like en-GB.

@sffc
Copy link
Member Author

sffc commented Feb 28, 2024

How about

  1. CompactSelective
  2. CompactExhaustive
  3. BaselineSelective
  4. BaselineExhaustive
  5. RedundantSelective
  6. RedundantExhaustive
  7. Explicit

Or maybe we just go back to toggle switches

  1. Deduplicate: always, except-base, or never
  2. Relatives: all, only ancestors, or none

In tabular form,

DeduplicationMode RelativesMode Name
Always All The thing currently named Runtime
Always Only Ancestors Not currently available; might be useful
Always None Not currently available; probably not useful
Except Base All The thing proposed in #4606
Except Base Only Ancestors Not currently available; might be useful
Except Base None Not currently available; probably not useful
Never All The thing currently named Hybrid
Never Only Ancestors Not currently available; might be useful
Never None The thing currently named Preresolved

@jedel1043
Copy link
Contributor

jedel1043 commented Feb 28, 2024

Or maybe we just go back to toggle switches

  1. Deduplicate: always, except-base, or never
  2. Relatives: all, only ancestors, or none

This sounds nice, since the first alternative would make it increasingly harder to add more fallback options thanks to the exponential growth of fallback types. Though, it could also be a bit unnecessary if we don't expect to have any more fallback options.

@sffc
Copy link
Member Author

sffc commented Mar 15, 2024

New draft:

#[non_exhaustive]
pub enum RuntimeFallbackLocation {
    Internal,
    External,
}

#[non_exhaustive]
pub enum DeduplicationStrategy {
    Maximal,
    #[default]
    RetainBaseLanguages,
    NoDeduplication,
}

#[non_exhaustive]
enum Locales {
    WithFallback(LocalesWithFallback),
    WithoutFallback(LocalesWithoutFallback),
}

#[non_exhaustive]
pub struct LocalesWithFallback {
    pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
    pub deduplication_strategy: DeduplicationStrategy,
    /// private: set via constructor
    locales: Vec<LanguageWithExpansion>,
}

impl LocalesWithFallback {
    pub fn for_locales(impl IntoIterator<Item = LanguageWithExpansion>) -> Self;
}

#[non_exhaustive]
pub struct LocalesWithoutFallback  {
    pub locales: Vec<LanguageIdentifier>,
}

impl FromIterator<Item = LanguageIdentifier> for LocalesWithoutFallback {}
/// defaults to LocaleWithExpansion::with_variants
impl FromIterator<Item = LanguageIdentifier> for LocalesWithFallback {}
impl FromIterator<Item = LanguageWithExpansion> for LocalesWithFallback {}

#[derive(Debug, PartialEq, Clone, Hash)]
struct LanguageWithExpansion {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

/// agrees with strings
impl From<LanguageIdentifier> for LanguageWithExpansion {}

impl FromStr for LanguageExpansion {}
impl Display for LanguageExpansion

impl LanguageWithExpansion {
    // en-US
    pub fn with_variants(L) -> Self {
        Self { langid, true, true } 
    }
    
    // ^en-US
    pub fn without_variants(L) -> Self {
        Self { langid, true, false } 
    }

    // @en-US
    pub fn preresolved(L) -> Self {
        Self {
            langid, false, false
        }
    }
}

// resolve this against the exporter like
foo.runtime_fallback_location.unwrap_or_else(|| 
    if exporter.supports_runtime_fallback() {
        RuntimeFallbackLocation::Internal
    } else { 
        RuntimeFallbackLocation::External 
    }
)

Example call site:

driver2.set_locales(Locales::WithFallback({
    let mut x = LocalesWithFallback::for_locales([
        LanguageWithExpansion::without_variants(langid!("de")),
        langid!("en").into(),
    ]);
    x.deduplication_strategy = DeduplicationStrategy::Maximal;
    x
})).export(...);

CLI:

icu4x-datagen --keys all --locales en-US,de-AT --without-fallback
icu4x-datagen --keys all --locales ^en,de-AT --fallback-runtime-location --fallback-deduplication-strategy

# Language expansion (@, ^) requires fallback

LGTM: @sffc @robertbastian

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Mar 15, 2024
@sffc sffc moved this from Needs discussion to unblock to Unclaimed for sprint in icu4x 2.0 Mar 15, 2024
@sffc
Copy link
Member Author

sffc commented Mar 19, 2024

I'd still like LanguageWithExpansion to be an enum if possible. Brainstorm:

pub enum LocaleFamily {
    Maximal(LanguageIdentifier),
    Minimal(LanguageIdentifier),
    Preresolved(LanguageIdentifier),
}

@sffc
Copy link
Member Author

sffc commented Mar 20, 2024

Suggestion from @robertbastian: in the output, print clearly that how the DeduplicationStrategy and RuntimeFallbackLocation are being determined. Make it a log::warn to encourage users to be explicit. Note that the CLI can always set explicit options on the API, so the messages can be different.

API:

log::warn!("Set runtime_fallback_location = {:?} due to the supplied exporter", ...runtime_fallback_location);

log::warn!("Set deduplication_strategy = {:?} due to runtime_fallback_location", ...deduplication_strategy);

CLI:

log::warn!("Set --runtime-fallback-location={} due to --format=mod", args.runtime_fallback_location);

log::warn!("Set --deduplication-strategy={} due to --runtime-fallback-location", args.deduplication_strategy);

Example CLI invocation:

icu4x-datagen --keys ... --locales ... --deduplication-strategy maximal --runtime-fallback-location internal

Error CLI invocations:

icu4x-datagen --keys ... --locales ... --without-fallback --deduplication-strategy none

LGTM: @robertbastian @sffc

@sffc
Copy link
Member Author

sffc commented Mar 20, 2024

Current:

pub struct LocaleWithExpansion {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

impl LocaleWithExpansion {
    /// en-US
    pub fn with_variants(langid: LanguageIdentifier) -> Self
    /// ^en-US
    pub fn without_variants(langid: LanguageIdentifier) -> Self
    /// @en-US
    pub fn preresolved(langid: LanguageIdentifier) -> Self
}

impl LocalesWithFallback {
    pub fn for_all_locales() -> Self
}

Proposed:

#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct FallbackOptions {
    /// If not set, determined by `exporter`
    pub runtime_fallback_location: Option<...>,
    /// If not set, determined by `runtime_fallback_location`
    pub deduplication_strategy: Option<...>,
}

#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct NoFallbackOptions {}

pub struct LocaleFamily {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

impl LocaleFamily {
    /// en-US
    ///
    /// For example, the family ::with_descendants("en-001") contains "und", "en" (ancestors), "en-001", "en-GB", "en-ZA" (descendants)
    pub fn with_descendants(langid: LanguageIdentifier) -> Self
    /// ^en-US
    ///
    /// For example, the family ::without_descendants("en-001") contains "und", "en" (ancestors), "en-001"
    pub fn without_descendants(langid: LanguageIdentifier) -> Self
    /// @en-US
    pub fn single(langid: LanguageIdentifier) -> Self
    
    /// The locale family containing all locales, equivalent to `::with_descendants("und")`.
    pub fn all() -> Self
}

impl DatagenDriver {
    pub fn with_locales[_and_fallback](locales: impl FromIterator<Item = LocaleFamily>, options: FallbackOptions) -> Self;
    pub fn with_locales_no_fallback(locales: impl FromIterator<Item = LanguageIdentifier>, options: NoFallbackOptions) -> Self;
}

LGTM: @sffc @robertbastian

@sffc
Copy link
Member Author

sffc commented Mar 22, 2024

@robertbastian Currently, passing --locales und results in und being included in the output, but it does not imply descendants. This is incompatible with the proposal for --locales und to start meaning "all locales that are descended from und" (i.e. all locales).

A few options:

  1. Adopt the new behavior and special-case the old behavior internally somehow
  2. Make a special LocaleWithExpansion constructor for the wildcard, on the CLI represented as '*' (users would need to quote the asterisk on the CLI), and retain the current behavior for und

I lean toward 2 because I think --locales und to mean "all locales" is surprising more than it is clever.

@robertbastian
Copy link
Member

Option 2 sounds good. We already have full on the CLI, so I think we should continue using that instead of *

@sffc
Copy link
Member Author

sffc commented Mar 25, 2024

The new API is more verbose. Old:

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales(SELECTED_LOCALES)
            .with_fallback_mode(FallbackMode::Hybrid)

New:

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales_and_fallback(
                SELECTED_LOCALES
                    .into_iter()
                    .map(LocaleFamily::with_descendants),
                {
                    let mut options = FallbackOptions::default();
                    options.deduplication_strategy = Some(DeduplicationStrategy::NoDeduplication);
                    options
                },
            )

Should we instead flatten this to

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales(SELECTED_LOCALES) // implies with_descendants
            .with_fallback()
            .with_deduplication(DeduplicationStrategy::NoDeduplication)

Full API

impl DatagenDriver {
    /// Appends these locales to the selected locale list. If fallback is enabled, implies with_descendants
    pub fn with_locales(self, impl IntoIterator<LangaugeIdentifier>) { ... }

    /// Appends these locales to the selected locale list.
    pub fn with_locale_families(self, impl IntoIterator<LocaleFamily>) { ... }

    /// Either without_fallback or with_fallback must be called.
    /// If without_fallback is used, an error will occur if any of these functions were also used:
    /// - with_locale_families
    /// - with_runtime_fallback_location
    /// - with_deduplication
    pub fn without_fallback(self) { ... }

    /// Either without_fallback or with_fallback must be called.
    pub fn with_fallback(self) { ... }

    /// Only valid in combination with with_fallback
    /// Default dependent on the exporter
    pub fn with_runtime_fallback_location(self, RuntimeFallbackLocation) { ... }

    /// Only valid in combination with with_fallback
    /// Default dependent on the runtime fallback location
    pub fn with_deduplication(self, DeduplicationStrategy) { ... }
}

@robertbastian
Copy link
Member

It depends what you're doing. If we have the langid -> locale family conversion implementations, and you use default options, then the invocation will be

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales_and_fallback(SELECTED_LOCALES, Default::default())

which is actually more concise than currently. The whole idea of this design was to model the relationship of all the different flags better, which flattening does not.

@sffc
Copy link
Member Author

sffc commented Mar 25, 2024

How about

impl DatagenDriver {
    pub fn with_langids_and_fallback(self,
        impl IntoIterator<LangaugeIdentifier>, options: FallbackOptions) { ... }
    pub fn with_locale_families_and_fallback(self,
        impl IntoIterator<LocaleFamily>, options: FallbackOptions) { ... }
    pub fn with_langids_no_fallback(self,
        impl IntoIterator<LangaugeIdentifier>, options: NoFallbackOptions) { ... }
}

@sffc
Copy link
Member Author

sffc commented Mar 25, 2024

If we have the langid -> locale family conversion implementations

Can you clarify how you envision accomplishing this?

Without overloading, I can't define a function that takes "either a LanguageIdentifier iterator or a LocaleFamily iterator".

@sffc
Copy link
Member Author

sffc commented Mar 25, 2024

On LocaleFamily: currently we have

  • with_descendants(en-001) = und, en, en-001, en-GB
  • without_descendants(en-001) = und, en, en-001
  • single(en-001) = en-001

I suggested include_und be its own option, but @robertbastian prefers modeling it with LocaleFamily, so now we essentially need to add

  • with_descendants_without_und(en-001) = en, en-001, en-GB
  • without_descendants_or_und(en-001) = en, en-001

I might be okay modeling this as

  • without_ancestors(en-001) = en-001, en-GB

and just saying that you should always use base languages with this function.

@robertbastian
Copy link
Member

Can you clarify how you envision accomplishing this?

I thought we had this before, but maybe it's not possible with the current design.

I don't think conciseness is the ultimate goal. Having many simple methods makes the API way more complicated, because it cannot be followed in a decision-tree fashion. With the two methods with_locales and with_locales_no_fallback, it's clear that that is the first decision, and then the types inside let you take the next decisions. If you have with_locales, with_locale_families, without_fallback, with_fallback, with_deduplication, with_runtime_fallback_location (plus all the other methods on the driver), you have no idea where to start.

Regarding LocaleFamily, I think with_descendants_without_und and without_descendants_or_und are confusing. und is not that special, and we cannot assume that there is always und, base language, regional language; some chains will be longer, and then without_ancestors is the right tool.

sffc added a commit that referenced this issue Mar 26, 2024
More testing to make sure #4629 doesn't break people
@sffc
Copy link
Member Author

sffc commented Mar 26, 2024

@robertbastian I want to propose the following narrow changes to the API in order to improve brevity... these would be impacting 2.0, not 1.5:

  1. Change with_locales_and_fallback(...) to with_locale_families(...) instead of with_locales(...)
  2. Add with_language_identifiers(...) that takes impl IntoIterator<Item = LanguageIdentifier>, FallbackOptions
  3. Un-deprecate with_all_locales(...) and make it take FallbackOptions

@robertbastian
Copy link
Member

Not opposed in general, I'd like to make this decision once we have a final API. I don't want too many methods on the driver, and these don't add any functionality.

@sffc
Copy link
Member Author

sffc commented Mar 27, 2024

This issue is mostly done in #4710. Known follow-ups in the 1.5 timeframe:

  • Add BaseLanguageHandling: Implement DeduplicationStrategy::RetainBaseLanguages #4836
  • Add a way to remove und via LocaleFamily
  • Make a standalone PR to add with_language_identifiers(impl IntoIterator, FallbackOptions)
  • Consider making LocaleFamily::full() a const item
  • Consider refactoring LanguageIdentifiersOrLocaleFamilies or LocaleInclude in the internal intermediate CLI config code

All of these PRs can be standalone and should await @robertbastian's approval.

sffc added a commit to sffc/omnicu that referenced this issue Apr 24, 2024
@sffc sffc moved this from Unclaimed for sprint to Being worked on in icu4x 2.0 Apr 25, 2024
sffc added a commit that referenced this issue Apr 30, 2024
Part of #4629

I rewrote `select_locales_for_key` and I think it is a lot cleaner and
comprehensible now.

---------

Co-authored-by: Robert Bastian <[email protected]>
sffc added a commit that referenced this issue May 2, 2024
@sffc sffc closed this as completed Sep 17, 2024
@github-project-automation github-project-automation bot moved this from Being worked on to Done in icu4x 2.0 Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-medium Size: Less than a week (larger bug fix or enhancement)
Projects
Status: Done
Development

No branches or pull requests

3 participants