Add multi-source data principle #5763

Open
17 changes: 17 additions & 0 deletions documents/design/principles.md
@@ -58,6 +58,23 @@ ICU4C/ICU4J exposes certain pieces of data through user-facing APIs such as Date

Runtime customizability of locale data can sometimes come at a performance or memory cost.

## Locale data from multiple sources works seamlessly

*What:* If data is available for a particular constructor and locale, the resulting behavior should not change based on where the data was sourced, with a narrow exception for data that primarily impacts performance characteristics.
@robertbastian (Member) commented on Nov 3, 2024:
This basically says that behaviour can never change, even if CLDR data changes. If I'm sourcing from a more recent CLDR, behaviour should often change.

Member Author replied:
Yeah good point; that obviously wasn't the intention. Is this better?


*Why:* Locale data can be loaded from multiple sources: for example, some data might be baked into the binary, some might be loaded from the operating system, and some might be downloaded on demand in the form of language packs.

Examples that violate this policy:

- A datagen option to remove less-used time zones from a data payload that contains time zone display names for all time zones. Some data sources might make different decisions, resulting in different behavior for the end user depending on how the data loading pipeline was configured and which data was loaded first.\*

Examples that are consistent with this policy:

- A datagen option to tweak the bounds of pre-calculated Chinese year offsets exported into a payload, causing more or fewer years to fall back to expensive calculations at runtime. This impacts performance, but the resulting behavior is the same.
- A datagen option to remove time zone names from a locale that equal the root time zone names, and a corresponding runtime code change to check both payloads. This does not normally impact behavior as observed by the user; however, it could still impact behavior in edge cases involving different sources using different CLDR versions, so the ICU4X-WG should discuss and make an informed decision.
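
The first consistent example can be illustrated with a minimal sketch. It is not ICU4X code: `OffsetCache`, `compute_offset_slow`, and `year_offset` are hypothetical names, and the "slow" formula is a placeholder rather than real Chinese calendar arithmetic. The point is only that the exported bounds change which path is taken, not the result.

```rust
/// Hypothetical stand-in for an expensive calendrical calculation.
/// The formula is a placeholder, not real Chinese calendar arithmetic.
fn compute_offset_slow(year: i32) -> u16 {
    (year.rem_euclid(19) * 11 % 30) as u16 + 20
}

/// Hypothetical payload: precomputed offsets for a contiguous range of years.
struct OffsetCache {
    first_year: i32,
    offsets: Vec<u16>,
}

impl OffsetCache {
    /// Plays the role of the datagen option: choose which bounds to export.
    fn build(first_year: i32, last_year: i32) -> Self {
        let offsets = (first_year..=last_year).map(compute_offset_slow).collect();
        Self { first_year, offsets }
    }

    /// Returns the cached offset if the year lies inside the exported bounds.
    fn get(&self, year: i32) -> Option<u16> {
        let index = i64::from(year) - i64::from(self.first_year);
        if index < 0 {
            return None;
        }
        self.offsets.get(index as usize).copied()
    }
}

/// Looks up the offset, falling back to the slow path on a cache miss.
/// Callers cannot observe which path produced the value.
fn year_offset(cache: &OffsetCache, year: i32) -> u16 {
    cache.get(year).unwrap_or_else(|| compute_offset_slow(year))
}
```

Under these assumptions, a payload built with `OffsetCache::build(2000, 2010)` and one built with `OffsetCache::build(1900, 2100)` give the same `year_offset` for every year; only the number of slow-path calls differs.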

\* If such an optimization is desired, consider using two data payloads, one for "core" and one for "extended", instead of a datagen option. Alternatively, restructure the data to use data marker attributes, which can be safely filtered by datagen.
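
One way to picture the "core" plus "extended" alternative from the footnote is sketched below. This is an illustration under stated assumptions, not the ICU4X data model: `TimeZoneNames`, `display_name`, and the display strings are hypothetical. Each payload is complete for its own set of time zones, so it does not matter which source supplied it; the only variable is whether the extended payload was loaded at all, which is an explicit choice rather than a hidden datagen setting.

```rust
use std::collections::BTreeMap;

/// Hypothetical container for two separately loadable payloads.
struct TimeZoneNames {
    /// Always present; covers the commonly used time zones.
    core: BTreeMap<&'static str, &'static str>,
    /// Optional; covers the less-used time zones when loaded.
    extended: Option<BTreeMap<&'static str, &'static str>>,
}

/// Checks the core payload first, then the extended payload if it is loaded.
fn display_name(names: &TimeZoneNames, tz_id: &str) -> Option<&'static str> {
    names
        .core
        .get(tz_id)
        .copied()
        .or_else(|| names.extended.as_ref()?.get(tz_id).copied())
}

/// Builds illustrative payloads; the display strings are made up for the sketch.
fn sample_names(with_extended: bool) -> TimeZoneNames {
    let core = BTreeMap::from([("America/Los_Angeles", "Pacific Time")]);
    let extended = BTreeMap::from([("Antarctica/Troll", "Troll Time")]);
    TimeZoneNames {
        core,
        extended: with_extended.then(|| extended),
    }
}
```

With this split, `display_name(&sample_names(false), "Antarctica/Troll")` is `None` because the extended payload is absent as a whole, not because one particular source trimmed it from a shared payload.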

## Modular Code and Data with static analysis

*What:* Both the code and the data should be written so that you only bring what you need. Code and data should be modular not only on a "class" level, but also within a class, such that you don't carry code and data for a feature of a class that you aren't using. Code and data slicing should be able to be determined using static code analysis. We should be able to look at the functions being called, and from that, build exactly the code and data bundle that the app needs.