-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new Datagen API and CLI with FallbackOptions #4710
Conversation
This PR doesn't export any APIs, but it plumbs the old APIs through the new APIs with no detectable output diffs. I can merge this now or keep working in this PR. |
provider/datagen/src/driver.rs
Outdated
return false; | ||
}; | ||
let mut it = config.locales.iter(); | ||
matches!((it.next(), it.next()), (Some(lid), None) if lid == &LocaleWithExpansion::wildcard()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make wildcard()
an associated const
you can match against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I'd rather keep it a function because it's more flexible.
provider/datagen/src/driver.rs
Outdated
pub runtime_fallback_location: Option<RuntimeFallbackLocation>, | ||
pub deduplication_strategy: Option<DeduplicationStrategy>, | ||
/// private: set via constructor | ||
locales: HashSet<LocaleWithExpansion>, | ||
pub und_inclusion: Option<UndInclusion>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you special-casing und
? It will either be included due to fallback, or the user can include it through @und
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question; I wanted to make it easier to exclude und
, which is desirable when building language packs. (Frank was having this problem building a language pack for Chrome.) Strictly speaking, this can be modeled as a LocaleFamily
variant, such as "ancestors and descendants except for und". However, I don't see a super clean way to handle it. The options would be:
--locales en-001
impliesund
,en
, ... and the user needs to specify--without-und
to disableund
(what this changelist implies)- Pros: Puts the special case of
und
as its own flag for users who need this behavior - Cons: Entanglement of options
- Pros: Puts the special case of
--locales en-001
impliesen
, ..., (removeund
from LocaleFamily with_ancestors) and--with-und
(default behavior) includesund
, and the user can specify--without-und
to disable the inclusion ofund
- Pros: A cleaner model with less conflict between options
- Cons: Still an extra flag to imply this behavior
--locales en-001
impliesund
,en
, ..., and a new LocaleFamily type_en-001
impliesen
, ...- Pros: Clean model with all locale resolution handling in the same mechanism
- Cons: Every locale needs to be prefixed with
_
in order to avoid includingund
. Need to also invent a syntax for the "ancestors, no descendants, no und" mode.
--locales en-001
impliesund
,en
, ..., and a new LocaleFamily type!und
removesund
from the selection- Pros: All locale resolution happens in the same option. New include/exclude syntax could be used to model other things.
- Cons: Include/exclude syntax could be error-prone. We now have an ordered list of rules instead of an unordered set of families.
Do you see anything else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting we'd do something like 3. Including whole locale families but excluding und
is a niche (and risky) enough behaviour that introducing a new locale family symbol shouldn't be the main problem. If the main use case is language packs, there'd usually only be a single und-less family per invocation anyway.
I don't understand why we need a "ancestors, no descendants, no und" mode. und
is an ancestor, and it's not special. If you're building en
language packs, you don't want to duplicate en-001
the same way you don't want to duplicate und
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do in a follow-up
provider/datagen/src/driver.rs
Outdated
pub runtime_fallback_location: Option<RuntimeFallbackLocation>, | ||
pub deduplication_strategy: Option<DeduplicationStrategy>, | ||
/// private: set via constructor | ||
locales: HashSet<LocaleWithExpansion>, | ||
pub und_inclusion: Option<UndInclusion>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question; I wanted to make it easier to exclude und
, which is desirable when building language packs. (Frank was having this problem building a language pack for Chrome.) Strictly speaking, this can be modeled as a LocaleFamily
variant, such as "ancestors and descendants except for und". However, I don't see a super clean way to handle it. The options would be:
--locales en-001
impliesund
,en
, ... and the user needs to specify--without-und
to disableund
(what this changelist implies)- Pros: Puts the special case of
und
as its own flag for users who need this behavior - Cons: Entanglement of options
- Pros: Puts the special case of
--locales en-001
impliesen
, ..., (removeund
from LocaleFamily with_ancestors) and--with-und
(default behavior) includesund
, and the user can specify--without-und
to disable the inclusion ofund
- Pros: A cleaner model with less conflict between options
- Cons: Still an extra flag to imply this behavior
--locales en-001
impliesund
,en
, ..., and a new LocaleFamily type_en-001
impliesen
, ...- Pros: Clean model with all locale resolution handling in the same mechanism
- Cons: Every locale needs to be prefixed with
_
in order to avoid includingund
. Need to also invent a syntax for the "ancestors, no descendants, no und" mode.
--locales en-001
impliesund
,en
, ..., and a new LocaleFamily type!und
removesund
from the selection- Pros: All locale resolution happens in the same option. New include/exclude syntax could be used to model other things.
- Cons: Include/exclude syntax could be error-prone. We now have an ordered list of rules instead of an unordered set of families.
Do you see anything else?
provider/datagen/src/driver.rs
Outdated
pub runtime_fallback_location: Option<RuntimeFallbackLocation>, | ||
pub deduplication_strategy: Option<DeduplicationStrategy>, | ||
/// private: set via constructor | ||
locales: HashSet<LocaleWithExpansion>, | ||
pub und_inclusion: Option<UndInclusion>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting we'd do something like 3. Including whole locale families but excluding und
is a niche (and risky) enough behaviour that introducing a new locale family symbol shouldn't be the main problem. If the main use case is language packs, there'd usually only be a single und-less family per invocation anyway.
I don't understand why we need a "ancestors, no descendants, no und" mode. und
is an ancestor, and it's not special. If you're building en
language packs, you don't want to duplicate en-001
the same way you don't want to duplicate und
.
/// Sets this driver to generate the given locales assuming runtime fallback. | ||
/// | ||
/// Use the [`langid!`] macro from the prelude to create an | ||
/// explicit list, or [`DatagenProvider::locales_for_coverage_levels`] for CLDR coverage levels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...or create language families
Plan for this PR:
LGTM: @robertbastian |
.map(|l| l.parse::<LanguageIdentifier>()) | ||
.map(|l| { | ||
l.parse::<LanguageIdentifier>() | ||
.map(LocaleFamily::with_descendants) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you're doing this in a follow-up: here, as on the main CLI, we should parse directly to LocaleFamily
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reminder. Working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you fixed it in datagen but not here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no you didn't change it in datagen, that's why CI is failing
provider/datagen/src/driver.rs
Outdated
let mut sorted_locales = | ||
locales.iter().map(ToString::to_string).collect::<Vec<_>>(); | ||
sorted_locales.sort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found something cool for this: https://internals.rust-lang.org/t/add-a-vec-sorted-function/11948/3. We already have itertools
in datagen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it returns std::vec::IntoIter
, so it still does the same thing under the hood, just a bit nicer syntax and a bit more prone to error because if you call any other iterator methods like .map()
then you end up double-allocating. It also seems a bit risky that std::vec::IntoIter::collect
uses specialization. But I can do this if you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just think it's nicer because it avoids an intermediate variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not blocking
@robertbastian CI is passing as of a245cbb. I started adding the LocaleFamily string parsing logic (thanks for the reminder) but don't have it quite ready yet. Do you approve this pending the remainder of the internal plumbing for LocaleFamily in the CLI? Also, if you prefer I can roll back to a245cbb (or c5ddea3, which has some renames and hopefully passes CI) and merge that and put the LocaleFamily parsing in a follow-up. |
I approve at c5ddea3, and also the rest if you get CI to pass. |
@@ -94,7 +94,7 @@ fn main() -> eyre::Result<()> { | |||
let locales_intermediate: LanguageIdentifiersOrLocaleFamilies = match config.locales { | |||
config::LocaleInclude::All => LocaleFamilies(vec![LocaleFamily::full()]), | |||
config::LocaleInclude::None => LanguageIdentifiers(vec![]), | |||
config::LocaleInclude::Explicit(set) => LanguageIdentifiers(set.into_iter().collect()), | |||
config::LocaleInclude::Explicit(set) => LocaleFamilies(set.into_iter().collect()), | |||
config::LocaleInclude::CldrSet(levels) => LanguageIdentifiers( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: I think we've been treating cldr sets as families
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This locales_intermediate
just preserves the type things come in as. They get converted to LocaleFamilies
in with_fallback and they stay as LanguageIdentifier
s in without_fallback.
LocaleFamilies(lfs) => lfs | ||
.into_iter() | ||
.map(|family| family.write_to_string().parse()) | ||
.collect::<Result<Vec<LanguageIdentifier>, icu_locid::ParserError>>()?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This logic is a bit strange but I think it gets the job done. In Clap we parse the list to a Vec<LocaleFamily>
, and then here I re-parse it as a Vec<LanguageIdentifier>
. I guess another way to do this would be to parse to a Vec<String>
in Clap and then pick the concrete type here.
This is ready to land. @robertbastian last approved it modulo making CI/tests pass, which is now the case. |
#4629