Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement without_descendants and without_ancestors in datagen #4846

Merged
merged 29 commits into from
Apr 30, 2024

Conversation

sffc
Copy link
Member

@sffc sffc commented Apr 25, 2024

Part of #4629

I rewrote select_locales_for_key and I think it is a lot cleaner and comprehensible now.

@sffc sffc marked this pull request as ready for review April 25, 2024 02:08
@sffc sffc requested review from robertbastian, Manishearth and a team as code owners April 25, 2024 02:08
@sffc sffc removed request for a team and Manishearth April 25, 2024 02:08
@@ -23,7 +23,7 @@
//! DatagenDriver::new()
//! .with_keys([icu_provider::hello_world::HelloWorldV1Marker::KEY])
//! .with_all_locales()
//! .export(&DatagenProvider::new_latest_tested(), exporter)
//! .export(&icu_provider::hello_world::HelloWorldProvider, exporter)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change doesn't impact behavior; it just makes it so that this crate doesn't need to depend on the provider feature of icu_datagen

provider/blob/tests/test_versions.rs Show resolved Hide resolved
static VALUES: [&<icu::collator::provider::CollationMetadataV1Marker as icu_provider::DataMarker>::Yokeable; 87usize] = [&AF, &AM, &AR, &AR, &AR, &AR, &AM, &AR, &AF, &AR, &AF, &AM, &AF, &AF, &DA, &AF, &AF, &AM, &AF, &AF, &AF, &AF, &AR, &AF, &AF, &AF, &AF, &FR_CA, &AF, &AF, &AR, &AF, &AM, &AR, &AR, &AF, &AF, &AR, &AF, &AF, &AR, &AM, &AR, &AR, &AR, &AR, &AR, &AR, &AR, &AM, &LT, &AF, &AR, &AR, &AM, &AR, &DA, &AR, &AM, &AF, &AR, &AR, &AF, &AR, &AF, &AM, &AR, &AF, &AF, &AF, &AR, &AF, &AR, &AR, &TH, &AF, &AF, &AF, &AR, &AR, &UND, &AR, &AF, &VI, &AF, &AF, &AR];
static KEYS: [&str; 87usize] = ["af", "am", "ar", "as", "az", "be", "bg", "bn", "br", "bs", "ceb", "chr", "cs", "cy", "da", "de-u-co-phonebk", "dsb", "el", "en-US-posix", "eo", "es", "et", "fa", "ff-Adlm", "fi", "fil", "fo", "fr-CA", "fy", "gl", "gu", "ha", "he", "hi", "hr", "hsb", "hu", "hy", "ig", "is", "ja", "ka", "kk", "km", "kn", "ko", "kok", "ku", "ky", "lo", "lt", "lv", "mk", "ml", "mn", "mr", "mt", "my", "ne", "no", "or", "pa", "pl", "ps", "ro", "ru", "si", "sk", "sl", "sq", "sr", "sv", "ta", "te", "th", "tk", "to", "tr", "ug", "uk", "und", "ur", "uz", "vi", "wo", "yo", "zh"];
static VALUES: [&<icu::collator::provider::CollationMetadataV1Marker as icu_provider::DataMarker>::Yokeable; 88usize] = [&AF, &AM, &AR, &AR, &AR, &AR, &AM, &AR, &AF, &AR, &AF, &AM, &AF, &AF, &DA, &AF, &AF, &AF, &AM, &AF, &AF, &AF, &AF, &AR, &AF, &AF, &AF, &AF, &FR_CA, &AF, &AF, &AR, &AF, &AM, &AR, &AR, &AF, &AF, &AR, &AF, &AF, &AR, &AM, &AR, &AR, &AR, &AR, &AR, &AR, &AR, &AM, &LT, &AF, &AR, &AR, &AM, &AR, &DA, &AR, &AM, &AF, &AR, &AR, &AF, &AR, &AF, &AM, &AR, &AF, &AF, &AF, &AR, &AF, &AR, &AR, &TH, &AF, &AF, &AF, &AR, &AR, &UND, &AR, &AF, &VI, &AF, &AF, &AR];
static KEYS: [&str; 88usize] = ["af", "am", "ar", "as", "az", "be", "bg", "bn", "br", "bs", "ceb", "chr", "cs", "cy", "da", "de-AT-u-co-phonebk", "de-u-co-phonebk", "dsb", "el", "en-US-posix", "eo", "es", "et", "fa", "ff-Adlm", "fi", "fil", "fo", "fr-CA", "fy", "gl", "gu", "ha", "he", "hi", "hr", "hsb", "hu", "hy", "ig", "is", "ja", "ka", "kk", "km", "kn", "ko", "kok", "ku", "ky", "lo", "lt", "lv", "mk", "ml", "mn", "mr", "mt", "my", "ne", "no", "or", "pa", "pl", "ps", "ro", "ru", "si", "sk", "sl", "sq", "sr", "sv", "ta", "te", "th", "tk", "to", "tr", "ug", "uk", "und", "ur", "uz", "vi", "wo", "yo", "zh"];
Copy link
Member Author

@sffc sffc Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only behavioral change from this PR. de-AT-u-co-phonebk becomes a selected locale because we pull extensions up into all supported locales, not just explicit locales.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does reveal an interesting attribute of the deduplicator behavior: de-AT-u-co-phonebk is retained because it inherits from de-AT, which has different data, before it inherits from de-u-co-phonebk. However, de-AT is deduplicated because it inherits from de. This means that we could deduplicate de-AT-u-co-phonebk in a second pass. 🤔


// Fill in missing extensions and aux keys from parent locales,
// and calculate which langids are ancestors and descendants.
for current_langid in locales_map.keys().cloned().collect::<Vec<_>>() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of cloning going on here because otherwise I would need multiple mutable borrows of the HashMap at the same time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be clearer to build a new one instead of mutating

Copy link
Member

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Algorithm looks good, just would like a bit less mutation going on.

provider/blob/tests/test_versions.rs Show resolved Hide resolved
Self {
langid: Some(langid),
include_ancestors: true,
include_descendants: false,
}
}

/// The family containing all descendants of the selected locale.
///
/// This family is primarily useful if the root locale is not desired.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say it's the primary use case, your example is another (more general) one.

is_selected: bool,
data_locales: HashSet<DataLocale>,
}
let mut locales_map: HashMap<LanguageIdentifier, LocalesMapValue> = Default::default();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: writing this as

Suggested change
let mut locales_map: HashMap<LanguageIdentifier, LocalesMapValue> = Default::default();
let mut locales_map = HashMap::<LanguageIdentifier, LocalesMapValue>::default();

makes the code more maintainable, as that expression is context-independent.

}
if explicit.contains(&locale) {
return true;
for family in families {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: put this all in an else, as it doesn't happen when families is empty

if families.is_empty() {
// If no locales are selected but fallback is enabled, select the root locale
let value = locales_map.entry(LanguageIdentifier::UND).or_default();
value.is_selected = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not big fan of keeping state in the locales_map, would prefer building a new collection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but the code was already mutating the map in locales_map.retain; I'm just mutating it a bit more. I don't really see the point in re-building the structure. When there aren't aux keys or extensions at play, expensive mutations are not necessary, so we'd basically just end up cloning the map a few times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this code I would rather optimize for readability than for performance. Even with the mutation you're doing a lot of cloning. It's unclear to me as a reader how all the mutations interact, intermediate immutable steps would help here.


// Fill in missing extensions and aux keys from parent locales,
// and calculate which langids are ancestors and descendants.
for current_langid in locales_map.keys().cloned().collect::<Vec<_>>() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be clearer to build a new one instead of mutating

Comment on lines 994 to 995
.map(|family| family.include_ancestors)
.unwrap_or(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is really worth, performance wise, to have family be an optional borrow, instead of just creating the families for all LanguageIdentifiers? This false is the same as single's, but here it's just a false literal.

Same below for include_descendants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The presence of the family is how I indicate that the langid was explicit. If it is None, the langid is supported but not explicit.

@sffc sffc requested a review from robertbastian April 30, 2024 00:45
@sffc
Copy link
Member Author

sffc commented Apr 30, 2024

I did a bunch more refactoring to reduce the amount of mutation. I agree it makes it easier to follow. Hope you like it :)

robertbastian
robertbastian previously approved these changes Apr 30, 2024
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Outdated Show resolved Hide resolved
provider/datagen/src/driver.rs Show resolved Hide resolved
@sffc sffc requested a review from robertbastian April 30, 2024 16:28
LocalesWithOrWithoutFallback::WithoutFallback { langids: locales } => {
let mut sorted_locales = locales
LocalesWithOrWithoutFallback::WithoutFallback { langids } => {
let mut sorted_locales = langids
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorted_langids?

@sffc sffc merged commit 006e2c2 into unicode-org:main Apr 30, 2024
30 checks passed
@sffc sffc deleted the without-ancestors branch April 30, 2024 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants