Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split numbering systems out of decimal data #5822

Merged
merged 10 commits into from
Nov 15, 2024

Conversation

Manishearth
Copy link
Member

@Manishearth Manishearth commented Nov 14, 2024

Fixes #5818

Before:

 decimal/symbols@2, <lookup>, 1316B, 252 identifiers
 decimal/symbols@2, <total>, 4308B, 2436B, 49 unique payloads

After:

 decimal/digits@1, <lookup>, 207B, 27 identifiers
 decimal/digits@1, <total>, 1080B, 1060B, 27 unique payloads
 decimal/symbols@2, <lookup>, 804B, 184 identifiers
 decimal/symbols@2, <total>, 1749B, 881B, 31 unique payloads

Saving ~1.5kB, a good third of the data size. A lot of the wins are just in deduplication.

I'm also going to try moving the tinystr into the VarZeroVec and seeing what happens.

I may also try and store the digits more compactly as an enum { Sequential(char), Many(ZeroVec<char>) }. A downside of this is that the Sequential case would need UTF8 validation every time, though we could make it so that that's just the wire format and we expand to a digit array on data load.

Todo: add configurability for this.

@sffc In the long run CompactDecimal / etc should also be using this data. In that case, should we just always generate all known decimal systems? How would the unification work across keys?

@Manishearth
Copy link
Member Author

Manishearth commented Nov 14, 2024

Digits becomes much larger in "all" mode. I've added code for that but not hooked it in yet.

 decimal/digits@1, <lookup>, 550B, 77 identifiers
 decimal/digits@1, <total>, 3080B, 3420B, 77 unique payloads

@Manishearth Manishearth force-pushed the decimaldigits branch 9 times, most recently from 455e5e9 to 6c46fc5 Compare November 14, 2024 21:13
@sffc
Copy link
Member

sffc commented Nov 14, 2024

@sffc In the long run CompactDecimal / etc should also be using this data. In that case, should we just always generate all known decimal systems? How would the unification work across keys?

I don't really understand the question? CompactDecimalFormatter depends on FixedDecimalFormatter and so it should already be using the new data markers.

@Manishearth
Copy link
Member Author

Manishearth commented Nov 14, 2024

Ah, we lose out on some of our wins when we handle the fact that the symbols data can differ for a given locale between numbering systems.

 decimal/symbols@2, <lookup>, 1316B, 252 identifiers
 decimal/symbols@2, <total>, 2740B, 1368B, 49 unique payloads

sffc
sffc previously approved these changes Nov 15, 2024
provider/source/src/decimal/mod.rs Outdated Show resolved Hide resolved
provider/source/src/decimal/mod.rs Outdated Show resolved Hide resolved
sffc
sffc previously approved these changes Nov 15, 2024
tutorials/data_provider.md Show resolved Hide resolved
components/decimal/src/provider.rs Outdated Show resolved Hide resolved
sffc
sffc previously approved these changes Nov 15, 2024
sffc
sffc previously approved these changes Nov 15, 2024
@Manishearth Manishearth merged commit 53eac4d into unicode-org:main Nov 15, 2024
28 checks passed
@Manishearth Manishearth deleted the decimaldigits branch November 15, 2024 08:07
Manishearth added a commit that referenced this pull request Nov 15, 2024
Depends on #5822

Minor win:

```
Before: decimal/symbols@2, <total>, 2740B, 1368B, 49 unique payloads
After:  decimal/symbols@2, <total>, 2657B, 1285B, 49 unique payloads
```


And an 8 byte win on FDF stack size.


<!--
Thank you for your pull request to ICU4X!

Reminder: try to use [Conventional
Comments](https://conventionalcomments.org/) to make comments clearer.

Please see
https://github.com/unicode-org/icu4x/blob/main/CONTRIBUTING.md for
general
information on contributing to ICU4X.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Split numbering systems data out from decimal symbols
2 participants