ICU4X is an effort under Unicode guidance to develop a set of Unicode Components with a focus on modularity, composability, and FFI. The components support the scope of ECMA-402 and are written in accordance with modern internationalization techniques and standards.
The effort, currently driven mostly by engineering resources from Google and Mozilla, has been in development since February 2020 and we’re now looking to establish the criteria for evaluation of its value proposition and business fit.
The following is the vision for a stable, production-ready 1.0 release and the milestones on the path toward it.
Primarily, ICU4X aims to offer a more modular approach to internationalization, allowing client-side and resource-constrained environments to fine-tune the payload they vendor-in to maximize the quality and breadth of the internationalization coverage and relax the constraints internationalization imposes on product release cycles.
Secondarily, ICU4X aims to provide a single high quality Internationalization API implementation written in a modern language, with low memory overhead and great runtime performance, exposed to many target ecosystems via robust FFI. This will reduce the maintenance cost of internationalization stacks and improve access to high quality internationalization in multi-stack environments.
Finally, ICU4X will offer a powerful and flexible data management system, enabling sophisticated software release models and long-lived software processes to maintain data selection and updates.
The initial stakeholders are Unicode, Google and Mozilla.
Unicode provides the knowledge, experience and guidance, as well as a non-corporate environment in which such an industry wide effort can be developed. Mozilla and Google provide engineering resources driving the development and offering business needs, product environment, and high-fidelity opportunities to validate the project against popular mature software.
As the project matures, we hope to attract three more classes of stakeholders:
Thanks to Unicode guidance, strong engineering resources, a mature CLDR base, and the experience of ECMA-402, ICU4C and ICU4J, ICU4X has a chance to expose modern, well-designed APIs based on the lessons learned over the last 30 years of industry development.
ICU4X will enable high quality, rich and modern internationalization for environments and software deployment types which are unable to benefit from the currently existing industry options.
With those options at hand, ICU4X should become an attractive project for the widely understood Internationalization Community, which currently often maintains pieces of ICU-like APIs for their target needs, or a patchwork of wrappers and heavily fine-tuned deployments of ICU.
Besides Google and Mozilla, many other organizations aim to target their products and services at the whole world.
Currently, most of them consume ICU, or maintain their own equivalent of such for environments where monolithic C++ and Java are not providing the right payload or safety characteristics.
With the maturing of ICU4X we hope to attract those organizations to consider ICU4X and contribute to it.
The Rust programming language has grown one of the most distinctive and productive software communities in the world. Rust has been voted the most loved programming language in the annual StackOverflow survey for four years in a row.
Internationalization is a notoriously challenging domain of software, and the Rust community has proven that it handles challenging problems very well. Currently the ecosystem uses Unicode and ICU-like APIs in many crucial places, including the rustc compiler itself and multiple high-quality layout implementations such as xi-editor. With the rising focus on higher-level libraries such as GUI toolkits, game and web engine components, high-quality internationalization solutions are increasingly in demand.
By introducing ICU4X to the community, we have a chance to attract high-quality contributors to ICU4X, much like we were able to build a robust community around ECMA-402.
ICU4X 0.1 was released on October 23rd, 2020. The release contained the smallest viable subset of API surface and design model needed to establish a release.
ICU4X 0.1’s selection of components aimed to:
- Validate low-level models around data management
- Establish project culture using simple building blocks such as Locale and Plural Rules
- Introduce a low-level foundation for string operations with CodePointSet
- Expose a single, high-level, highly requested API - DateTimeFormat
- Release a meta-package ICU
With the upcoming 0.2 release, we aim to close the gap between our 0.1 features and requirements of ECMA-402 for Intl.Locale, Intl.PluralRules, and Intl.DateTimeFormat, as discussed below.
Developing a high-quality solution for the internationalization industry's needs is a noble goal which requires a lot of time to design correctly. All three current stakeholders are well positioned to justify such effort.
This allows ICU4X to strive for project quality rather than short-term business needs. However, software projects disconnected from business needs risk developing “Ivory Tower” syndrome and drifting away from real-world requirements.
We strive to create a roadmap that balances industry-level excellence with business alignments that allow us to frequently evaluate our ability to deliver on the value proposition. We want to recognize as early as possible if the effort is not yielding the expected results.
To achieve this, we have identified a number of milestones, aligned with project planning, at which we will be able to test the prototypes of ICU4X against real business needs. This lets us validate and learn from the results, providing an opportunity to adapt our roadmap to ensure that ICU4X meets the needs of real-world products.
We identified a number of opportunities to evaluate progress of ICU4X on the path to 1.0 release against production environments:
With the upcoming 0.2 release, ICU4X will provide at least three components that can be validated against a comprehensive internationalization test harness - test262 - used by ECMA-402:
The former two should be able to pass the full test scope.
The DateTime API is larger, and we hope to pass enough of the test corpus to make our performance measurements meaningful for comparison with production-quality date and time formatting solutions.
If successful, this test will allow us to validate the claim that ICU4X can be a backend target for ECMA-402 level Internationalization APIs, and reason about performance characteristics of that subset of ICU4X.
If we were to successfully expose a simple component of ICU4X via Wasm or FFI to another programming environment, such as Dart, JS, Python or PHP, we would be able to validate the claim that ICU4X can provide a “write-once-use-everywhere” path to low-maintenance, high-quality internationalization in multiple environments.
The proposed target for such a component is a FixedDecimal. The chosen API is feature complete and allows us to validate mutable and immutable scenarios.
Such a component would require less data, have a simple API surface and require less code to be vendored in, while providing a high quality output for a highly requested feature.
If that test were to be successful, we would validate an additional ICU4X proposition: the ability to design internationalization components in a modular fashion, serving fully featured formatters to those who need to support full ECMA-402 and ICU4C/ICU4J needs, as well as serving a more modular subset of features to those who only need core functionality.
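As one way to picture this experiment, the following is a minimal Rust sketch of a small decimal type exposed through a C ABI. `SketchDecimal` and the `sketch_decimal_*` functions are hypothetical names invented for illustration and are not the FixedDecimal API. The sketch reads "mutable and immutable scenarios" as an in-place operation next to one that returns a new value, which is an assumption. The symbol-export attribute a real build would need is left out to keep the example short.

```rust
/// Hypothetical stand-in for a fixed decimal: `digits` scaled by 10^`exponent`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchDecimal {
    digits: i64,
    exponent: i16,
}

impl SketchDecimal {
    pub fn new(digits: i64) -> Self {
        SketchDecimal { digits, exponent: 0 }
    }

    /// Mutable scenario: shifts the decimal point in place.
    pub fn multiply_pow10(&mut self, delta: i16) {
        self.exponent += delta;
    }

    /// Immutable scenario: leaves `self` untouched and returns a shifted copy.
    pub fn multiplied_pow10(&self, delta: i16) -> Self {
        let mut copy = self.clone();
        copy.multiply_pow10(delta);
        copy
    }
}

/// Allocates a decimal and hands an opaque pointer to the foreign caller.
pub extern "C" fn sketch_decimal_new(digits: i64) -> *mut SketchDecimal {
    Box::into_raw(Box::new(SketchDecimal::new(digits)))
}

/// Safety: `ptr` must come from `sketch_decimal_new` and not have been freed.
pub unsafe extern "C" fn sketch_decimal_multiply_pow10(ptr: *mut SketchDecimal, delta: i16) {
    if let Some(decimal) = unsafe { ptr.as_mut() } {
        decimal.multiply_pow10(delta);
    }
}

/// Safety: `ptr` must come from `sketch_decimal_new` and not have been freed.
/// Returns 0 for a null pointer.
pub unsafe extern "C" fn sketch_decimal_exponent(ptr: *const SketchDecimal) -> i16 {
    unsafe { ptr.as_ref() }.map_or(0, |decimal| decimal.exponent)
}

/// Safety: `ptr` must come from `sketch_decimal_new`; it is invalid afterwards.
pub unsafe extern "C" fn sketch_decimal_free(ptr: *mut SketchDecimal) {
    if !ptr.is_null() {
        drop(unsafe { Box::from_raw(ptr) });
    }
}
```

The opaque-pointer shape keeps ownership on the Rust side, so a binding in Dart, JS, Python or PHP would only need to call create, operate and free functions.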
Mozilla Firefox Engine - Gecko - currently uses Rust components for Locale and Plural Rules.
Those components have been donated to ICU4X and their stripped versions served as a bedrock for the 0.1 release. Since then the implementations in ICU4X have been maturing and soon will reach feature and performance parity with the original libraries.
This will allow us to replace the standalone components with their evolved ICU4X derivatives, which will serve as a product-market validation, bringing ICU4X to 100m+ users on Windows, Linux, macOS and Android.
As a result of the upcoming ECMA-402 Intl.Segmenter API, the Mozilla Platform Internationalization Team is facing a challenge that ICU4X is directly aiming to solve.
Mozilla currently maintains its own segmentation engine for its layout needs, and pulling in ICU4C Segmenter would bring a substantial payload and result in code and data duplication.
Replacing lwbrk with ICU4C for layout needs would require a substantial effort to customize data management to handle Mozilla’s custom data.
ICU4X offers an environment in which we are developing a new Unicode UAX#14/UAX#29 compatible segmentation API which will use the foundational CodePointSet API, and fit the needs of both Layout and ECMA-402.
Google contributed an ML-based segmenter model for Thai, Burmese, Khmer and Lao, which cut data size by ~75% and increased precision.
Irregexp is a Google regular expression engine, used in Google and Mozilla JS engines.
Due to the problem known as catastrophic backtracking, work is underway to develop a non-backtracking engine as part of Irregexp. Irregexp currently uses ICU4C, but Mozilla is interested in investing in the ICU4X Unicode Properties API to provide a better aligned and more performant API for Irregexp's needs, and plans to implement the binding from Irregexp to ICU4X.
Fuchsia maintains a thin wrapper around ICU4C exposed to Rust and would like to replace it with ICU4X. If the April test262 test is successful, we’ll be in a good position to offer Fuchsia the ability to test replacing the same subset of APIs, currently backed by ICU4C, with ICU4X.
We need to make sure to avoid duplicating data between the existing ICU4C library and the newly added ICU4X dependency. It would be great if ICU4X could use data already present in Fuchsia.
In order to balance the technical and business requirements of the project, the following selection of APIs and features is proposed for the 1.0 release:
Data Provider is the most crucial part of the ICU4X value proposition. High quality, flexible data management is required to prove the technical characteristics of the product, as well as provide business features that make ICU4X attractive to customers.
DataProvider is necessary to meet performance requirements, validate FFI models, and enable production use of ICU4X.
Additionally, a flexible data provider gives users greater control over data / payload size and resource loading models.
For production readiness, we’ll need to validate the DataProvider management model using synchronous and asynchronous I/O, and prove the concept of chained data management.
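To make the idea of chained data management more concrete, here is a minimal synchronous sketch in Rust. The trait `SketchProvider`, the types `StaticProvider` and `ChainedProvider`, and the string keys are all hypothetical names made up for this example and do not describe the actual DataProvider API. The sketch only shows one provider falling back to another on a miss.

```rust
use std::collections::HashMap;

/// Hypothetical trait for illustration; not the ICU4X DataProvider API.
pub trait SketchProvider {
    /// Returns the data stored for `key` in `locale`, if this provider has it.
    fn load(&self, key: &str, locale: &str) -> Option<String>;
}

/// In-memory provider, standing in for data shipped with the binary
/// or downloaded later.
pub struct StaticProvider {
    entries: HashMap<(String, String), String>,
}

impl StaticProvider {
    /// Builds a provider from `(key, locale, value)` triples.
    pub fn new(entries: &[(&str, &str, &str)]) -> Self {
        let entries = entries
            .iter()
            .map(|(key, locale, value)| {
                ((key.to_string(), locale.to_string()), value.to_string())
            })
            .collect();
        StaticProvider { entries }
    }
}

impl SketchProvider for StaticProvider {
    fn load(&self, key: &str, locale: &str) -> Option<String> {
        self.entries
            .get(&(key.to_string(), locale.to_string()))
            .cloned()
    }
}

/// Tries `primary` first and falls back to `secondary` on a miss.
pub struct ChainedProvider<A, B> {
    pub primary: A,
    pub secondary: B,
}

impl<A: SketchProvider, B: SketchProvider> SketchProvider for ChainedProvider<A, B> {
    fn load(&self, key: &str, locale: &str) -> Option<String> {
        self.primary
            .load(key, locale)
            .or_else(|| self.secondary.load(key, locale))
    }
}
```

Because `ChainedProvider` implements the same trait as its parts, chains can be nested, for example bundled data first, then a cache, then a remote source. An asynchronous variant would need a different trait signature and is not shown here.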
Locale is a fundamental data type in internationalization and an ECMA-402 component.
It validates ICU4X's ability to design clean, ergonomic APIs and maintain a lean and modular architecture, allowing customers to use just this component, with or without data, if needed.
Locale is also important for our performance claims, since large software such as Firefox handles hundreds of locale parsings and matchings during startup, and the quality of that component carries over to many hot-path scenarios such as language negotiation.
Finally, implementing and maintaining a core component with great performance characteristics, test coverage, and documentation allows us to establish high-quality culture within the project and expect that other components match it.
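As a rough illustration of the kind of data-free parsing this component performs, the sketch below parses a simplified language identifier. `SketchLangId` and `parse_lang_id` are hypothetical names, not the ICU4X Locale API, and the accepted grammar is deliberately reduced: a 2-3 letter language, an optional 4-letter script and an optional region, with variants and extensions rejected.

```rust
/// Hypothetical, simplified language identifier; not the ICU4X Locale type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SketchLangId {
    pub language: String,
    pub script: Option<String>,
    pub region: Option<String>,
}

/// Parses `language[-Script][-REGION]`, accepting `-` or `_` as separators
/// and normalizing case. Anything else (variants, extensions) yields `None`.
pub fn parse_lang_id(input: &str) -> Option<SketchLangId> {
    let is_alpha = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphabetic());
    let is_digit = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());

    let mut parts = input.split(|c: char| c == '-' || c == '_');
    let language = parts.next()?;
    if !(2..=3).contains(&language.len()) || !is_alpha(language) {
        return None;
    }

    let mut id = SketchLangId {
        language: language.to_ascii_lowercase(),
        script: None,
        region: None,
    };

    for part in parts {
        match part.len() {
            // Script: four letters, title-cased, must come before the region.
            4 if is_alpha(part) && id.script.is_none() && id.region.is_none() => {
                let mut script = part.to_ascii_lowercase();
                script[..1].make_ascii_uppercase();
                id.script = Some(script);
            }
            // Region: two letters, upper-cased.
            2 if is_alpha(part) && id.region.is_none() => {
                id.region = Some(part.to_ascii_uppercase());
            }
            // Region: three digits, such as 419.
            3 if is_digit(part) && id.region.is_none() => {
                id.region = Some(part.to_string());
            }
            _ => return None,
        }
    }
    Some(id)
}
```

The parser needs no locale data, which is the property that lets a component like this be used on its own.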
Pluralization is similar to Locale in that it is a core operation required by many other components, it’s a lean module and an ECMA-402 component.
As such it carries similar value and justification for inclusion in 1.0 scope.
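For readers unfamiliar with the operation, the sketch below selects a cardinal plural category for a non-negative integer. `PluralCategory` and `select_cardinal` are hypothetical names for this example. The English and Polish rules are hard-coded for illustration and cover integers only, whereas a real implementation would load rules from CLDR data.

```rust
/// Plural categories used by this sketch (CLDR also defines `zero` and `two`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PluralCategory {
    One,
    Few,
    Many,
    Other,
}

/// Hypothetical helper: cardinal category for a non-negative integer.
/// Rules are hard-coded for "en" and "pl"; other languages fall back to `Other`.
pub fn select_cardinal(language: &str, n: u64) -> PluralCategory {
    match language {
        "en" => {
            if n == 1 {
                PluralCategory::One
            } else {
                PluralCategory::Other
            }
        }
        "pl" => {
            let (mod10, mod100) = (n % 10, n % 100);
            if n == 1 {
                PluralCategory::One
            } else if (2..=4).contains(&mod10) && !(12..=14).contains(&mod100) {
                PluralCategory::Few
            } else {
                // For integers, every remaining case is `many` in Polish.
                PluralCategory::Many
            }
        }
        _ => PluralCategory::Other,
    }
}
```

A caller would use the returned category to pick among message variants, for example "1 file" versus "5 files" in English.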
Unicode Properties are required for many low level components including a number of our business and strategic alignments.
By providing a high-quality API backed by a powerful data management solution, ICU4X becomes attractive for segmentation, collation, and Unicode regular expression targets, areas where we already see interest.
Unicode Properties API is also critical for our ability to harness the dynamic Rust community as the GUI/layout needs arise.
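A common way to represent a set of code points, such as the set of characters sharing a Unicode property, is a sorted list of inclusive ranges queried with binary search. The sketch below shows that technique. `RangeSet` is a hypothetical type for illustration and is not the CodePointSet API, whose internal representation may differ.

```rust
use std::cmp::Ordering;

/// Hypothetical code point set for illustration; not the ICU4X CodePointSet.
pub struct RangeSet {
    /// Sorted, non-overlapping, non-adjacent inclusive ranges.
    ranges: Vec<(u32, u32)>,
}

impl RangeSet {
    /// Builds a set from inclusive ranges, sorting them and merging
    /// any that overlap or touch.
    pub fn from_ranges(mut input: Vec<(u32, u32)>) -> Self {
        input.sort();
        let mut ranges: Vec<(u32, u32)> = Vec::new();
        for (start, end) in input {
            if let Some(last) = ranges.last_mut() {
                if start <= last.1.saturating_add(1) {
                    // Overlapping or adjacent: extend the previous range.
                    last.1 = last.1.max(end);
                    continue;
                }
            }
            ranges.push((start, end));
        }
        RangeSet { ranges }
    }

    /// Membership test in O(log n) over the number of ranges.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        self.ranges
            .binary_search_by(|&(start, end)| {
                if cp < start {
                    Ordering::Greater // this range lies after the code point
                } else if cp > end {
                    Ordering::Less // this range lies before the code point
                } else {
                    Ordering::Equal
                }
            })
            .is_ok()
    }
}
```

Storing ranges instead of individual code points keeps the data small for properties that cover long contiguous runs, which matters for the payload goals described earlier.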
Segmentation aligns with business needs: it allows us to attract Mozilla investment and, in return, gives us the ability to test our model in production by supplying the needs of both the Gecko layout engine and the ECMA-402 component.
Like CodePointSet, the segmentation API also carries strategic value as an attractor for the Rust community. As a higher-level API, it additionally allows us to validate performance claims in a performance-critical part of layout, which serves our goal of attracting corporate interest in the project.
Date and Time formatting are the most common high-level requests out of internationalization APIs and a core part of ECMA-402.
On the technical side, this API allows us to validate our value proposition with a high-level, complex, highly demanded API. We have a chance to explore our modularization capabilities, high-volume data management, and non-trivial algorithmic requirements.
On the business side, we get a chance to supply the most common user needs, and cater to the Internationalization and Rust communities.
Number formatting represents a blend of low-level foundational APIs such as Locale, Plural Rules and CodePointSet, with high-level APIs such as DateTimeFormat.
It is required for DateTimeFormat, and will exercise our ability to keep our APIs modular and lean while building the large feature set that ECMA-402 NumberFormat carries.
FFI poses a lot of technical challenges that we have to overcome to validate the business value of ICU4X being a solution to the high-maintenance cost of keeping separate implementations for many target environments.
We’ll need to validate the ability to produce bindings to other programming languages and technically prove that the DataProvider model can fit those requirements.
With the above listed components, ICU4X 1.0 will provide a comprehensive internationalization solution and validate its value proposition.
Components listed below are worth considering for 1.0 as resources and timing permit, but will not be necessary to prove the product.
Collation is one of the core ECMA-402 components, and builds on top of the Unicode Properties API which we’re investing in.
At the same time, it provides technical value to the 1.0 scope similar to that of Segmenter, but without a clear business need, which leaves ECMA-402 completeness as its core driver.
As such, it is not required to prove the value of the project or evaluate its characteristics and market fit, which allows us to push it past 1.0.
Depending on the ability to modularize DateTimeFormat, Duration or List format may serve as high-level, simple targets for FFI validation.
If a subset of DateTimeFormat is used, then the business and technical needs for such components are lowered and all such APIs can be implemented post 1.0.
For the current roadmap, please see ICU4X 1.0 Roadmap.