RFC: Tensorflow Core as a product #331
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,174 @@ | ||
# Tensorflow Core as a product RFC | ||
|
||
| Status | Proposed | | ||
:-------------- |:---------------------------------------------------- | | ||
| **Authors** | Yanhua Sun ([email protected]) | | ||
| **Sponsor** | Rohan Jain ([email protected]) | | ||
| **Updated** | 2020-11-13 | | ||
|
||
|
||
## Objective | ||
|
||
This RFC proposes Tensorflow core as a standalone product and describes the high level components and the principles behind it. | ||
|
||
## Motivation | ||
|
||
* Tensorflow APIs: TensorFlow’s Python API surface contains over 3,300 API symbols, not including tf.compat or class methods, or symbols in supplemental packages like TF Probability, TF Addons, TFX packages, TF Graphics, and more. This API surface has been described as “expansive” and “challenging to understand”; its symbols are often referred to as “unintuitively named” and as bucketed together in confusing ways. This reduces discoverability for documentation, confuses users, and increases the amount of prior knowledge required just to get started with TensorFlow. Among these APIs, some exist for historical reasons to keep backward compatibility, some are used only in specific domains, and some largely duplicate others with minor differences. Meanwhile, a subset of these APIs is essential and fundamental to building any ML application, and also serves as the basic building blocks for advanced APIs. We define this set of APIs as the Tensorflow core API, which should be lightweight, stable, performant, composable and complete enough to serve as the foundation of high level APIs. | ||
* Tensorflow packages: Currently Tensorflow is a monolithic package including all the APIs, functions and runtime. Even though sometimes users just need a small set of functions, the whole package needs to be downloaded, installed and built. This all adds up to a long development cycle, leading to lots of pain for users. Providing users with smaller meta packages can significantly reduce the pain here, leading to a faster development cycle. | ||
* Tensorflow internal code structure: Tensorflow can be intimidating to develop and to maintain. Among the many reasons for this, bad layering is a big one. A lightweight and tight core with other components built on top of it can help create better layers, getting rid of cyclical dependencies and reducing confusion. | ||
To resolve some of the problems mentioned above, we propose layering the system to create a “Tensorflow core” package. This package will enable a simple, performant and robust numerical computation library specialized for machine learning that allows advanced users to build and train the ML model of their choice. | ||
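The layering goal above can be illustrated with a small dependency check. This is only a sketch: the package names and the dependency graph are hypothetical stand-ins rather than the real TensorFlow build graph, and it relies on Python's standard `graphlib` module (Python 3.9+). A layered graph has a valid build order with core first; a graph where core depends back on a higher layer does not.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical package-level dependencies: each key depends on the packages in its set.
LAYERED = {
    "core": set(),
    "data": {"core"},
    "distribute": {"core"},
    "keras": {"core", "data", "distribute"},
}


def build_order(deps):
    """Return an order in which every package comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependency graph contains a cycle."""
    try:
        build_order(deps)
    except CycleError:
        return True
    return False


order = build_order(LAYERED)

# Same graph, but with core depending back on a higher layer (bad layering).
CYCLIC = dict(LAYERED, core={"keras"})
```

With the layered graph, `order` starts with `"core"` and ends with `"keras"`; `has_cycle(CYCLIC)` reports the circular dependency that the proposed layering is meant to rule out.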
## User Benefit | ||
Users of TF core include: all users who use the Tensorflow package, and users who will use Tensorflow core as a standalone package. Since the Tensorflow package will depend on Tensorflow core, all Tensorflow users are also Tensorflow core users. Some users directly interact with Core, while other users will indirectly use core through other high level libraries, which are built on top of Core. Some examples include: | ||
* Framework builders such as Keras and Sonnet, or authors of new frameworks who will now layer on top of a well established polished product. We expect that users who want to extend Keras / Sonnet will interact directly with this layer | ||
* Libraries like tf.data, tf.distribute, TFP, TF Agents that provide additional functionality to users such as input pipelines, higher level distribution APIs etc. | ||
* Users who require only a numerical core (possibly incl. automatic differentiation) for their work and who would be burdened by the (mental) overhead of additional functionality e.g. users working in the numerical and scientific computing field | ||
* Advanced users (researchers or production) who might desire more flexibility than provided by higher level APIs, or whose use cases are not covered by our existing high-level APIs | ||
|
||
The benefits for different user groups are: | ||
Faster build time, smaller binary size: Users who need only the core functions can depend on the Tensorflow core package instead of the monolithic TF package. This leads to faster build time and smaller binary size. | ||
|
||
Stable and performant Tensorflow core: As it's lightweight and more maintainable, it's easier to make Tensorflow core stable, performant and well tested benefitting all users. | ||
|
||
Easier development experience within Tensorflow codebase: Making the layers clearer and making Tensorflow core self-contained will help developers navigate the code base. | ||
|
||
## Proposal | ||
|
||
This RFC proposes the high level principles for achieving these goals, and the high level components that Tensorflow core will include. Implementation details are not the focus of this RFC; they will be discussed later. | ||
|
||
### Principles | ||
|
||
Purpose: Put users' needs and product excellence first, and minimize any internal constraints that get in the way of the product's purpose. | ||
|
||
General and complete: This set of core APIs together with the extension mechanics should be complete and comprehensive so that high level libraries can be built on core APIs without hacking into it. In future, even when high level use cases evolve, the core APIs should still be sufficient to serve these new use cases without changes. | ||
|
||
Minimal and stable: The set of APIs should be small. This helps reduce the burden of maintenance, and also helps keep the APIs more stable. Another way to put this principle is orthogonality, which means that two instructions are orthogonal if they cannot be used to accomplish the same task. APIs which lack orthogonality naturally lead to cognitive burden for users, not the least of which is the dreaded feeling of confusion: “there’s more than one way to do things”. | ||
|
||
Backwards compatibility: As a result of this effort, no Python user code will need to change. This is basically a refactoring and implementation layering exercise within Tensorflow that would improve the overall quality and maintainability of the product. Additionally, we will migrate some of the implementation from Python to C++. A set of new C++/C APIs will be created without affecting the existing Python APIs. | ||
|
||
### Keys to get there | ||
|
||
Understand users’ needs | ||
Review comment: I think this will require a SIG for the core. |
Review comment: Or "users" here refer to the high-level API owners, tf-keras/data/distribute etc. then one could argue internal communication might be more effective/efficient at the initial stage. |
Review comment: Please involve SIGs leads at least. |
||
|
||
The core team will talk with users constantly and involve them in all phases, from requirements and design through launch and afterward. We care about all core users and their feedback. The core team will also set up a few high touch teams to work together. | ||
|
||
Extensible and composable | ||
|
||
In our principles, we said that the set of APIs needs to be minimal: providing everything directly is not the goal, and is not sustainable. At the same time, this minimal set of APIs should be sufficient for users to build their libraries without hacking into core APIs or requesting changes or additions to them. Holding both principles at once may sound contradictory. | ||
|
||
The key here is to provide an extensible and composable mechanism, together with the set of core APIs. | ||
|
||
Performance | ||
|
||
TF Core will be extremely lightweight and performant. We’ll try to do as much computation as possible in C++ with a thin layer on top. For example, Python APIs will be one thin layer on top. This also involves building a clean C API layer where state of the art runtimes like TFRT and Graph Compilers like MLIR can plug in and improve performance. | ||
Review comment: What Is the python layer here? A 1-to-1 wrapper? Or an higher level layer with a reduced surface? |
Review comment: We would like see layered Python APIs, with Keras, tf-data, distributed, etc built on core APIs and extension/composite mechanics. This core APIs are a reduced and complete surface. One high level API could be composed by a few core APIs. |
||
|
||
|
||
### Guidelines on evaluating the core APIs | ||
|
||
* Is this API serving any use cases? | ||
|
||
When we add or remove an API, the first thing we should ask is ‘what is the use case for this API?’. The API should serve users’ purposes, not developers’ purposes. | ||
Review comment: I think that we cannot do too much claims for users without a regular SIG. Also what about custom c++ ops in SIGS? |
Review comment: There are some essential and basic C++ Ops. Also a lot of custom c++ ops can be replaced with python code. currently one of main reason for these custom C++ ops is performance. We do have some plans to bring the python code performance closer to custom c++ op performance, so that custom c++ is not required, instead, same functions are implemented in the high level languages, this provides easier programability and flexibility to modify. |
Review comment: Yes this was my point and this seems to me to give a quite high responsibility to the core API set perimeter definition + compiler stack performance to not create a bottleneck. |
Review comment: quick question, when I add a new hardware device(TensorFlow modular plugin), for ABI stable, I need register Ops/Kernels with C API, How could register them into TF-Core? also , I see (https://github.com/tensorflow/community/pull/77), if user need a new Python API(eg. for plugin device new Ops/Kernels), they need to interact TF-Core(C++ libraries) with C API, What do you think? |
Review comment: @sun51 In the meantime can you expose a little but more if and how this is going to interact with https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/mlir/tfr. |
||
|
||
* Is this API similar or redundant with existing ones? | ||
|
||
Corresponding to the principle ‘Primitive/Essential/Orthogonality’: If an API can be written using a few existing APIs together with the mechanics that core provides, then the API is not essential. Instead of including that particular API, core should include the primitive APIs it is composed of. | ||
Review comment: I think the "few" term Is different in the c++ domain instead of python domain |
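As an illustration of the redundancy test above, a softmax can be written from a few elementwise and reduction primitives and compared against the existing `tf.nn.softmax`. This is a sketch only: the RFC does not say which side of the core boundary softmax would fall on, and `softmax_from_primitives` is a hypothetical helper name.

```python
import tensorflow as tf


def softmax_from_primitives(logits):
    """Softmax over the last axis, built only from primitive ops."""
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - tf.reduce_max(logits, axis=-1, keepdims=True)
    exps = tf.exp(shifted)
    return exps / tf.reduce_sum(exps, axis=-1, keepdims=True)


logits = tf.constant([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
composed = softmax_from_primitives(logits)
reference = tf.nn.softmax(logits, axis=-1)

# Largest elementwise difference between the composed and built-in versions.
max_abs_diff = float(tf.reduce_max(tf.abs(composed - reference)))
```

Under this guideline, an API that can be reproduced this way from a handful of primitives would be a candidate to live above core, provided the composed version can be made fast enough.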
||
|
||
* Is this API easy to use? | ||
  * Usability - representation | ||
Review comment: Some formatting (especially line breaks) are broken in this doc when rendered. For example, you need a blank line here to break the line. |
||
  * Usability - interaction | ||
  * Usability - flow | ||
  * Predictable - consistent, adaptable, discoverable. | ||
  * Error messages: Are the error messages for the APIs easy to understand? Are the error messages clear enough to give the feedback on what is going wrong, and how to fix it? | ||
|
||
* Is this API performant? | ||
|
||
Review comment: On what reference runtime? TFRT? |
Review comment: A general suggestion is to avoid over-emphasizing "performant" without going into detailed analysis to the trade-offs, e.g. why this API design enables high performance implementation and its alternatives do not. |
||
There should be both micro benchmarks and model benchmarks to monitor the performance of the API. It should achieve the best performance. | ||
|
||
* Is this API well documented? | ||
|
||
Adding an API is not only about adding the code. It also involves writing good documentation, providing examples, and providing best practices on how to use this API. | ||
|
||
* Is this API well tested? | ||
|
||
A well tested API should include (1) unit tests; (2) integration tests that cover all combinations and possible use cases, with any cases that are infeasible to test explicitly listed; and (3) tests on various hardware platforms (including CPU, GPU, TPU, etc). Extension tests need to be included to support users who subclass things (like CompositeTensors). A good example to refer to: DistStrat handles its testing strategy such that, as new things are built out, users automatically get tests for them as part of the standard suite invocation. It's great to know that my code won't break as APIs get extended. | ||
|
||
In the following sections, we describe what kinds of APIs will be part of Tensorflow core. This RFC describes at a high level what kinds of APIs will be a part of core without listing every possible API endpoint that we will expose. The objective is to categorize what kinds of APIs will be part of core and what kinds we will omit from Core. | ||
|
||
Note: the core APIs are just a small subset of the existing TF APIs; they are expected to be more stable, essential, faster, and better tested. No existing TF API will be changed, and all will continue to be supported. | ||
|
||
### Data structures in Core | ||
|
||
* Tensor: Immutable multi-dimensional dense arrays. | ||
* Variables: Mutable shared persistent data. | ||
* TensorArray: Dynamically sized, write-once Tensor arrays. | ||
* Lookup tables: Hash tables. | ||
* Element data types, like tf.dtypes. | ||
* Module: Named container for variables and state. | ||
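A brief sketch of how these structures look in today's public Python API may help. It assumes eager execution in TensorFlow 2.x, and the `Scale` module is a hypothetical example, not something this RFC defines.

```python
import tensorflow as tf

# Tensor: immutable. Item assignment such as `t[0, 0] = 1.0` is not supported.
t = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Variable: mutable state that is updated in place.
v = tf.Variable(tf.zeros([2, 2]))
v.assign_add(t)

# TensorArray: dynamically sized; each write returns the array to use next.
ta = tf.TensorArray(tf.float32, size=0, dynamic_size=True)
for i in range(3):
    ta = ta.write(i, tf.fill([2], float(i)))
stacked = ta.stack()  # shape (3, 2)


class Scale(tf.Module):
    """Hypothetical module: a named container holding one variable."""

    def __init__(self):
        super().__init__(name="scale")
        self.w = tf.Variable(2.0)

    def __call__(self, x):
        return self.w * x


scale = Scale()
scaled = scale(t)
```

`scale.trainable_variables` collects the variable owned by the module, which is the kind of state tracking that higher level libraries such as Keras build on.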
|
||
We’ll allow users to extend and create user defined data structures using: | ||
* ExtensionType (also known as CompositeTensor, RFC): provides a pair of protocols for defining new types that can be directly supported by TensorFlow APIs. An example ExtensionType in core is IndexedSlices. | ||
* TypeSpec (https://github.com/tensorflow/community/pull/269): provides a standard way to define and reason about TensorFlow-specific types that are supported by TensorFlow APIs. | ||
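The extension mechanism can be sketched with `tf.experimental.ExtensionType`, assuming a TensorFlow release that ships it; the design in this RFC may differ in detail. `MaskedTensor` and `total` are hypothetical names used only for illustration.

```python
import tensorflow as tf


class MaskedTensor(tf.experimental.ExtensionType):
    """Hypothetical user-defined type: values plus a validity mask."""

    values: tf.Tensor
    mask: tf.Tensor

    def filled(self, fill_value=0.0):
        # Replace masked-out entries with `fill_value`.
        return tf.where(self.mask, self.values, fill_value)


@tf.function
def total(mt):
    # The extension type is passed through tf.function like a built-in tensor type.
    return tf.reduce_sum(mt.filled())


mt = MaskedTensor(
    values=tf.constant([1.0, 2.0, 3.0]),
    mask=tf.constant([True, False, True]),
)
result = total(mt)
spec = tf.type_spec_from_value(mt)  # the TypeSpec describing this value
```

Here `result` sums only the unmasked entries, and `spec` is the TypeSpec that lets TensorFlow APIs reason about the new type without special-casing it.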
|
||
Note: We are designing a new full type system. Variables will inherit from core.Tensor. | ||
TensorArray | ||
|
||
### The APIs on the structures mentioned above | ||
* Creation functions for Tensor etc. | ||
* APIs to query the attributes of Tensors, including rank, shape, size. | ||
* Modify, squeeze, expand Tensorflow Shape. | ||
* Get the static value from a Tensor. | ||
* Clip operation | ||
* Concat/Split/Stack/Unstack/Pad | ||
* Slicing APIs (scatter/gather for now but we’ll expand __getitem__ so that gather isn’t needed) | ||
* Search/Sort/Unique/Reorder | ||
* Transform | ||
* Flatten and pack | ||
* Bitwise operations | ||
* Conversion with other frameworks (TensorProto, ndarray) | ||
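For orientation, several of the categories above map onto existing public TensorFlow 2.x calls as sketched below. Which exact endpoints end up in core is not specified by this RFC, so the selection here is an assumption.

```python
import tensorflow as tf

x = tf.reshape(tf.range(6), [2, 3])          # creation: [[0, 1, 2], [3, 4, 5]]

rank = tf.rank(x)                            # attribute queries
size = tf.size(x)
static_shape = x.shape.as_list()

expanded = tf.expand_dims(x, axis=0)         # shape (1, 2, 3)
squeezed = tf.squeeze(expanded, axis=0)      # back to shape (2, 3)

clipped = tf.clip_by_value(x, 1, 4)          # clip values into [1, 4]

stacked = tf.stack([x, x], axis=0)           # shape (2, 2, 3)
left, right = tf.split(x, [1, 2], axis=1)    # column blocks of width 1 and 2

gathered = tf.gather(x, [2, 0], axis=1)      # pick columns 2 and 0
sliced = x[:, ::2]                           # __getitem__: every other column

as_ndarray = x.numpy()                       # conversion to a NumPy ndarray
```

The `gather` and `__getitem__` lines select columns in two different ways, which is the overlap the slicing bullet above proposes to reduce.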
|
||
### Functional components of TF core | ||
* Control flow: tf.cond, tf.while_loop | ||
* tf.function: Trace compile tensorflow programs to accelerate them | ||
Review comment: Can the quality of the compiler performance in a function have an impact on the evaluation about extending the core? Or we will be forwarded on compiler improvements? |
||
* Automatic differentiation: GradientTape, forward mode gradients, higher order gradients etc. | ||
* AutoGraph: Convert plain Python code to Tensorflow code. | ||
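The functional components listed above already compose in the current public API. The following is a small sketch under that assumption: `abs_or_square` is a hypothetical function, and nested tapes are used to show a higher order gradient.

```python
import tensorflow as tf


@tf.function
def abs_or_square(x):
    # AutoGraph converts this Python `if` on a tensor into graph control flow
    # when tf.function traces the function.
    if x > 0:
        return x * x
    else:
        return -x


positive = abs_or_square(tf.constant(3.0))
negative = abs_or_square(tf.constant(-2.0))

# Higher order gradients with nested tapes: y = x^3, dy/dx = 3x^2, d2y/dx2 = 6x.
x = tf.constant(3.0)
with tf.GradientTape() as outer:
    outer.watch(x)
    with tf.GradientTape() as inner:
        inner.watch(x)
        y = x * x * x
    dy = inner.gradient(y, x)
d2y = outer.gradient(dy, x)
```

The same Python function body is used for both branches of the conditional; tracing happens once per input signature, so the second call reuses the traced graph.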
|
||
### Scalability support | ||
* TF Save/Load: Serializing / Deserializing a tensorflow program (tf.function) to disk. | ||
* Distribution support: Low level APIs for distribution (tf.device, collectives) | ||
* Automatic batching: pfor, vectorized_map to automatically batch computation. | ||
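Automatic batching can be sketched with the existing `tf.vectorized_map`: a function written for a single example is applied across the leading batch dimension. `per_example` is a hypothetical function used only for illustration.

```python
import tensorflow as tf


def per_example(row):
    """Sum of squares of one example; written without any batch dimension."""
    return tf.reduce_sum(row * row)


batch = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Vectorizes `per_example` over the leading axis of `batch`.
batched = tf.vectorized_map(per_example, batch)
```

The per-example function stays free of batch bookkeeping, and the vectorization is handled by the library rather than by a Python loop.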
|
||
### Development support | ||
* Debugging: APIs for error handling, error messages; APIs for debugging Tensorflow programs. | ||
* Testing: APIs and functions for users to test Tensorflow programs. | ||
* Basic Utilities: Essential and primitive utilities functions, APIs. | ||
|
||
### Math/NN APIs | ||
Review comment: What about RNG ops (both stateless RNGs and tf.random.Generator)? |
||
* Basic Math ops like tf.add, tf.matmul etc. which closely match what numpy offers | ||
* Basic NN related ops like tf.nn.conv2d, padding, max pooling etc. | ||
|
||
### The following will not be included in the Core: | ||
* Model, Layer high level APIs (Keras) | ||
* Losses, Metrics, Optimizers (Keras) | ||
* High level Distribution APIs (tf.distribute) | ||
* Input pipeline APIs (tf.data) | ||
* Mobile APIs (TF Lite) | ||
* XLA/MLIR | ||
* Summary / Visualization APIs (TensorBoard) | ||
* Audio/Signal processing (tf.audio / tf.signal) | ||
* Computer Vision (tf.image) | ||
|
||
### Dependencies | ||
For backward compatibility, all existing TF applications will continue working without any change. Internally, we expect some code refactoring and structural changes, including in TF distribution strategy, TF data, etc. We will encourage new users to try the TF core package when a small set of APIs is enough. | ||
|
||
### Engineering Impact | ||
We expect faster build time and smaller binary size for users depending on the Core package. | ||
The TF-core team will maintain this code. | ||
|
||
### Platforms and Environments | ||
This will work on all platforms supported by Tensorflow. This should not impact automatic code generation or mobile stripping tooling or transformation tools. | ||
* Execution environments (Cloud services, accelerator hardware): no impact. | ||
|
||
|
||
### Compatibility | ||
* Does the design conform to the backwards & forwards compatibility [requirements](https://www.tensorflow.org/programmers_guide/version_compat)? | ||
Yes, all the existing TF APIs will not be affected by this. |
Review comment:
This needs some more nuance.
When I first read "TF core" I thought it was about separating the C++ library from the python; but below it seems like you want to include some python API as part of core.
So for example when it comes to XLA/MLIR, either JITted or AOT (e.g., the functionality in saved_model_cli), do you plan to keep those in core, or do you expect that all XLA functionality be split off in separate symbolic libraries? If so, how do you expect to support `jit_compile=True` extension of tf.function?

Review comment:
+1 for separating C++ library from the Python one. The Python API might serve for prototyping purpose if we'd like to collect feedbacks from a broader audience though, e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/distribute/v1/all_reduce.py

Review comment:
I'll try and clarify the proposal a bit here. There are couple of directions this RFC is pushing towards
Python API modularity - Currently TF core includes Keras, tf.data, distribution strategies etc. without any clear layering and dependency structure. We want to define a small core on which all these libraries can be layered on top of. We don't plan to remove any APIs from TF or add too many new APIs etc.
C++ first API - We want to think of the "Core API" as primarily a C++ one with a thin (probably 1-1 mapping) layer of python on top. Users can choose to interact with either API. Arguably, one can think of this as an implementation detail but its a real important implementation detail that we wanted to bring folks attention to.
Concretely, the first step we're taking now is defining the boundaries of what this small core API surface is going to look like and ensuring that the other TF APIs (tf.data, distribute, keras etc.) are dependent on it (and not the other way). This involves removing circular dependencies, BUILD file refactoring, moving some files around, creating right extension points / interfaces in TF core with no API surface changes expected whatsoever.

Review comment:
@rohan100jain +1
If there is design meeting kindly let us know. I would like to join the call.

Review comment:
Yes, design review meeting is scheduled at 10AM PT Dec 15, 2020

Review comment:
Sorry, we need more time to figure out things better, the design review is cancelled.