diff --git a/README.md b/README.md
index 6a0c64ce..7fe5d7f1 100644
--- a/README.md
+++ b/README.md
@@ -80,7 +80,7 @@ PyPi Package (Linx) | [![](https://github.com/google-ai-edge/ai-edge-torch/ac
  * Python versions: 3.9, 3.10, 3.11
  * Operating system: Linux
  * PyTorch: ![torch](https://img.shields.io/badge/torch-2.4.0.dev20240429-blue)
- * TensorFlow: [![tf-nightly](https://img.shields.io/badge/tf--nightly-2.17.0.dev20240430-blue)](https://pypi.org/project/tf-nightly/)
+ * TensorFlow: [![tf-nightly](https://img.shields.io/badge/tf--nightly-2.17.0.dev20240509-blue)](https://pypi.org/project/tf-nightly/)
diff --git a/ai_edge_torch/generative/README.md b/ai_edge_torch/generative/README.md
index 3644c1c4..6f8de532 100644
--- a/ai_edge_torch/generative/README.md
+++ b/ai_edge_torch/generative/README.md
@@ -57,13 +57,13 @@ Once you re-author the model and validate its numerical accuracy, you can conver
 For example, in the `generative/examples/test_models/toy_model_with_kv_cache.py`, you can define inputs for both signatures:
 
 Sample inputs for the `prefill` signature:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/1791dec62f1d3f60e7fe52138640d380f58b072d/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L105-L108
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L105-L108
 
 Sample inputs for the `decode` signature:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/1791dec62f1d3f60e7fe52138640d380f58b072d/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L111-L114
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L111-L114
 
 Then export the model to TFLite with:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/1791dec62f1d3f60e7fe52138640d380f58b072d/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L133-L139
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/test_models/toy_model_with_kv_cache.py#L133-L139
 
 Please note that using the `prefill` and `decode` method conventions are required for easy integration into the Mediapipe LLM Inference API.
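For context, the export flow that these permalinks point to amounts to: define sample inputs for each signature, then chain them through `ai_edge_torch.signature(...)` and call `convert()`/`export()`. The sketch below illustrates that flow with a stand-in module; the module itself, the shapes, dtypes and output path are assumptions rather than the repository's actual toy-model code.

```python
# Minimal sketch of defining `prefill`/`decode` sample inputs and exporting a
# multi-signature TFLite model. The stand-in module, shapes and dtypes are
# illustrative assumptions; see toy_model_with_kv_cache.py for the real code.
import torch
import ai_edge_torch


class ToyModel(torch.nn.Module):
  """Stand-in for a re-authored decoder: (tokens, input_pos) -> logits."""

  def __init__(self, vocab_size: int = 256, dim: int = 16):
    super().__init__()
    self.embedding = torch.nn.Embedding(vocab_size, dim)
    self.lm_head = torch.nn.Linear(dim, vocab_size)

  def forward(self, tokens: torch.Tensor, input_pos: torch.Tensor) -> torch.Tensor:
    # input_pos is unused by this stand-in; a real model indexes its KV cache with it.
    return self.lm_head(self.embedding(tokens))


PREFILL_SEQ_LEN = 512
model = ToyModel().eval()

# `prefill` ingests a whole (padded) prompt chunk in one call.
prefill_tokens = torch.zeros((1, PREFILL_SEQ_LEN), dtype=torch.long)
prefill_input_pos = torch.arange(0, PREFILL_SEQ_LEN)

# `decode` ingests a single token at the current position.
decode_token = torch.zeros((1, 1), dtype=torch.long)
decode_input_pos = torch.tensor([0], dtype=torch.int64)

edge_model = (
    ai_edge_torch.signature('prefill', model, (prefill_tokens, prefill_input_pos))
    .signature('decode', model, (decode_token, decode_input_pos))
    .convert()
)
edge_model.export('/tmp/toy_model_with_kv_cache.tflite')
```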
@@ -78,7 +78,7 @@ The user needs to implement the entire LLM Pipeline themselves, and call TFLite
 
 This approach provides users with the most control. For example, they can implement streaming, get more control over system memory or implement advanced features such as constrained grammar decoding, speculative decoding etc.
 
-A very simple text generation pipeline based on a decoder-only-transformer is provided [here](https://github.com/google-ai-edge/ai-edge-torch-archive/blob/main/ai_edge_torch/generative/examples/c%2B%2B/text_generator_main.cc) for reference. Note that this example serves as a starting point, and users are expected to implement their own pipelines based on their model's specific requirements.
+A very simple text generation pipeline based on a decoder-only-transformer is provided [here](https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/c%2B%2B/text_generator_main.cc) for reference. Note that this example serves as a starting point, and users are expected to implement their own pipelines based on their model's specific requirements.
 
 #### Use MediaPipe LLM Inference API
@@ -105,7 +105,7 @@ model-explorer 'gemma-2b.tflite'
 
 Gemma-2b visualization demo
 
-For an end-to-end example showing how to author, convert, quantize and execute, please refer to the steps [here](https://github.com/google-ai-edge/ai-edge-torch-archive/blob/main/ai_edge_torch/generative/examples/README.md)
+For an end-to-end example showing how to author, convert, quantize and execute, please refer to the steps [here](https://github.com/google-ai-edge/ai-edge-torch/blob/main/ai_edge_torch/generative/examples/README.md)
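For readers who choose the custom-pipeline route described in this hunk, the sketch below is a rough Python analogue of what `text_generator_main.cc` does in C++: prefill the KV cache once with the prompt, then greedily decode one token at a time through the two signatures. The signature input/output names, dtypes and the fixed prefill length are assumptions about one particular exported model, so check your model's signature defs before reusing it.

```python
# Rough sketch of a hand-rolled generation loop over a multi-signature .tflite
# model. Tensor names ('tokens', 'input_pos'), dtypes and PREFILL_SEQ_LEN are
# assumptions; inspect your model's signatures before reusing this.
import numpy as np
import tensorflow as tf

PREFILL_SEQ_LEN = 512  # must match the static shape the model was exported with

interpreter = tf.lite.Interpreter(model_path='model.tflite')
prefill = interpreter.get_signature_runner('prefill')
decode = interpreter.get_signature_runner('decode')

prompt_ids = np.array([1, 2, 3, 4], dtype=np.int64)  # already-tokenized prompt

# 1) Prefill: run the padded prompt once to populate the KV cache.
tokens = np.zeros((1, PREFILL_SEQ_LEN), dtype=np.int64)
tokens[0, : prompt_ids.size] = prompt_ids
prefill(tokens=tokens, input_pos=np.arange(PREFILL_SEQ_LEN, dtype=np.int64))

# 2) Decode: feed back the greedy (argmax) token one position at a time.
next_token, pos = int(prompt_ids[-1]), prompt_ids.size
for _ in range(32):
  out = decode(
      tokens=np.array([[next_token]], dtype=np.int64),
      input_pos=np.array([pos], dtype=np.int64),
  )
  logits = next(iter(out.values()))  # assumes a single logits output
  next_token = int(np.argmax(logits[0, -1]))
  pos += 1
  print(next_token, end=' ')
```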
 ## What to expect
diff --git a/ai_edge_torch/generative/examples/README.md b/ai_edge_torch/generative/examples/README.md
index aff62dec..87491208 100644
--- a/ai_edge_torch/generative/examples/README.md
+++ b/ai_edge_torch/generative/examples/README.md
@@ -22,10 +22,10 @@ For each of the example models, we have a model definition file (e.g. tiny_llama
 Here we use `TinyLlama` as an example to walk you through the authoring steps.
 
 #### Define model's structure
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/e54638dd4a91ec09115f9ded1bd5540f3f1a4e68/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L43-L74
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L46-L77
 
 #### Define model's forward function
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/e54638dd4a91ec09115f9ded1bd5540f3f1a4e68/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L79-L101
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L79-L104
 
 Now, you will have an `nn.Module` named `TinyLlama`, the next step is to restore the weights from orginal checkpoint into the new model.
@@ -37,12 +37,12 @@ place to simplify the `state_dict` mapping process (`utilities/loader.py`). The
 user needs to provide a layer name tempelate (TensorNames) for the source model. This tempelate is then used to create an updated `state_dict` that works with the mapped model.
 
 The tensor map includes the following fields:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/3b753d80fdf00872baac523dc727b87b3dc271e7/ai_edge_torch/generative/utilities/loader.py#L120-L134
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/utilities/loader.py#L94-L109
 
 The fields that have a default value of `None` are optional and should only be populated if they are relevant to the model architecture.
 
 For `TinyLlama`, we will define the following `TENSOR_NAMES`:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/e54638dd4a91ec09115f9ded1bd5540f3f1a4e68/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L27-L40
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/tiny_llama/tiny_llama.py#L30-L43
 
 With the `TensorNames` defined, a user can simply use the loading utils to load an instance of the mapped model. For instance:
@@ -59,7 +59,7 @@ using a few input samples before proceeding to the conversion step.
 
 ### Model conversion
 In this step, we use the `ai_edge_torch`'s standard multi-signature conversion API to convert PyTorch `nn.Module` to a single TFLite flatbuffer for on-device execution. For example, in `tiny_llama/convert_to_tflite.py`, we use this python code to convert the `TinyLLama` model to a multi-signature TFLite model:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/3b753d80fdf00872baac523dc727b87b3dc271e7/ai_edge_torch/generative/examples/tiny_llama/convert_to_tflite.py#L22-L53
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/examples/tiny_llama/convert_to_tflite.py#L26-L61
 
 Once converted, you will get a `.tflite` model which will be ready for on-device execution. Note that the `.tflite` model generated uses static shapes.
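Putting the restore-weights and validation steps from this hunk together, the typical flow is: build the re-authored model, let the loader remap the original checkpoint through the `TensorNames` template, and run a quick numerical sanity check before conversion. The sketch below assumes the `build_model` helper exposed by the TinyLlama example; the checkpoint path and sample input shapes are placeholders.

```python
# Hedged sketch of restoring original weights into the re-authored TinyLlama
# and sanity-checking it before conversion. The checkpoint path and sequence
# length are placeholders; build_model's exact signature is in tiny_llama.py.
import torch
from ai_edge_torch.generative.examples.tiny_llama import tiny_llama

# build_model constructs the re-authored nn.Module and uses the loader
# utilities (TENSOR_NAMES + ModelLoader) to map the original checkpoint into it.
model = tiny_llama.build_model('/path/to/tiny_llama_checkpoint')
model.eval()

# Numerical sanity check on a few sample inputs before converting to TFLite.
tokens = torch.zeros((1, 10), dtype=torch.long)
input_pos = torch.arange(0, 10)
with torch.no_grad():
  logits = model(tokens, input_pos)
print(logits.shape)  # compare against the original model's outputs
```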
 Inside the generated `.tflite` model, there will be two signatures defined (two entrypoints to the model):
 1) `prefill`: taking 2 tensor inputs `prefill_tokens`, `prefill_input_pos`. With shape `(BATCH_SIZE, PREFILL_SEQ_LEN)` and `(PREFILL_SEQ_LEN)`.
diff --git a/ai_edge_torch/generative/layers/README.md b/ai_edge_torch/generative/layers/README.md
index ef06f663..78d64188 100644
--- a/ai_edge_torch/generative/layers/README.md
+++ b/ai_edge_torch/generative/layers/README.md
@@ -43,4 +43,4 @@ Currently, the library provides the following configuration class for you to cus
 ## High-Level function boundary for performance
 We introduce High-Level Function Boundary (HLFB) as a way of annotating performance-critical pieces of the model (e.g. `scaled_dot_product_attention`, or `KVCache`). HLFB allows the converter to lower the annotated blocks to performant TFLite custom ops. Following is an example of applying HLFB to `SDPA`:
-https://github.com/google-ai-edge/ai-edge-torch-archive/blob/3b753d80fdf00872baac523dc727b87b3dc271e7/ai_edge_torch/generative/layers/attention.py#L74-L122
+https://github.com/google-ai-edge/ai-edge-torch/blob/853301630f2b2455bd2e2f73d8a47e1a1534c91c/ai_edge_torch/generative/layers/attention.py#L74-L122
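To make the HLFB idea in the last hunk concrete: a performance-critical block is bracketed with `mark_inputs`/`mark_outputs` on a `StableHLOCompositeBuilder`, so the converter sees the whole region as a single composite and can lower it to a fused TFLite op. The sketch below shows the pattern only; the composite name and the plain `scaled_dot_product_attention` body are illustrative assumptions, not the library's actual `attention.py` implementation.

```python
# Minimal sketch of the HLFB pattern: mark the inputs/outputs of a block so the
# converter can lower it as one composite op. The composite name and the
# attention body are illustrative, not the library's attention.py code.
import math

import torch
import torch.nn.functional as F
from ai_edge_torch.hlfb import StableHLOCompositeBuilder


def sdpa_with_hlfb(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
  head_dim = q.shape[-1]
  builder = StableHLOCompositeBuilder(name='odml.scaled_dot_product_attention')
  # Everything between mark_inputs and mark_outputs becomes one composite.
  q, k, v = builder.mark_inputs(q, k, v)
  out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(head_dim))
  return builder.mark_outputs(out)
```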