From 453d4aeb3b47430c7e4f9aac7f7d96fb9cec6548 Mon Sep 17 00:00:00 2001
From: Aman Gupta
Date: Wed, 27 Mar 2024 15:02:34 +0530
Subject: [PATCH] Adding doc for pipeline parameters for NLP FT (#3076)

* Adding doc for pipeline parameters for NLP FT
* Addressed comments
---
 .../finetune_component_parameters.md | 220 ++++++++++++++++++
 1 file changed, 220 insertions(+)
 create mode 100644 sdk/python/foundation-models/system/docs/component_docs/nlp_finetune/finetune_component_parameters.md

diff --git a/sdk/python/foundation-models/system/docs/component_docs/nlp_finetune/finetune_component_parameters.md b/sdk/python/foundation-models/system/docs/component_docs/nlp_finetune/finetune_component_parameters.md
new file mode 100644
index 00000000000..bda3755c53a
--- /dev/null
+++ b/sdk/python/foundation-models/system/docs/component_docs/nlp_finetune/finetune_component_parameters.md
@@ -0,0 +1,220 @@
# Finetune Pipeline Component
This component enables finetuning of pretrained models on custom datasets and supports DeepSpeed for performance enhancement.

The component supports the following optimizations:
1. Parameter-efficient finetuning with techniques such as LoRA.
2. Multi-GPU finetuning using Distributed Data Parallel (DDP) and DeepSpeed.
3. Mixed precision training.
4. Multi-node training.
5. Flash attention to speed up finetuning and reduce the memory footprint.

At the time of writing, the following tasks are supported through the finetuning components:

| Task | Notebook |
| --- | --- |
| Text Generation | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/Llama-notebooks/text-generation |
| Text Classification | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/text-classification |
| Named Entity Recognition/Token Classification | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/token-classification |
| Question Answering | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/question-answering |
| Summarization | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/summarization |
| Translation | https://github.com/Azure/azureml-examples/tree/main/sdk/python/foundation-models/system/finetune/translation |

The corresponding pipeline components can be found in the `azureml` registry:
- [text_generation_pipeline](https://ml.azure.com/registries/azureml/components/text_generation_pipeline)
- [text_classification_pipeline](https://ml.azure.com/registries/azureml/components/text_classification_pipeline)
- [token_classification_pipeline](https://ml.azure.com/registries/azureml/components/token_classification_pipeline)
- [question_answering_pipeline](https://ml.azure.com/registries/azureml/components/question_answering_pipeline)
- [summarization_pipeline](https://ml.azure.com/registries/azureml/components/summarization_pipeline)
- [translation_pipeline](https://ml.azure.com/registries/azureml/components/translation_pipeline)
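A minimal sketch of fetching one of these pipeline components from the `azureml` registry with the Azure ML Python SDK v2 (`azure-ai-ml`) could look like the following; the choice of `text_classification_pipeline` and the `latest` label are illustrative, and the task notebooks linked above show the full end-to-end flow.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Client scoped to the shared "azureml" registry that hosts the pipeline components.
registry_ml_client = MLClient(credential=credential, registry_name="azureml")

# Fetch the pipeline component for the task of interest (here: text classification).
text_classification_pipeline = registry_ml_client.components.get(
    name="text_classification_pipeline", label="latest"
)
```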
# 1. Inputs
## Model related inputs
- _pytorch_model_path_ (custom_model, optional)
  PyTorch model asset path. Special characters like \ and ' are invalid in the parameter value.

- _mlflow_model_path_ (mlflow_model, optional)
  MLflow model asset path. Special characters like \ and ' are invalid in the parameter value.

**Note: One of the above two inputs is required.**

## Dataset related inputs
- _train_file_path_ (uri_file, optional)
  Path to the registered training data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.

- _validation_file_path_ (uri_file, optional)
  Path to the registered validation data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.

- _test_file_path_ (uri_file, optional)
  Path to the registered test data asset. The supported data formats are `jsonl`, `json`, `csv`, `tsv` and `parquet`. Special characters like \ and ' are invalid in the parameter value.

- _train_mltable_path_ (MLTABLE, optional)
  Path to the registered training data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.

- _validation_mltable_path_ (MLTABLE, optional)
  Path to the registered validation data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.

- _test_mltable_path_ (MLTABLE, optional)
  Path to the registered test data asset in `mltable` format. Special characters like \ and ' are invalid in the parameter value.

## Compute related inputs
- _compute_model_import_ (string, optional, default: "serverless")
  Compute to be used for the model_import step, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the corresponding instance_type field is ignored and that cluster is used.

- _compute_preprocess_ (string, optional, default: "serverless")
  Compute to be used for the preprocess step, e.g. provide 'FT-Cluster' if your compute is named 'FT-Cluster'. Special characters like \ and ' are invalid in the parameter value. If a compute cluster name is provided, the corresponding instance_type field is ignored and that cluster is used.

- _compute_finetune_ (string, optional, default: "serverless")
  Compute to be used for the finetune step. **NOTE: This has to be a GPU compute.**

- _compute_model_evaluation_ (string, optional, default: "serverless")
  Compute to be used for the model_evaluation step.

### Serverless compute related parameters (used only if the corresponding compute is serverless)
- _instance_type_model_import_ (string, optional, default: "Standard_d12_v2")
  Instance type to be used for the model_import component in case of serverless compute, e.g. standard_d12_v2. The parameter compute_model_import must be set to 'serverless' for instance_type to be used.

- _instance_type_preprocess_ (string, optional, default: "Standard_d12_v2")
  Instance type to be used for the preprocess component in case of serverless compute, e.g. standard_d12_v2. The parameter compute_preprocess must be set to 'serverless' for instance_type to be used.

- _instance_type_finetune_ (string, optional, default: "Standard_nc24rs_v3")
  Instance type to be used for the finetune component in case of serverless compute, e.g. standard_nc6. The parameter compute_finetune must be set to 'serverless' for instance_type to be used.

- _instance_type_model_evaluation_ (string, optional, default: "Standard_nc24rs_v3")
  Instance type to be used for the model_evaluation component in case of serverless compute, e.g. standard_nc6. The parameter compute_model_evaluation must be set to 'serverless' for instance_type to be used.
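Building on the registry snippet above, the hedged sketch below shows how the model, data, and compute inputs might be wired into a pipeline job with the SDK v2. The workspace details, model name, data paths, compute values, and experiment name are placeholders, and the task-specific data column parameters that each pipeline also requires are omitted here; see the task notebooks linked earlier for complete examples.

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Registry client for the component (as in the previous snippet) and a
# workspace client for submitting the pipeline job (placeholders below).
registry_ml_client = MLClient(credential=credential, registry_name="azureml")
workspace_ml_client = MLClient(
    credential=credential,
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

text_classification_pipeline = registry_ml_client.components.get(
    name="text_classification_pipeline", label="latest"
)


@pipeline()
def nlp_finetune_pipeline():
    finetune_job = text_classification_pipeline(
        # Model input: an MLflow model from the azureml registry (model name is illustrative).
        mlflow_model_path=Input(
            type="mlflow_model",
            path="azureml://registries/azureml/models/bert-base-uncased/labels/latest",
        ),
        # Data inputs: JSONL files uploaded as uri_file assets on submission.
        train_file_path=Input(type="uri_file", path="./data/train.jsonl"),
        validation_file_path=Input(type="uri_file", path="./data/validation.jsonl"),
        test_file_path=Input(type="uri_file", path="./data/test.jsonl"),
        # Compute: serverless, so the instance_type_* value applies. Passing a named
        # GPU cluster instead (e.g. compute_finetune="gpu-cluster") would make the
        # corresponding instance_type_* value irrelevant.
        compute_finetune="serverless",
        instance_type_finetune="Standard_nc24rs_v3",
    )
    # Map the component output to a pipeline-level output.
    return {"trained_model": finetune_job.outputs.mlflow_model_folder}


pipeline_job = nlp_finetune_pipeline()
submitted_job = workspace_ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="nlp-finetune-demo"
)
```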
## MultiNode and MultiGPU training parameters
- _number_of_gpu_to_use_finetuning_ (integer, optional, default: 1)
  Number of GPUs to be used per node for finetuning; this should be equal to the number of GPUs per node in the compute SKU used for finetuning.

- _num_nodes_finetune_ (integer, optional, default: 1)
  Number of nodes to be used for finetuning (used for distributed training).

## Data Preprocessing parameters
- _batch_size_ (integer, optional, default: 1000)
  Number of examples to batch before calling the tokenization function.

- _max_seq_length_ (integer, optional, default: -1)
  The default of -1 means padding is done up to the model's max length; otherwise, sequences are padded to `max_seq_length`.

- _pad_to_max_length_ (string, optional, default: "false")
  If set to true, the returned sequences will be padded according to the model's padding side and padding index, up to `max_seq_length`. If no `max_seq_length` is specified, the padding is done up to the model's max length.

## Finetuning parameters
### LoRA parameters
- _apply_lora_ (string, optional, default: false, allowed_values: [true, false])
  Whether to enable LoRA for finetuning. If set to true, LoRA will be applied to the model.

- _lora_alpha_ (integer, optional, default: 128)
  LoRA attention alpha.

- _lora_r_ (integer, optional, default: 8)
  LoRA dimension (rank).

- _lora_dropout_ (number, optional, default: 0.0)
  LoRA dropout value.

### Deepspeed parameters
- _apply_deepspeed_ (bool, optional, default: false)
  If set to true, enables DeepSpeed.

- _deepspeed_ (uri_file, optional)
  DeepSpeed config to be used for finetuning. If no `deepspeed` config is provided, the default config in the component will be used; otherwise the user-provided config will be used.

- _deepspeed_stage_ (string, optional, default: "2")
  DeepSpeed stage to be used for finetuning. It can be one of [`2`, `3`]. Value `3` enables model sharding across GPUs, which is useful if the model does not fit on a single GPU.
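For the `deepspeed` input above, the config is a standard DeepSpeed JSON file. As an illustration only (the component's built-in default config may differ), a minimal ZeRO stage-2 config could be written out and passed as a `uri_file` input as sketched below; the file name is arbitrary.

```python
import json

from azure.ai.ml import Input

# Minimal, illustrative ZeRO stage-2 DeepSpeed config. The "auto" values let the
# Hugging Face trainer integration fill in settings that must stay consistent
# with its own training arguments.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
}

with open("ds_zero2_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The file can then be supplied to the pipeline component alongside
# apply_deepspeed="true" and deepspeed_stage="2".
deepspeed_input = Input(type="uri_file", path="./ds_zero2_config.json")
```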
### Training parameters
- _num_train_epochs_ (int, optional, default: 1)
  Number of epochs to run for finetuning.

- _max_steps_ (int, optional, default: -1)
  If set to a positive number, the total number of training steps to perform. Overrides `num_train_epochs`. When a finite iterable dataset is used, training may stop before the set number of steps if all data is exhausted.

- _per_device_train_batch_size_ (integer, optional, default: 1)
  Training batch size per device.

- _per_device_eval_batch_size_ (integer, optional, default: 1)
  Validation batch size per device.

- _auto_find_batch_size_ (bool, optional, default: false)
  If set to true, the training batch size is automatically downscaled recursively until a valid batch size that fits into memory is found. If the provided `per_device_train_batch_size` results in an Out Of Memory (OOM) error, enabling `auto_find_batch_size` finds a working batch size by iteratively halving `per_device_train_batch_size` until the OOM is resolved.

- _learning_rate_ (number, optional, default: 0.00002)
  Initial learning rate used for training. Defaults to a linear scheduler.

- _lr_scheduler_type_ (string, optional, default: linear)
  The learning rate scheduler to use. It can be one of [`linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `constant`, `constant_with_warmup`]. If left empty, it is chosen automatically based on the task type and model selected.

- _warmup_steps_ (integer, optional, default: 0)
  Number of steps used for a linear warmup from 0 to `learning_rate`.

- _optim_ (string, optional, default: adamw_hf)
  Optimizer to be used while training. It can be one of [`adamw_hf`, `adamw_torch`, `adafactor`]. If left empty, it is chosen automatically based on the task type and model selected.

- _weight_decay_ (number, optional)
  The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the AdamW optimizer.

- _gradient_accumulation_steps_ (integer, optional, default: 1)
  Number of update steps to accumulate the gradients for before performing a backward/update pass.

- _precision_ (string, optional, default: "32")
  Apply mixed precision training. This can reduce the memory footprint by performing operations in half precision. It can be one of [`16`, `32`].

- _seed_ (int, optional, default: 42)
  Random seed that will be set at the beginning of training.

- _evaluation_strategy_ (string, optional, default: epoch)
  The evaluation strategy to adopt during training. It can be one of [`epoch`, `steps`].

- _eval_steps_ (int, optional, default: 500)
  Number of update steps between two evaluations if evaluation_strategy='steps'.

- _logging_strategy_ (string, optional, default: steps)
  The logging strategy to adopt during training. It can be one of [`epoch`, `steps`].

- _logging_steps_ (integer, optional, default: 10)
  Number of update steps between two logs if logging_strategy='steps'.

- _save_total_limit_ (integer, optional, default: -1)
  If a value is passed, limits the total number of checkpoints and deletes the older checkpoints in output_dir. If the value is -1, all checkpoints are saved.

- _apply_early_stopping_ (string, optional, default: "false")
  If set to true, early stopping is enabled.

- _early_stopping_patience_ (int, optional, default: 1)
  Stop training when the specified metric worsens for `early_stopping_patience` evaluation calls.

- _max_grad_norm_ (number, optional, default: 1.0)
  Maximum gradient norm (for gradient clipping).

- _resume_from_checkpoint_ (string, optional, default: "false")
  If set to true, resumes training from the last saved checkpoint. Along with the saved weights, the saved optimizer, scheduler and random states are loaded if they exist.

# 2. Outputs
- _pytorch_model_folder_ (uri_folder)
  Output folder containing the _best_ model as defined by _metric_for_best_model_. Along with the best model, the output folder contains checkpoints saved after every evaluation, as defined by _evaluation_strategy_. Each checkpoint contains the model weights, config, tokenizer, optimizer, scheduler and random number states.

- _mlflow_model_folder_ (mlflow_model)
  Output folder containing the _best_ finetuned model in MLflow format.

- _evaluation_result_ (uri_folder)
  Test data evaluation results.
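As a possible follow-up, the `mlflow_model_folder` output can be registered as a model in the workspace once the pipeline job completes. The hedged sketch below assumes the submission example shown earlier, where that output was mapped to a pipeline-level output named `trained_model`; the job name, model name, and workspace details are placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

workspace_ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

# Name of the completed pipeline job submitted in the earlier sketch.
job_name = "<PIPELINE_JOB_NAME>"

finetuned_model = Model(
    # "trained_model" is the pipeline-level output mapped to mlflow_model_folder.
    path=f"azureml://jobs/{job_name}/outputs/trained_model",
    type=AssetTypes.MLFLOW_MODEL,
    name="my-finetuned-nlp-model",
    description="Model finetuned with the NLP finetune pipeline component.",
)
registered_model = workspace_ml_client.models.create_or_update(finetuned_model)
```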