Variable batch size and LR scheduler #5237
base: master
Conversation
Great work, and waiting for this.
Thank you @npuichigo. For the time being, you can use this as in the example (first initialize deepspeed to get the deepspeed engine, then call …).
@bm-synth First of all, you still didn't provide any evidence that "variable batch size and LR scheduler helps improve model quality". But anyway, I understand some users just want to do it, so we can accept this PR. Regarding your question about curriculum learning: (1) handling of multiple metrics is at DeepSpeed/deepspeed/runtime/data_pipeline/data_sampling/data_sampler.py, lines 184 to 201 (commit aaaf8bc).
Hi @bm-synth - thanks for your continued work on this PR, do you think it is ready to be merged?
@loadams this can be merged. It was on hold for a while because I was looking for a better way for the user to interface with this, i.e. a clean way for the user to define in the deepspeed config which curriculum metric to use for this variable batch size module (this comment here). I suggest we merge this now, and I'll work on the better interfacing in the next couple of weeks during the Xmas break, in a different PR. Thank you.
Background and rationale
In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count rather than by a fixed batch size, i.e. by putting together sentences whose given metric (e.g. sequence length) adds up to a user-provided value. As an example, Attention is all you need (section 5.1) batches sentence pairs by approximate sequence length, with each batch containing a fixed approximate number of source and target tokens.
Dynamic batch sizes have been requested in DeepSpeed issue 1051, DeepSpeed issue 3455, PyTorch Lightning issue 16914 and huggingface issue 2647, and are already available in many libraries, e.g. NVIDIA Triton and Meta FairSeq (implementation here).
The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a `BxTxE` (Batch x Time x Embedding)-shaped input should ideally have a high `B` and a low `T` at the early curriculum steps (many short sentences packed together as a batch), and a low `B` and a high `T` at the late steps (few long sentences in the batch). A dynamic size `T` is already supported by DeepSpeed, e.g. as described in the documentation for pipeline parallelism's `reset_activation_shape()`. However, a dynamic `B` is not supported. A dynamic `B` would require an adequate increase/decrease of the learning rate. This technique has been applied before, and the two most common LR scaling rules are linear scaling (scale the LR proportionally to the batch size) and square-root scaling (scale the LR with the square root of the batch-size ratio). In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. During runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the reference LR and batch size provided in the config.
Illustration of dynamic batch size, sequence length and LR
Imagine we picked a limit of 30 tokens per batch, and have set a reference `lr=1e-3` for a `train_batch_size=2` (in the deepspeed config). The batching algorithm for curriculum may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at the later stages, e.g.:

Above, we collected samples until we filled up the batch with at most 30 tokens. The batch sizes (number of samples) then became 10 and 4 for the left and right examples, respectively. Using the linear scaling rule, the LRs for those batches become 5e-3 and 2e-3.
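To make these numbers concrete, here is a small self-contained sketch of greedy packing by token count followed by linear LR scaling; it is illustrative only and does not reproduce the PR's `batch_by_size` logic.

```python
# Greedy packing by token count plus linear LR scaling (illustrative sketch only).

def pack_by_token_count(seqlens, max_tokens_per_batch):
    """Group sample indices greedily so each batch holds at most max_tokens_per_batch tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, seqlen in enumerate(seqlens):
        if current and current_tokens + seqlen > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += seqlen
    if current:
        batches.append(current)
    return batches

short_seqlens = [3] * 10   # early curriculum: 10 short sentences -> one batch of B=10
long_seqlens = [7] * 4     # late curriculum: 4 long sentences -> one batch of B=4

ref_lr, ref_batch_size = 1e-3, 2   # reference values from the deepspeed config
for seqlens in (short_seqlens, long_seqlens):
    for batch in pack_by_token_count(seqlens, max_tokens_per_batch=30):
        lr = ref_lr * len(batch) / ref_batch_size   # linear scaling rule
        print(f"batch size {len(batch)}, scaled lr {lr:.0e}")
# batch size 10, scaled lr 5e-03
# batch size 4, scaled lr 2e-03
```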
Pipeline parallelism
Pipeline parallelism requires the same batch size and same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches, these may change, as long as `engine.reset_activation_shape()` is called so that the new shapes are communicated on the first gradient accumulation step of the batch. Enforcing a similar `BxTxE` between batches may lead to smaller micro-batches. As an example, below we can see an illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching for the same dataset, when preparing data for the regular DDP (left) and for the pipeline parallelism use case (right):

We can see that the pipeline use case (right) has the same `BxTxE` shape across all 4 micro-batches in the same batch, and in order to respect that, it packs fewer samples in the batch when compared to the standard use case (left-hand side).
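As a rough sketch (not code from this PR), a pipeline-parallel training loop could handle shape changes between batches along these lines, assuming `engine` is a DeepSpeed pipeline engine and `dataloaders` yields one micro-batch iterator per batch, each with a fixed shape within the batch:

```python
# Sketch only: shapes may differ between batches, but the engine must be told to
# re-communicate activation shapes on the first micro-batch of each new batch.
for batch_id, batch_iter in enumerate(dataloaders):
    if batch_id > 0:
        engine.reset_activation_shape()
    loss = engine.train_batch(data_iter=batch_iter)
```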
Attention Head

For an input of size `BxTxE`, the attention has a shape of `TxT` for a mask of fixed size across samples of the same size, or `BxTxT` for a different mask per sample (when samples have different sizes, as in the dataset above). This 3D attention matrix can be illustrated for the DDP micro-batch 1 (picture above, top-left, 4 sentences) as:

Note the memory savings: the attention head has a size of `BxTxT`, i.e. a linear memory dependency on the batch size `B` and a quadratic memory dependency on the largest sequence length `T` in the (micro-)batch. Thus, supporting a dynamic size `T` allows for an increase of `B`.
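For illustration, a short sketch (not from this PR) of how such a per-sample `BxTxT` padding mask could be built for a micro-batch of variable-length sentences:

```python
import torch

seqlens = torch.tensor([5, 2, 7, 3])                   # 4 sentences of different lengths (B=4)
T = int(seqlens.max())                                 # pad to the longest sequence in the micro-batch
valid = torch.arange(T)[None, :] < seqlens[:, None]    # BxT: True on real (non-padded) tokens
attn_mask = valid[:, None, :] & valid[:, :, None]      # BxTxT: one mask per sample
print(attn_mask.shape)                                 # torch.Size([4, 7, 7]) -> memory grows as B*T*T
```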
PR overview

This PR implements dynamic batching and LR scaling. The dataloader and LR scheduler necessary can be retrieved by calling `get_dataloader_and_lr_scheduler_for_variable_batch_size`. A small explanation of that function follows:

- the LR scaling logic is implemented in `scale_lr`;
- the packing of samples into batches is done by `batch_by_size`;
- for the pipeline parallelism use case, the following parameters should be set to `True`:
  - `required_microbatches_of_same_sizes`, which will force the `B` dimension to be the same across all gradient accumulation steps of all dataloaders on a batch;
  - `required_microbatches_of_same_lengths`, which will force the `T` dimension to be the same across all gradient accumulation steps. It works by calling the user-provided `sample_padding_fn(sentence, len)` that pads a given sentence to the argument length;
- `batch_by_size` returns `microbatch_sample_ids` (the list of sample ids per micro-batch), `batch_sizes` (the size of each effective batch) and `batch_max_seqlens` (the longest sequence across all micro-batches in a batch);
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids` and will iterate/collate/pad samples for every batch, returning a dataloader that iterates over the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file, and scaling the LR based on the size of each effective batch and the scaling rule mentioned above (linear, square root, etc.);
- the `lr_scheduler` returned will accept either an `Optimizer`, in which case it will scale its learning rates (in param groups) at every batch, or an `LRScheduler`, in which case it will first get the learning rate from the scheduler and then scale it accordingly.
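For orientation only, a hedged sketch of how the returned dataloader and LR scheduler might be wired into a training loop; every keyword argument shown here is an assumption made for illustration, not the function's confirmed signature (see the example file in the next section for actual usage):

```python
# Hypothetical argument names (dataset_seqlens, max_tokens_per_batch, base_batch_size,
# base_lr, lr_scaling_method, ...) -- consult the example file for the real signature.
# Only the parameters named in the overview above come from the PR text.
dataloader, lr_scheduler = get_dataloader_and_lr_scheduler_for_variable_batch_size(
    dataset=dataset,
    dataset_seqlens=seqlens,                    # per-sample metric that drives the packing
    max_tokens_per_batch=30,                    # token budget per effective batch
    base_batch_size=2, base_lr=1e-3,            # reference values from the deepspeed config
    lr_scaling_method="linear",                 # or square-root, per the rules above
    required_microbatches_of_same_sizes=True,   # needed for pipeline parallelism
    required_microbatches_of_same_lengths=True,
    sample_padding_fn=pad_sentence_to_length,   # user-provided padding function
    optimizer=optimizer,                        # its param-group LRs get rescaled per batch
)

for batch in dataloader:
    loss = engine(*batch)       # forward through the deepspeed engine
    engine.backward(loss)
    engine.step()
    lr_scheduler.step()         # rescale the LR for the next variable-size batch
```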
Example

An example for the use case with and without pipelining is provided in `deepspeed/runtime/data_pipeline/data_sampling/variable_batch_size_and_lr_example.py`. The example shows an attention head with an attention matrix of variable size `BxTxT` per batch, followed by a fixed-size feed-forward network. These are the main blocks of a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see `feedforward: needs to convert BxTxE to BxMxE by padding extra tokens` in the code).
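As a minimal sketch of that padding step (with `M` standing for the length of the longest sentence in the whole dataset, a name used here only for illustration):

```python
import torch
import torch.nn.functional as F

B, T, E, M = 4, 7, 16, 12                    # illustrative sizes, with T <= M
attn_out = torch.randn(B, T, E)              # attention output for this (micro-)batch
padded = F.pad(attn_out, (0, 0, 0, M - T))   # pad the T (time) dimension up to M
print(padded.shape)                          # torch.Size([4, 12, 16]) -> fixed BxMxE input
```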