Adding FSDP Support to Training Library #213

Merged
merged 47 commits into from
Sep 26, 2024
Changes from 46 commits
47 commits
c6b14f4 increased performance of data process and made a flag for the number … (aldo-pareja, Aug 14, 2024)
cdca42c Refactor special tokens handling and improve chat template flexibility (aldo-pareja, Aug 27, 2024)
1114539 Refactor data processing and tokenization (aldo-pareja, Aug 27, 2024)
2e33add added MixtralForCausalLM as another supported model (aldo-pareja, Aug 27, 2024)
80e65d3 iterim branch exploring accelerate from fix_mistral_template (aldo-pareja, Sep 11, 2024)
802f74f accelerate testing (aldo-pareja, Sep 17, 2024)
d4851d1 fixed data process (aldo-pareja, Sep 17, 2024)
ca10236 accelerate works on deepspeed (aldo-pareja, Sep 17, 2024)
4a89949 fsdp first try (aldo-pareja, Sep 17, 2024)
5c3d96c testing both fsdp and deepspeed (aldo-pareja, Sep 18, 2024)
cb9fec7 made deepspeed the default (aldo-pareja, Sep 18, 2024)
32b7265 black formatting (aldo-pareja, Sep 18, 2024)
7a2ca02 fixed typo on batch size logging (aldo-pareja, Sep 18, 2024)
b02fa03 added samples seen to the logging (aldo-pareja, Sep 18, 2024)
b736c25 fsdp needs a different optimizer workflow (aldo-pareja, Sep 18, 2024)
6d4bb46 black formatting (aldo-pareja, Sep 18, 2024)
64595d4 removed weight decay from the fsdp optimizer (aldo-pareja, Sep 18, 2024)
5f582c1 fixed a merge conflict (aldo-pareja, Sep 18, 2024)
ca274f6 aligning with the latest (aldo-pareja, Sep 18, 2024)
85ac691 fixing linting errors (aldo-pareja, Sep 18, 2024)
2f96481 update: rename arguments and add config options in trainingargs (RobotSail, Sep 18, 2024)
807d9f7 update init.py (RobotSail, Sep 18, 2024)
3907e38 bug fixes (RobotSail, Sep 18, 2024)
0208e64 Added cpu offloading for fsdp (Maxusmusti, Sep 18, 2024)
b5e6d38 Move sharding strategy to fsdp options (Maxusmusti, Sep 18, 2024)
d3010a7 Add lora ckpt saving (first pass) (Maxusmusti, Sep 19, 2024)
d9b71c0 Re-add save per epoch (Maxusmusti, Sep 19, 2024)
fa8c1ba Adding lora/qlora saving pt2 (Maxusmusti, Sep 20, 2024)
291acdb Connect new lora save patch (Maxusmusti, Sep 20, 2024)
eff3fef Fix lora ckpt saving (Maxusmusti, Sep 20, 2024)
edf7618 Load/save fixes assuming rc is allowed (Maxusmusti, Sep 23, 2024)
14f2b08 Fix for deepspeed accelerate reqs (Maxusmusti, Sep 23, 2024)
fdbe288 Add accelerate 0.34 patch for saving (Maxusmusti, Sep 24, 2024)
8fb86ca Re-add dolomite conversion for ckpts (Maxusmusti, Sep 24, 2024)
ffb7fab Clean up extraneous stuff (Maxusmusti, Sep 24, 2024)
2b35ed1 Remove breaking check (Maxusmusti, Sep 24, 2024)
065df3e Feedback round 1 (Maxusmusti, Sep 25, 2024)
e472b4d Feedback round 2 (Maxusmusti, Sep 25, 2024)
8f8dc8f Make sure model is passed back in case it is updated in setup_optimizer (Maxusmusti, Sep 25, 2024)
72abd67 Remove branching prepare statements (Maxusmusti, Sep 25, 2024)
7f7e9f2 Add note about commented data process code (Maxusmusti, Sep 25, 2024)
723a85e Adding more notes for code readers (Maxusmusti, Sep 25, 2024)
7cd6747 minor bug fixes & improvements (RobotSail, Sep 25, 2024)
95eb2c0 update docs to include fsdp info, rename TrainingArgs.distributed_tra… (RobotSail, Sep 25, 2024)
dd47117 Lower max shard size (Maxusmusti, Sep 25, 2024)
4cdfb8d Remove extra comment (Maxusmusti, Sep 25, 2024)
70ff83c Enum security (Maxusmusti, Sep 26, 2024)
3 changes: 2 additions & 1 deletion .pylintrc
@@ -472,7 +472,8 @@ disable=raw-checker-failed,
         consider-using-generator,
         broad-exception-caught,
         super-init-not-called,
-        duplicate-code
+        duplicate-code,
+        too-many-positional-arguments

 # Enable the message, report, category or checker with the given id(s). You can
 # either give multiple identifier separated by comma (,) or put this option
23 changes: 22 additions & 1 deletion README.md
@@ -131,6 +131,11 @@ Here is a breakdown of the general options:
 | mock_data_len | Max length of a single mock data sample. Equivalent to `max_seq_len` but for mock data. |
 | deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
 | lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
+| chat_tmpl_path | Specifies the chat template / special tokens for training. |
+| checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
+| fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
+| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
+| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |

 #### `DeepSpeedOptions`

@@ -141,8 +146,24 @@ allow you to customize aspects of the ZeRO stage 2 optimizer.
 | Field | Description |
 | --- | --- |
 | cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
+| cpu_offload_optimizer_ratio | Floating point between 0 & 1. Specifies the ratio of parameters updating (i.e. optimizer step) on CPU side. |
+| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
+| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

-#### `loraOptions`
+#### `FSDPOptions`
+
+Like DeepSpeed, we only expose a number of parameters for you to modify with FSDP.
+They are listed below:
+
+| Field | Description |
+| --- | --- |
+| cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
+| sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are: `FULL_SHARD` (ZeRO-3), `HYBRID_SHARD` (ZeRO-3*), `SHARD_GRAD_OP` (ZeRO-2), and `NO_SHARD`. |
+
+> [!NOTE]
+> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.
+
+#### `LoraOptions`

 If you'd like to do a LoRA train, you can specify a LoRA
 option to `TrainingArgs` via the `LoraOptions` object.
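
The `sharding_strategy` values in the `FSDPOptions` table above can be read as a small lookup. The sketch below is illustrative only: it re-declares the enum with the standard library instead of importing the package, and `ZERO_ANALOGUE` and `describe_strategy` are hypothetical names that are not part of this PR.

```python
from enum import Enum


class ShardingStrategies(Enum):
    # Stand-in mirroring the enum added in config.py.
    FULL_SHARD = "FULL_SHARD"
    SHARD_GRAD_OP = "SHARD_GRAD_OP"
    NO_SHARD = "NO_SHARD"
    HYBRID_SHARD = "HYBRID_SHARD"


# ZeRO analogues exactly as listed in the README table; NO_SHARD has none listed.
ZERO_ANALOGUE = {
    ShardingStrategies.FULL_SHARD: "ZeRO-3",
    ShardingStrategies.HYBRID_SHARD: "ZeRO-3*",
    ShardingStrategies.SHARD_GRAD_OP: "ZeRO-2",
    ShardingStrategies.NO_SHARD: None,
}


def describe_strategy(name: str) -> str:
    """Hypothetical helper: label a strategy and flag the untested ones."""
    strategy = ShardingStrategies(name)  # raises ValueError for unknown names
    analogue = ZERO_ANALOGUE[strategy]
    label = strategy.value if analogue is None else f"{strategy.value} ({analogue})"
    # Per the README note, only SHARD_GRAD_OP has been extensively tested.
    if strategy is not ShardingStrategies.SHARD_GRAD_OP:
        label += " [not extensively tested]"
    return label
```

Looking a strategy up by value means a misspelled name fails immediately with `ValueError` instead of silently selecting a default.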
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,6 +6,7 @@ py-cpuinfo
 # replace custom pytorch images with the 2.3.0
 torch>=2.3.0a0
 transformers>=4.41.2
+accelerate>=0.34.2
 datasets>=2.15.0
 numba
 # Note: numpy ranges copied from instructlab/instructlab
6 changes: 6 additions & 0 deletions src/instructlab/training/__init__.py
@@ -7,15 +7,21 @@
     "TorchrunArgs",
     "TrainingArgs",
     "run_training",
+    "FSDPOptions",
+    "ShardingStrategies",
+    "DistributedBackend",
 )

 # Local
 from .config import (
     DataProcessArgs,
     DeepSpeedOffloadStrategy,
     DeepSpeedOptions,
+    DistributedBackend,
+    FSDPOptions,
     LoraOptions,
     QuantizeDataType,
+    ShardingStrategies,
     TorchrunArgs,
     TrainingArgs,
 )
13 changes: 7 additions & 6 deletions src/instructlab/training/chat_templates/ibm_generic_tmpl.py
@@ -1,14 +1,15 @@
 # SPDX-License-Identifier: Apache-2.0

 # First Party
-from instructlab.training.tokenizer_utils import SpecialTokens
+from instructlab.training.tokenizer_utils import SpecialTokens, TokenInfo

 SPECIAL_TOKENS = SpecialTokens(
-    system="<|system|>",
-    user="<|user|>",
-    assistant="<|assistant|>",
-    eos="<|endoftext|>",
-    pad="<|pad|>",
+    system=TokenInfo("<|system|>", add_to_tokenizer=True),
+    user=TokenInfo("<|user|>", add_to_tokenizer=True),
+    assistant=TokenInfo("<|assistant|>", add_to_tokenizer=True),
+    eos=TokenInfo("<|endoftext|>", add_to_tokenizer=True),
+    pad=TokenInfo("<|pad|>", add_to_tokenizer=True),
+    bos=TokenInfo("<|begginingoftext|>", add_to_tokenizer=True),
 )

 CHAT_TEMPLATE = (
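
The change above wraps each special token in a `TokenInfo` carrying an `add_to_tokenizer` flag. The definitions in `tokenizer_utils` are not part of this diff, so the sketch below uses hypothetical stand-in dataclasses to show one way such a flag could be consumed. The field layout and the `tokens_to_add` helper are assumptions, not the library's actual code.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TokenInfo:
    # Hypothetical stand-in; the real class lives in tokenizer_utils.
    token: str
    add_to_tokenizer: bool = False


@dataclass
class SpecialTokens:
    # Hypothetical stand-in with the roles used by the two templates.
    system: Optional[TokenInfo] = None
    user: Optional[TokenInfo] = None
    assistant: Optional[TokenInfo] = None
    eos: Optional[TokenInfo] = None
    pad: Optional[TokenInfo] = None
    bos: Optional[TokenInfo] = None


def tokens_to_add(special: SpecialTokens) -> List[str]:
    """Collect, in field order, the tokens flagged for tokenizer registration."""
    selected = []
    for spec in fields(special):
        info = getattr(special, spec.name)
        if info is not None and info.add_to_tokenizer:
            selected.append(info.token)
    return selected


# Values copied from the Mistral template in this PR: the [INST] markers
# are deliberately not added to the tokenizer.
MISTRAL_TOKENS = SpecialTokens(
    bos=TokenInfo("<s>", add_to_tokenizer=True),
    eos=TokenInfo("</s>", add_to_tokenizer=True),
    user=TokenInfo("[INST]", add_to_tokenizer=False),
    assistant=TokenInfo("[/INST]", add_to_tokenizer=False),
)
```

With these stand-ins, `tokens_to_add(MISTRAL_TOKENS)` selects only the `eos` and `bos` strings, which matches the intent expressed by the flags in the Mistral template.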
43 changes: 29 additions & 14 deletions src/instructlab/training/chat_templates/mistral_tmpl.py
@@ -1,24 +1,39 @@
 # SPDX-License-Identifier: Apache-2.0

 # First Party
-from instructlab.training.tokenizer_utils import SpecialTokens
+from instructlab.training.tokenizer_utils import SpecialTokens, TokenInfo

 SPECIAL_TOKENS = SpecialTokens(
-    bos="<s>",
-    eos="</s>",
-    user="[INST]",
-    assistant="[/INST]",
+    bos=TokenInfo("<s>", add_to_tokenizer=True),
+    eos=TokenInfo("</s>", add_to_tokenizer=True),
+    user=TokenInfo("[INST]", add_to_tokenizer=False),
+    assistant=TokenInfo("[/INST]", add_to_tokenizer=False),
 )

 CHAT_TEMPLATE = (
+    "{%- if messages[0]['role'] == 'system' %}"
+    "{%- set system_message = messages[0]['content'] %}"
+    "{%- set loop_messages = messages[1:] %}"
+    "{%- else %}"
+    "{%- set loop_messages = messages %}"
+    "{%- endif %}"
     "{{ '<s>' }}"
-    "{% for message in messages %}"
-    "{% if message['role'] == 'pretraining' %}"
-    "{{'<|pretrain|>' + message['content'] + '</s>' + '<|/pretrain|>'}}"
-    "{% elif message['role'] == 'user' %}"
-    "{{ '[INST] ' + message['content'] + ' [/INST]' }}"
-    "{% elif message['role'] == 'assistant' %}"
-    "{{ message['content'] + '</s>'}}"
-    "{% endif %}"
-    "{% endfor %}"
+    "{%- for message in loop_messages %}"
+    "{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}"
+    "{{- raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}"
+    "{%- endif %}"
+    "{%- if message['role'] == 'user' %}"
+    "{%- if loop.first and system_message is defined %}"
+    "{{- ' [INST] ' + system_message + '\n\n' + message['content'] + ' [/INST]' }}"
+    "{%- else %}"
+    "{{- ' [INST] ' + message['content'] + ' [/INST]' }}"
+    "{%- endif %}"
+    "{%- elif message['role'] == 'pretraining' %}"
+    "{{- '<|pretrain|>' + message['content'] + '</s>' + '<|/pretrain|>' }}"
+    "{%- elif message['role'] == 'assistant' %}"
+    "{{- ' ' + message['content'] + '</s>'}}"
+    "{%- else %}"
+    "{{- raise_exception('Only user and assistant roles are supported, with the exception of an initial optional system message!') }}"
+    "{%- endif %}"
+    "{%- endfor %}"
 )
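
To make the control flow of the new Mistral template easier to follow, here is a plain-Python paraphrase of it. This is a reading aid under stated assumptions, not the code the library runs: `render_mistral` is a hypothetical name, the real rendering goes through the Jinja template above, and this version tolerates an empty message list where the template would not.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"


def render_mistral(messages: List[Dict[str, str]]) -> str:
    """Paraphrase of the updated CHAT_TEMPLATE (illustrative only)."""
    system_message = None
    loop_messages = messages
    # An optional leading system message is folded into the first user turn.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        loop_messages = messages[1:]

    parts = [BOS]
    for index, message in enumerate(loop_messages):
        role, content = message["role"], message["content"]
        # Same test as the template: a message must be a user turn
        # exactly when it sits at an even position.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("conversation roles must alternate user/assistant")
        if role == "user":
            if index == 0 and system_message is not None:
                parts.append(f" [INST] {system_message}\n\n{content} [/INST]")
            else:
                parts.append(f" [INST] {content} [/INST]")
        elif role == "pretraining":
            parts.append(f"<|pretrain|>{content}{EOS}<|/pretrain|>")
        elif role == "assistant":
            parts.append(f" {content}{EOS}")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    return "".join(parts)
```

One consequence worth noting: because the alternation check runs before the role dispatch, a `pretraining` message at an even position (including the first) is rejected by the check rather than reaching its own branch.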
30 changes: 30 additions & 0 deletions src/instructlab/training/config.py
@@ -28,6 +28,12 @@ class DeepSpeedOffloadStrategy(Enum):
     NONE = None


+# public API
+class DistributedBackend(Enum):
+    FSDP: str = "fsdp"
+    DEEPSPEED: str = "deepspeed"
+
+
 # public API
 class QuantizeDataType(Enum):
     """
@@ -111,6 +117,24 @@ class DeepSpeedOptions(BaseModel):
     save_samples: int | None = None


+# public API
+class ShardingStrategies(Enum):
+    FULL_SHARD = "FULL_SHARD"
+    SHARD_GRAD_OP = "SHARD_GRAD_OP"
+    NO_SHARD = "NO_SHARD"
+    HYBRID_SHARD = "HYBRID_SHARD"
+
+
+# public API
+class FSDPOptions(BaseModel):
+    """
+    Represents the options for configuring FSDP which are exposed by the Training Library
+    """
+
+    cpu_offload_params: Optional[bool] = False
+    sharding_strategy: ShardingStrategies = ShardingStrategies.SHARD_GRAD_OP
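
The enums and `FSDPOptions` added above are small enough to exercise in isolation. The sketch below mirrors them with the standard library (a `dataclass` in place of Pydantic's `BaseModel`) so it runs without the package installed. `parse_backend` is a hypothetical helper, not something this PR adds.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DistributedBackend(Enum):
    FSDP = "fsdp"
    DEEPSPEED = "deepspeed"


class ShardingStrategies(Enum):
    FULL_SHARD = "FULL_SHARD"
    SHARD_GRAD_OP = "SHARD_GRAD_OP"
    NO_SHARD = "NO_SHARD"
    HYBRID_SHARD = "HYBRID_SHARD"


@dataclass
class FSDPOptions:
    # Stand-in for the Pydantic model; defaults copied from the diff.
    cpu_offload_params: Optional[bool] = False
    sharding_strategy: ShardingStrategies = ShardingStrategies.SHARD_GRAD_OP


def parse_backend(value: str) -> DistributedBackend:
    """Hypothetical helper: map a user-supplied string to a backend member."""
    try:
        return DistributedBackend(value.lower())
    except ValueError:
        valid = ", ".join(member.value for member in DistributedBackend)
        raise ValueError(f"unknown backend {value!r}; expected one of: {valid}") from None
```

Because the enum values are the same lowercase strings documented in the README ("fsdp" and "deepspeed"), a value lookup is enough to validate user input.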


# public API
class TrainingArgs(BaseModel):
"""
@@ -157,6 +181,12 @@ class TrainingArgs(BaseModel):
             cpu_offload_optimizer_pin_memory=False,
         )
     )
+    fsdp_options: FSDPOptions = Field(
+        default_factory=lambda: FSDPOptions(
+            cpu_offload_params=False, sharding_strategy=ShardingStrategies.SHARD_GRAD_OP
+        )
+    )
+    distributed_backend: DistributedBackend = DistributedBackend.DEEPSPEED

     disable_flash_attn: Optional[bool] = False

Review thread on `fsdp_options: FSDPOptions = Field(`:
Contributor: does this need to be a factory? I think it can just be an assignment
Contributor: I'm following the current convention set by DeepSpeedOptions in the file, so imo if we want to change this, we should make a follow-up PR that updates both of them
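
The review thread above on `fsdp_options` asks whether `default_factory` is needed. The sketch below shows what the factory guarantees, using standard-library dataclasses as a stand-in for the Pydantic models. Pydantic is documented to copy mutable defaults per instance, so a plain assignment may well behave the same there; treat that as an assumption to verify rather than a finding of this PR.

```python
from dataclasses import dataclass, field


@dataclass
class FSDPOptions:
    # Simplified stand-in: the strategy is a plain string here.
    cpu_offload_params: bool = False
    sharding_strategy: str = "SHARD_GRAD_OP"


@dataclass
class TrainingArgs:
    # The factory runs once per TrainingArgs instance, so every instance
    # gets its own FSDPOptions object instead of sharing one default.
    fsdp_options: FSDPOptions = field(
        default_factory=lambda: FSDPOptions(
            cpu_offload_params=False, sharding_strategy="SHARD_GRAD_OP"
        )
    )
    distributed_backend: str = "deepspeed"


first = TrainingArgs()
second = TrainingArgs()
first.fsdp_options.cpu_offload_params = True  # mutate only the first instance
```

After the mutation, `second.fsdp_options` is unchanged, which is the isolation the factory pattern is meant to provide.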
