deepspeed multiGPU resume from checkpoint fails #1134

Closed
manishiitg opened this issue Jan 17, 2024 · 20 comments · Fixed by #1227
Labels
bug Something isn't working

Comments

@manishiitg

manishiitg commented Jan 17, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Resuming training from a checkpoint should work.

Current behaviour

(en-hi-spot, pid=16745)     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
(en-hi-spot, pid=16745)     return inner_training_loop(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1699, in _inner_training_loop
(en-hi-spot, pid=16745)     deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
(en-hi-spot, pid=16745)     load_path, _ = deepspeed_engine.load_checkpoint(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
(en-hi-spot, pid=16745)     load_path, client_states = self._load_checkpoint(load_dir,
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
(en-hi-spot, pid=16745)     self.load_module_state_dict(checkpoint=checkpoint,
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
(en-hi-spot, pid=16745)     self.module.load_state_dict(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
(en-hi-spot, pid=16745)     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
(en-hi-spot, pid=16745) RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
(en-hi-spot, pid=16745) 	Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.0.input_layernorm.weight", "base_model.model.model.layers.0.post_attention_layernorm.weight", "base_model.model.model.layers.1.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.1.input_layernorm.weight", "base_model.model.model.layers.1.post_attention_layernorm.weight", "base_model.model.model.layers.2.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.2.input_layernorm.weight", "base_model.model.model.layers.2.post_attention_layernorm.weight", "base_model.model.model.layers.3.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.3.input_layernorm.weight", "base_model.model.model.layers.3.post_attention_layernorm.weight", "base_model.model.model.layers.4.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.4.input_layernorm.weight", "base_model.model.model.layers.4.post_attention_layernorm.weight", "base_model.model.model.laye

Steps to reproduce

The error occurs when resuming training from a checkpoint.

Config yaml

base_model: unsloth/tinyllama

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: manishiitg/aditi-chat-instruct-hi-v1-dedupe
    type: completion

wandb_project: tiny-aditi
hub_model_id: manishiitg/tinyllama-chat-instruct-hi-v1
hf_use_auth_token: true

dataset_prepared_path:
val_set_size: 0
output_dir: /sky-notebook/manishiitg/tinyllama-chat-instruct-hi-v1

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 14
num_epochs: 4
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 100 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
manishiitg added the bug label on Jan 17, 2024
@vip-china

the same error

@winglian
Collaborator

see #1156 (comment)

@zacbrannelly
Contributor

Same error and #1156 (comment) didn't fix it. I'm also using mistralai/Mixtral-8x7B-Instruct-v0.1 as a base model.

@manishiitg
Author

@winglian doesn't work, still the same issue

@winglian
Collaborator

Lora training should be resumed using lora_model_dir
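
A sketch of what that looks like in the config yaml (the checkpoint path below is a placeholder, not a path taken from this issue):

lora_model_dir: <output_dir>/checkpoint-<step>  ## point at the saved checkpoint dir (placeholder)
resume_from_checkpoint:                         ## leave unset (see the follow-up below)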

@manishiitg
Author

That's what I tried.

(screenshot: config with the checkpoint path set in lora_model_dir)

I specified the checkpoint path in lora_model_dir but still get the exact same error.

@winglian
Collaborator

did you unset resume_from_checkpoint?

@manishiitg
Author

If I unset resume_from_checkpoint, I don't get the error anymore, but training starts from epoch zero.

This is where my training had stopped:

# {'loss': 1.6618, 'learning_rate': 0.00019972783253900808, 'epoch': 0.09}
# {'loss': 1.6583, 'learning_rate': 0.00019970805448735204, 'epoch': 0.1}
# {'loss': 1.6555, 'learning_rate': 0.00019968758384408713, 'epoch': 0.1}
# {'loss': 1.6521, 'learning_rate': 0.00019966642075140638, 'epoch': 0.1}

I set lora_model_dir to the latest checkpoint dir and removed resume_from_checkpoint.

This is from where the training resumed:

{'loss': 1.6269, 'learning_rate': 0.0001999996526902403, 'epoch': 0.0}

This method kind of works, but it always resumes training from scratch?

@winglian
Collaborator

seems this is an upstream issue (never actually resolved) huggingface/peft#746

@winglian
Collaborator

@manishiitg @zacbrannelly @vip-china see #1227, I've confirmed this resumes for me with zero2

before resume:
(screenshot of training metrics before resuming)

resuming:
(screenshot of training metrics after resuming)
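
For reference, "zero2" refers to the DeepSpeed ZeRO stage 2 config; in an axolotl config yaml that is typically selected by pointing deepspeed at the JSON bundled with the repo (a sketch, assuming the default deepspeed_configs layout):

deepspeed: deepspeed_configs/zero2.json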

@winglian
Collaborator

seems the train loss doesn't perfectly line up after resume though 🤷

@manishiitg
Author

great! can't wait for the merge to test it out :)

@satpalsr
Contributor

@winglian I am trying with ds1, and set lora_model_dir to the checkpoint dir.
It starts from lower loss (similar to my previous checkpoint) but the epoch starts from 0.
What am I missing?

@manishiitg
Author

I don't think we need to set lora_model_dir anymore @satpalsr, simply resume from the checkpoint.
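
Roughly, in the config yaml (the checkpoint path is a placeholder; auto_resume_from_checkpoints: true, as in the original config above, can be used instead of a hard-coded path):

resume_from_checkpoint: <output_dir>/checkpoint-<step>  ## placeholder path
lora_model_dir:                                         ## leave unset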

@satpalsr
Contributor

Then it says
FileNotFoundError: [Errno 2] No such file or directory: '/axolotl/llama-qlora/checkpoint-123/scheduler.pt'

@satpalsr
Contributor

satpalsr commented Jan 29, 2024

Got it, scheduler.pt is only saved with the new changes now.
It works for new training runs, but I can't use the previous checkpoints I had already saved before pulling the code changes.

@winglian
Collaborator

@manishiitg confirmed working for you?

@manishiitg
Author

Unfortunately, I am only able to run Docker builds on my GPU cluster, so I'm not able to verify from the branch.

If I clone and install, I get issue #945, so I'm unable to test the branch.

@manishiitg
Author

I can confirm very quickly once the PR is merged to master and the Docker build is updated :)

@ra1995

ra1995 commented Mar 24, 2024

Hi all, I am still facing this issue. My config file is as follows:

base_model: /dev/shm/Yarn-Mistral-7b-64k
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_mistral_derived_model: false

bnb_config_kwargs:
  llm_int8_has_fp16_weight: false
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true

load_in_8bit: false
load_in_4bit: true
strict: false
bfloat16: true

model_config:
  output_router_logits: true

datasets:
  - path: /home/ragraw06/mistral_project/mnt/processed_datasets/Capybara-train.jsonl
    type: completion
    field: text
  - path: /mnt/processed_datasets/OpenOrca1M
    type: alpaca_w_system.load_open_orca_chatml
    train_on_split: train
  - path: /mnt/processed_datasets/OpenHermes
    type: alpaca:chatml
    train_on_split: train

dataset_prepared_path: /dev/shm/datasets/dataset-debug
output_dir: /dev/shm/models/model_backup
resume_from_checkpoint: /dev/shm/models/model_backup/checkpoint-39270
hf_use_auth_token:

adapter: qlora
lora_model_dir: /dev/shm/models/model_backup/checkpoint-39270

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false
val_set_size: 0.005
eval_steps: 0.10
eval_sample_packing: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head

gradient_accumulation_steps: 1
micro_batch_size: 16
eval_batch_size: 16
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0001
weight_decay: 0.05
max_grad_norm:

train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false

trust_remote_code: true
gradient_checkpointing: true
flash_attention: true
deepspeed: /home/ragraw06/mistral_project/axolotl/deepspeed_configs/zero2.json
local_rank:
xformers_attention:
fsdp:
fsdp_config:

warmup_ratio: 0.03
save_steps: 0.01
save_total_limit: 2
logging_steps: 100
early_stopping_patience:
special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"
seed: 42
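
Per the discussion above (#1227 and the lora_model_dir comments), resume is now driven by resume_from_checkpoint alone, so one hedged suggestion for this config is to leave lora_model_dir unset; a sketch using the same paths, purely illustrative:

resume_from_checkpoint: /dev/shm/models/model_backup/checkpoint-39270
lora_model_dir:  ## leave unset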
