deepspeed multiGPU resume from checkpoint fails #1134

Closed
manishiitg opened this issue Jan 17, 2024 · 20 comments · Fixed by #1227
Labels
bug Something isn't working

Comments

@manishiitg

manishiitg commented Jan 17, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Resuming training from a checkpoint should work.

Current behaviour

(en-hi-spot, pid=16745)     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
(en-hi-spot, pid=16745)     return inner_training_loop(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1699, in _inner_training_loop
(en-hi-spot, pid=16745)     deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
(en-hi-spot, pid=16745)     load_path, _ = deepspeed_engine.load_checkpoint(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
(en-hi-spot, pid=16745)     load_path, client_states = self._load_checkpoint(load_dir,
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
(en-hi-spot, pid=16745)     self.load_module_state_dict(checkpoint=checkpoint,
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
(en-hi-spot, pid=16745)     self.module.load_state_dict(
(en-hi-spot, pid=16745)   File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
(en-hi-spot, pid=16745)     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
(en-hi-spot, pid=16745) RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
(en-hi-spot, pid=16745) 	Missing key(s) in state_dict: "base_model.model.model.embed_tokens.weight", "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.0.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.0.input_layernorm.weight", "base_model.model.model.layers.0.post_attention_layernorm.weight", "base_model.model.model.layers.1.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.1.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.1.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.1.input_layernorm.weight", "base_model.model.model.layers.1.post_attention_layernorm.weight", "base_model.model.model.layers.2.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.2.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.2.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.2.input_layernorm.weight", "base_model.model.model.layers.2.post_attention_layernorm.weight", "base_model.model.model.layers.3.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.3.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.3.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.3.input_layernorm.weight", "base_model.model.model.layers.3.post_attention_layernorm.weight", "base_model.model.model.layers.4.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.4.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.gate_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.up_proj.base_layer.weight", "base_model.model.model.layers.4.mlp.down_proj.base_layer.weight", "base_model.model.model.layers.4.input_layernorm.weight", "base_model.model.model.layers.4.post_attention_layernorm.weight", "base_model.model.model.laye

Steps to reproduce

The error occurs when resuming training from a checkpoint.

Config yaml

base_model: unsloth/tinyllama

model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

chat_template: chatml
datasets:
  - path: manishiitg/aditi-chat-instruct-hi-v1-dedupe
    type: completion

wandb_project: tiny-aditi
hub_model_id: manishiitg/tinyllama-chat-instruct-hi-v1
hf_use_auth_token: true

dataset_prepared_path:
val_set_size: 0
output_dir: /sky-notebook/manishiitg/tinyllama-chat-instruct-hi-v1

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:

wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 14
num_epochs: 4
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: true ## manage checkpoint resume from here
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 100 ## increase based on your dataset
save_strategy: steps
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
manishiitg added the bug label on Jan 17, 2024
@vip-china

the same error

@winglian
Collaborator

see #1156 (comment)

@zacbrannelly
Contributor

Same error and #1156 (comment) didn't fix it. I'm also using mistralai/Mixtral-8x7B-Instruct-v0.1 as a base model.

@manishiitg
Author

@winglian doesn't work, still the same issue

@winglian
Collaborator

Lora training should be resumed using lora_model_dir
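
A sketch of what that looks like in the config yaml (the checkpoint path below is a placeholder, not a path taken from this issue):

lora_model_dir: <output_dir>/checkpoint-<step>  ## point at the saved checkpoint dir (placeholder)
resume_from_checkpoint:                         ## leave unset (see the follow-up below)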

@manishiitg
Author

That's what I tried.

(screenshot: config with the checkpoint path set in lora_model_dir)

I specified the checkpoint path in lora_model_dir but still get the exact same error.

@winglian
Collaborator

did you unset resume_from_checkpoint?

@manishiitg
Author

If I unset resume_from_checkpoint, I don't get the error anymore, but training starts from epoch zero.

This is where my training had stopped:

# {'loss': 1.6618, 'learning_rate': 0.00019972783253900808, 'epoch': 0.09}
# {'loss': 1.6583, 'learning_rate': 0.00019970805448735204, 'epoch': 0.1}
# {'loss': 1.6555, 'learning_rate': 0.00019968758384408713, 'epoch': 0.1}
# {'loss': 1.6521, 'learning_rate': 0.00019966642075140638, 'epoch': 0.1}

I set lora_model_dir to the latest checkpoint dir and removed resume_from_checkpoint.

This is from where the training resumed:

{'loss': 1.6269, 'learning_rate': 0.0001999996526902403, 'epoch': 0.0}

This method kind of works, but it always resumes training from scratch?

@winglian
Collaborator

seems this is an upstream issue (never actually resolved) huggingface/peft#746

@winglian
Collaborator

@manishiitg @zacbrannelly @vip-china see #1227, I've confirmed this resumes for me with zero2

before resume:
(screenshot of training metrics before resuming)

resuming:
(screenshot of training metrics after resuming)
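
For reference, "zero2" refers to the DeepSpeed ZeRO stage 2 config; in an axolotl config yaml that is typically selected by pointing deepspeed at the JSON bundled with the repo (a sketch, assuming the default deepspeed_configs layout):

deepspeed: deepspeed_configs/zero2.json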

@winglian
Collaborator

seems the train loss doesn't perfectly line up after resume though 🤷

@manishiitg
Author

great! can't wait for the merge to test it out :)

@satpalsr
Contributor

@winglian I am trying with ds1, and set lora_model_dir to the checkpoint dir.
It starts from lower loss (similar to my previous checkpoint) but the epoch starts from 0.
What am I missing?

@manishiitg
Author

I don't think we need to set lora_model_dir anymore @satpalsr, simply resume from the checkpoint.
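
Roughly, in the config yaml (the checkpoint path is a placeholder; auto_resume_from_checkpoints: true, as in the original config above, can be used instead of a hard-coded path):

resume_from_checkpoint: <output_dir>/checkpoint-<step>  ## placeholder path
lora_model_dir:                                         ## leave unset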

@satpalsr
Contributor

Then it says
FileNotFoundError: [Errno 2] No such file or directory: '/axolotl/llama-qlora/checkpoint-123/scheduler.pt'

@satpalsr
Contributor

satpalsr commented Jan 29, 2024

Got it, scheduler.pt is only saved with the new changes now.
It works for new training runs, but I can't use the previous checkpoints I had already saved before pulling the code changes.

@winglian
Collaborator

@manishiitg confirmed working for you?

@manishiitg
Author

Unfortunately, I am only able to run Docker builds on my GPU cluster, so I'm not able to verify from the branch.

If I clone and install, I get issue #945, so I'm unable to test the branch.

@manishiitg
Author

I can confirm very quickly once the PR is merged to master and the Docker build is updated :)

@ra1995

ra1995 commented Mar 24, 2024

Hi all, I am still facing this issue. My config file is as follows:

base_model: /dev/shm/Yarn-Mistral-7b-64k
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
is_mistral_derived_model: false

bnb_config_kwargs:
  llm_int8_has_fp16_weight: false
  bnb_4bit_quant_type: nf4
  bnb_4bit_use_double_quant: true

load_in_8bit: false
load_in_4bit: true
strict: false
bfloat16: true

model_config:
  output_router_logits: true

datasets:
  - path: /home/ragraw06/mistral_project/mnt/processed_datasets/Capybara-train.jsonl
    type: completion
    field: text
  - path: /mnt/processed_datasets/OpenOrca1M
    type: alpaca_w_system.load_open_orca_chatml
    train_on_split: train
  - path: /mnt/processed_datasets/OpenHermes
    type: alpaca:chatml
    train_on_split: train

dataset_prepared_path: /dev/shm/datasets/dataset-debug
output_dir: /dev/shm/models/model_backup
resume_from_checkpoint: /dev/shm/models/model_backup/checkpoint-39270
hf_use_auth_token:

adapter: qlora
lora_model_dir: /dev/shm/models/model_backup/checkpoint-39270

sequence_len: 2048
sample_packing: false
pad_to_sequence_len: false
val_set_size: 0.005
eval_steps: 0.10
eval_sample_packing: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_target_linear: true
lora_fan_in_fan_out:
lora_modules_to_save:
  - embed_tokens
  - lm_head

gradient_accumulation_steps: 1
micro_batch_size: 16
eval_batch_size: 16
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0001
weight_decay: 0.05
max_grad_norm:

train_on_inputs:
group_by_length: false
bf16: true
fp16: false
tf32: false

trust_remote_code: true
gradient_checkpointing: true
flash_attention: true
deepspeed: /home/ragraw06/mistral_project/axolotl/deepspeed_configs/zero2.json
local_rank:
xformers_attention:
fsdp:
fsdp_config:

warmup_ratio: 0.03
save_steps: 0.01
save_total_limit: 2
logging_steps: 100
early_stopping_patience:
special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"
seed: 42
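
Per the discussion above (#1227 and the lora_model_dir comments), resume is now driven by resume_from_checkpoint alone, so one hedged suggestion for this config is to leave lora_model_dir unset; a sketch using the same paths, purely illustrative:

resume_from_checkpoint: /dev/shm/models/model_backup/checkpoint-39270
lora_model_dir:  ## leave unset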
