How to resume training from a checkpoint when training LoRA using DeepSpeed? #26665

Sakurakdx · 2023-10-08T03:51:00Z

System Info

transformers version: 4.34.0.dev0
Platform: Linux-5.4.143.bsk.7-amd64-x86_64-with-glibc2.28
Python version: 3.10.12
Huggingface_hub version: 0.16.4
Safetensors version: 0.3.2
Accelerate version: 0.21.0
Accelerate config: - compute_environment: LOCAL_MACHINE
- distributed_type: DEEPSPEED
- use_cpu: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {'deepspeed_config_file': 'none', 'zero3_init_flag': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}
PyTorch version (GPU?): 2.0.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@pacman100 @ArthurZucker @younesbelkada

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

When training LoRA with DeepSpeed, I want to use the Trainer's resume-from-checkpoint feature. The sample code is as follows:

causal_model = AutoModelForCausalLM.from_pretrained(model_pretrained_path_,
                                                    config=config,
                                                    trust_remote_code=True,
                                                    low_cpu_mem_usage=self.params["low_cpu_mem_usage"])

peft = PEFT(config_path_or_data=peft_params)
causal_model = peft.get_peft_model(model=causal_model)

trainer = Seq2SeqTrainer(
        params=trainer_params,
        model=causal_model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        data_collator=data_collator,
        eval_dataset=eval_dataset,
        compute_metrics=dataset_t.metric,
    )

trainer.train(resume_from_checkpoint=True)
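
For context, passing resume_from_checkpoint=True asks the Hugging Face Trainer to continue from the last checkpoint saved in its output directory (assuming the Seq2SeqTrainer wrapper above forwards the call unchanged). The sketch below only illustrates that lookup using the standard library. The helper name find_last_checkpoint is hypothetical; transformers ships its own helper, transformers.trainer_utils.get_last_checkpoint, which is what should be used in practice.

```python
import os
import re
from typing import Optional

# The Trainer saves checkpoints in folders named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint folder with the highest step number, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        # Only directories count; stray files with a matching name are ignored.
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    # Compare by the integer step so that checkpoint-100 beats checkpoint-20.
    return os.path.join(output_dir, max(candidates)[1])
```

If this returns None for the output directory, there is nothing to resume from, and the failure described below would not be reached.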

The DeepSpeed config is as follows:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": false,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 50,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Expected behavior

Training should resume from the checkpoint. Instead, it fails with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument state_steps in method wrapper_CUDA___fused_adamw_)


Sakurakdx · 2023-10-12T03:25:10Z

I found that after the optimizer state is loaded, the step entry in its state is on the CPU, but it should be on CUDA.
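
One way to confirm this observation is to scan an optimizer state dict for entries that sit on an unexpected device. The following is a minimal sketch, not code from the thread: the helper names are hypothetical, and it assumes the usual torch.optim layout, where state_dict()["state"] maps a parameter index to a dict of tensors such as step, exp_avg and exp_avg_sq. It relies on duck typing (anything with a device attribute counts as a tensor), so it does not import torch itself.

```python
from typing import Any, Dict, List, Tuple


def _device_type(value: Any) -> str:
    """Return 'cpu', 'cuda', ... for tensor-like values, or '' for anything else."""
    device = getattr(value, "device", None)
    # str(torch.device("cuda", 0)) is "cuda:0"; keep only the part before ":".
    return str(device).split(":")[0] if device is not None else ""


def find_misplaced_state(
    state_dict: Dict[str, Any], expected: str = "cuda"
) -> List[Tuple[Any, str, str]]:
    """List (param_index, state_key, device_type) for entries not on `expected`."""
    misplaced = []
    for index, param_state in state_dict.get("state", {}).items():
        for key, value in param_state.items():
            device_type = _device_type(value)
            # Non-tensor entries (for example a plain int step) have no device.
            if device_type and device_type != expected:
                misplaced.append((index, key, device_type))
    return misplaced
```

Calling find_misplaced_state(optimizer.state_dict()) on a plain torch optimizer right after loading would list the step entries if they were left on the CPU. How to reach the underlying torch optimizer when DeepSpeed wraps it is not covered here.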

Sakurakdx · 2023-10-12T04:07:25Z

It seems torch requires step to be on the CPU, but DeepSpeed requires it to be on the same device as the other tensors?
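
A possible stopgap, sketched under stated assumptions, is to move every state tensor onto its parameter's device after the checkpoint has been loaded. This is not taken from the thread and is not necessarily what any upstream fix does. The function name is hypothetical, and it assumes the optimizer exposes the torch.optim-style state mapping (parameter to dict of tensors). It is only appropriate when the optimizer really needs step on the parameter's device, as the fused AdamW kernel named in the error message does.

```python
def move_state_to_param_device(optimizer) -> int:
    """Move tensor-like state entries to their parameter's device.

    Returns the number of entries that were moved.
    """
    moved = 0
    for param, param_state in optimizer.state.items():
        for key, value in param_state.items():
            # Skip non-tensor entries and entries that are already in place.
            if hasattr(value, "to") and getattr(value, "device", None) != param.device:
                # Reassigning an existing key is safe while iterating the dict.
                param_state[key] = value.to(param.device)
                moved += 1
    return moved
```

The return value makes it easy to log whether anything was actually relocated. A second call should report zero.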

amyeroberts · 2023-11-07T12:40:08Z

Gentle ping @pacman100 @muellerzr

younesbelkada · 2023-12-04T11:14:22Z

I believe #27825 should fix the issue

github-actions · 2023-12-29T08:05:41Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

huggingface deleted a comment from github-actions bot Nov 7, 2023

huggingface deleted a comment from github-actions bot Dec 4, 2023

github-actions bot closed this as completed Jan 6, 2024
