
Model accuracy drops when upgrading from accelerate==0.25.0 to 0.26.0 or 0.27.2 #2476

Closed
gabrielspmoreira opened this issue Feb 21, 2024 · 2 comments

@gabrielspmoreira
System Info

- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.29
- Python version: 3.8.10
- Numpy version: 1.22.2
- PyTorch version (GPU?): 2.1.0a0+fe05266 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 2015.68 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I have a LoRA fine-tuning pipeline for mistralai/Mistral-7B-v0.1, where I have been using accelerate==0.25.0
and deepspeed==0.13.2 to train the model.
I tried to pip install --upgrade accelerate to 0.26.0 or 0.27.2, but noticed that the accuracy drops by ~4.5% when doing so. It is hard to tell from the release notes of the newer versions which change might be causing this behaviour.
My script is based on this bash script from the simlm repo, which calls this Python script.

Here is additional info on my environment and config files.

The issue happens whether I launch with the accelerate or the deepspeed command:

accelerate launch --config_file default_config_ranker.yaml ./src/train_model.py \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    --use_accelerator True \
    ...
deepspeed ./src/train_model.py --deepspeed ds_config.json \
    --model_name_or_path mistralai/Mistral-7B-v0.1 \
    ...

Accelerate Config (default_config.yaml)

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: ./ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

DeepSpeed config (ds_config.json)

{
    "bf16": {
        "enabled": false
    },
    "_fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 12,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": 1000,
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 5000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Expected behavior

The model accuracy should not drop significantly when upgrading the accelerate version.

@BenjaminBossan
Member

This should not happen, thanks for reporting this issue.

but noticed that the accuracy drops by ~4.5% when doing so

Is this train or validation/test accuracy? Are those absolute or relative percentage points?

Generally, this type of finding is very hard to debug without being able to run the code. If it is possible for you, could you check if the same issue occurs without using DeepSpeed? Is it possible to boil down the problem to something that can be run quickly so that we can pinpoint the source of the issue with git bisect? Without this, it's going to be hard to identify what exactly causes the drop.
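
To make the comparison concrete, here is a minimal sketch of the kind of quickly-runnable, DeepSpeed-free reproduction meant above. Everything in it (the synthetic regression task, the tiny model, the seed) is illustrative and not taken from the original pipeline; the idea is just to run the same script under accelerate==0.25.0 and 0.26.0/0.27.2 and compare the printed losses.

# Hypothetical minimal comparison script, not the reporter's actual pipeline.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)  # fix all RNGs so runs are comparable across versions

# Tiny synthetic regression task standing in for the real LoRA fine-tuning job
x = torch.randn(1024, 16)
y = x @ torch.randn(16, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accelerator = Accelerator()  # plain Accelerator, no DeepSpeed plugin
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for epoch in range(3):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(xb), yb)
        accelerator.backward(loss)
        optimizer.step()
    accelerator.print(f"epoch {epoch}: final batch loss {loss.item():.6f}")

If the losses diverge between versions for a script like this, running git bisect over the accelerate repository between v0.25.0 and v0.26.0 could help pinpoint the commit responsible.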

There is one thing that comes to mind from memory: in accelerate 0.25, we had enabled the random sampler to be seedable for reproducibility (#2057), but users reported issues, so from 0.26 we went back to the previous behavior (#2319). Maybe this change had the opposite effect for you? If this applies to you, you could try passing use_seedable_sampler=True to Accelerator and check if that fixes things.
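
As a concrete sketch of that last suggestion (assuming accelerate 0.26/0.27, where use_seedable_sampler is accepted directly by Accelerator; in newer releases this may have moved to DataLoaderConfiguration):

from accelerate import Accelerator
from accelerate.utils import set_seed

set_seed(42)  # fix RNG state so runs are reproducible

# Opt back in to the seedable random sampler behaviour that accelerate 0.25 used.
accelerator = Accelerator(use_seedable_sampler=True)

# Then prepare the model/optimizer/dataloaders exactly as before, e.g.:
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)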

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Apr 1, 2024.