
Why is Unsloth thinking I'm doing multi-GPU optimization when I'm not? #1240

Open

brando90 opened this issue Nov 5, 2024 · 3 comments

brando90 commented Nov 5, 2024

Code:

'''
conda activate beyond_scale_2_unsloth
'''
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from pathlib import Path

from pdb import set_trace as st

opt_args = {
    'batch_size': 8,
    'learning_rate': 5e-2,
    'epochs': 1,
    'adam_epsilon': 1e-8,
    'weight_decay': 1e-4,
    'num_workers': 0,
    'break_early': False
}
hf_args = {'max_seq_length': 256, 'dataset_text_field': "text"}

# Set data type and device
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer using Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="unsloth/Qwen2-1.5B",
    model_name="Qwen/Qwen2.5-Math-1.5B-Instruct",
    max_seq_length=hf_args['max_seq_length'],
    dtype=None,  # Auto-detection for Float16/BFloat16
    load_in_4bit=False,  # Set False if not using 4-bit precision
)

model = model.to(device)
tok = tokenizer
tok.pad_token = tok.eos_token if tok.pad_token_id is None else tok.pad_token

# Add LoRA adapters, targeting only `lm_head` for fine-tuning
st()
model = FastLanguageModel.get_peft_model(
    model=model,
    r=16,  # LoRA rank
    target_modules=["lm_head"],  # Only optimize `lm_head`
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")

# Define training configuration
training_args = TrainingArguments(
    per_device_train_batch_size=opt_args['batch_size'],
    gradient_accumulation_steps=4,
    num_train_epochs=opt_args['epochs'],
    learning_rate=opt_args['learning_rate'],
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="paged_adamw_32bit",
    weight_decay=opt_args['weight_decay'],
    output_dir="./tmp",
    report_to='none'
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field=hf_args['dataset_text_field'],
    max_seq_length=hf_args['max_seq_length'],
    args=training_args,
)

# Print norms before training to check only lm_head will change
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

# Start training
trainer.train()

# Print norms after training to verify only lm_head changed
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

print("Done!\a")

But I'm only using a single A100 GPU...

(beyond_scale_2_unsloth) brando9@ampere1~/beyond-scale-2-alignment-coeff $ python /lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Qwen2 patching. Transformers = 4.46.1.
   \\   /|    GPU: NVIDIA A100-SXM4-80GB. Max memory: 79.138 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py", line 29, in <module>
    model, tokenizer = FastLanguageModel.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/loader.py", line 332, in from_pretrained
    model, tokenizer = dispatch_model.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/qwen2.py", line 87, in from_pretrained
    return FastLlamaModel.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/llama.py", line 1645, in from_pretrained
    raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')
RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!
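For context, a quick way to check how many CUDA devices the process actually sees (a minimal check, not from the original report):

import torch
# If this prints more than 1, the process is seeing multiple cards even though
# only one is intended to be used.
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))  # should report the single A100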
danielhanchen (Contributor) commented

Hm, that is very weird. Is this a machine with multiple cards? Could you try nvidia-smi?

brando90 (Author) commented Nov 5, 2024 via email

Peter-Fy commented Nov 10, 2024

I encountered the same issue on a single machine with multiple GPUs. I set os.environ["CUDA_VISIBLE_DEVICES"] = "1" at the beginning of the code to restrict the run to a single GPU, but it sometimes still throws the following error:

RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!

Without changing any code, rerunning sometimes succeeds and sometimes fails.
I believe this is the same issue as #983, and I hope it can be fixed as soon as possible.
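For reference, CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialized, so the assignment has to run before torch or unsloth are imported. A minimal sketch of that ordering (an illustration of the workaround described above, not a confirmed fix for this issue):

import os
# Must come before any torch/unsloth import, otherwise CUDA may already be
# initialized with every GPU visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from unsloth import FastLanguageModel

print(torch.cuda.device_count())  # expected: 1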
