- Parameter Settings
- Quantization
- Model Type & Max Length
- Batch Size
- Use Flash Attn & Gradient Checkpointing
- LoRA Rank & LoRA Target Modules
- Gradient Accumulation Steps
- Tuners
- Export
- AWQ
- AQLM
Experimental environment:
- A100
- CUDA 11.8
- python 3.10
- torch 2.1.1
- flash_attn 2.3.4
- xformers 0.0.23
- auto_gptq 0.5.1
- bitsandbytes 0.41.3.post2
The following command-line settings are identical across all experiments:
--dataset_test_ratio 0 \
--dataset cls-fudan-news-zh \
--save_strategy no \
--check_dataset_strategy warning \
--preprocess_num_proc 4 \
Unless a parameter is explicitly varied in an experiment, the following default values are used (a complete example command assembling these settings is sketched after this list):
--max_length 2048 \
--batch_size 1 \
--gradient_checkpointing true \
--use_flash_attn true \
--lora_rank 8 \
--lora_target_modules DEFAULT \
--quantization_bit 0 \
--gradient_accumulation_steps 16 \
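For reference, the shared settings and these defaults combine into a single invocation. The sketch below is illustrative only: qwen-7b-chat is just an example value for `--model_type`, and the per-experiment scripts further down override individual flags.

```bash
# Illustrative sketch: shared settings + default values combined into one run.
# qwen-7b-chat is only an example value for --model_type.
swift sft \
    --model_type qwen-7b-chat \
    --sft_type lora \
    --dataset cls-fudan-news-zh \
    --dataset_test_ratio 0 \
    --save_strategy no \
    --check_dataset_strategy warning \
    --preprocess_num_proc 4 \
    --max_length 2048 \
    --batch_size 1 \
    --gradient_checkpointing true \
    --use_flash_attn true \
    --lora_rank 8 \
    --lora_target_modules DEFAULT \
    --quantization_bit 0 \
    --gradient_accumulation_steps 16
```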
Token statistics of the test dataset (computed with the qwen tokenizer): 3234.4±2547.5 tokens, min=91, max=19548.
The experimental script can be found in scripts/benchmark/test_memory_time/.
The test script is:
swift sft \
--model_type {MODEL_TYPE} \
--quantization_bit {QUANTIZATION_BIT} \
--sft_type lora \
...
| Model Type [LoRA] | Quantization | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- |
| qwen-7b-chat | bf16 | 4.31 | 27.74 |
| | int4 (gptq) | 2.05 | 19.21 |
| | int8 (gptq) | 1.97 | 22.20 |
| | int4 (bnb) | 2.41 | 23.85 |
| qwen-14b-chat | bf16 | 2.60 | 40.14 |
| | int4 (gptq) | 1.15 | 23.30 |
| | int8 (gptq) | 1.08 | 29.13 |
| | int4 (bnb) | 1.36 | 30.05 |
| qwen-72b-chat | bf16 | 0.59 (2*A100) | 73.71+78.54 |
| | int4 (gptq) | 0.23 | 54.86 |
| | int8 (gptq) | 0.21 | 78.44 |
| | int4 (bnb) | 0.28 | 74.87 |
The test script is:
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type lora \
...
| Model Type [LoRA] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- |
| qwen-1_8b-chat | 512 | 9.88 | 6.99 |
| | 1024 | 9.90 | 10.71 |
| | 2048 | 8.77 | 16.35 |
| | 4096 | 5.92 | 23.80 |
| | 8192 | 4.19 | 37.03 |
| qwen-7b-chat | 512 | 7.43 | 18.01 |
| | 1024 | 6.51 | 21.73 |
| | 2048 | 4.31 | 27.74 |
| | 4096 | 2.05 | 35.31 |
| | 8192 | 1.34 | 48.41 |
| qwen-14b-chat | 512 | 5.63 | 30.14 |
| | 1024 | 4.36 | 34.43 |
| | 2048 | 2.60 | 40.14 |
| | 4096 | 1.17 | 47.95 |
| | 8192 | 0.79 | 60.74 |
| qwen-72b-chat (2*A100) | 512 | 1.41 | 67.68+73.07 |
| | 1024 | 1.02 | 70.25+77.11 |
| | 2048 | 0.59 | 73.71+78.54 |
| | 4096 | - | OOM |
| | 8192 | - | OOM |
| chatglm3-6b | 512 | 6.72 | 13.94 |
| | 1024 | 6.16 | 12.99 |
| | 2048 | 4.20 | 17.20 |
| | 4096 | 1.92 | 29.80 |
| | 8192 | 1.24 | 66.82 |
| yi-6b-chat | 512 | 5.27 | 13.72 |
| | 1024 | 5.07 | 15.44 |
| | 2048 | 3.84 | 16.95 |
| | 4096 | 1.99 | 28.25 |
| | 8192 | 1.35 | 43.81 |
| yi-34b-chat | 512 | 2.32 | 66.72 |
| | 1024 | 1.76 | 69.10 |
| | 2048 | 1.05 | 71.34 |
| | 4096 | 0.47 | 78.72 |
| | 8192 | 0.31 (2*A100) | 47.01+65.03 |
| openbuddy-zephyr-7b-chat | 512 | 5.17 | 14.99 |
| | 1024 | 3.92 | 16.57 |
| | 2048 | 3.08 | 19.89 |
| | 4096 | 1.85 | 23.29 |
| | 8192 | 0.92 | 52.14 |
| baichuan2-7b-chat | 512 | 6.09 | 18.18 |
| | 1024 | 5.36 | 17.45 |
| | 2048 | 3.43 | 19.18 |
| | 4096 | 1.69 | 34.22 |
| | 8192 | 1.16 | 45.47 |
| baichuan2-13b-chat | 512 | 5.32 | 31.01 |
| | 1024 | 3.91 | 31.58 |
| | 2048 | 1.77 | 32.40 |
| | 4096 | 0.65 | 49.63 |
| | 8192 | 0.36 | 76.17 |
The test script is:
swift sft \
--model_type {MODEL_TYPE} \
--max_length {MAX_LENGTH} \
--sft_type full \
...
| Model Type [FULL] | Max Length | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- |
| qwen-1_8b-chat | 512 | 10.77 | 18.16 |
| | 1024 | 10.39 | 18.62 |
| | 2048 | 8.73 | 35.11 |
| | 4096 | 5.45 | 31.62 |
| | 8192 | 3.81 | 38.93 |
| qwen-7b-chat | 512 | 5.96 | 73.37 |
| | 1024 | 5.00 | 73.64 |
| | 2048 | 3.30 | 74.26 |
| | 4096 | 1.64 | 78.76 |
| | 8192 | 1.11 (2*A100) | 61.34+73.00 |
| qwen-14b-chat (2*A100) | 512 | 3.66 | 60.42+72.31 |
| | 1024 | 2.98 | 60.61+74.37 |
| | 2048 | 1.93 | 60.70+78.22 |
| | 4096 | 0.92 | 75.59+78.64 |
| | 8192 | 0.62 | 76.59+77.68 |
The test script is:
swift sft \
--batch_size {BATCH_SIZE} \
--model_type qwen-7b-chat \
--sft_type lora \
...
| Model Type [LoRA] | Batch Size | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- |
| qwen-7b-chat | 1 | 4.31 | 27.74 |
| | 2 | 3.60 | 43.11 |
| | 4 | 3.02 | 63.81 |
| | 8 | 2.77 | 76.14 |
The test script is:
swift sft \
--use_flash_attn {USE_FLASH_ATTN} \
--gradient_checkpointing {GRADIENT_CHECKPOINTING} \
--model_type qwen-7b-chat \
--sft_type lora \
...
| Model Type [LoRA] | Use Flash Attn | Gradient Checkpointing | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- | --- |
| qwen-7b-chat | ✔ | ✔ | 4.31 | 27.74 |
| | ✔ | ✘ | 6.19 | 37.70 |
| | ✘ | ✔ | 3.13 | 27.71 |
| | ✘ | ✘ | 4.45 | 57.67 |
The test script is:
swift sft \
--lora_rank {LORA_RANK} \
--lora_target_modules {LORA_TARGET_MODULES} \
--model_type qwen-7b-chat \
--sft_type lora \
...
| Model Type [LoRA] | LoRA Rank | LoRA Target Modules | Training Speed (samples/s) | GPU Memory (GiB) | Trainable Params (M) |
| --- | --- | --- | --- | --- | --- |
| qwen-7b-chat | 2 | DEFAULT (c_attn) | 4.27 | 27.72 | 1.05 |
| | 8 | DEFAULT | 4.31 | 27.74 | 4.19 |
| | 64 | DEFAULT | 4.19 | 27.85 | 33.55 |
| | 8 | ALL (all linear) | 3.22 | 27.87 | 17.89 |
The test script is:
swift sft \
--gradient_accumulation_steps {GRADIENT_ACCUMULATION_STEPS} \
--model_type qwen-7b-chat \
--sft_type lora \
...
| Model Type [LoRA] | Gradient Accumulation Steps | Training Speed (samples/s) | GPU Memory (GiB) |
| --- | --- | --- | --- |
| qwen-7b-chat | 1 | 4.26 | 27.73 |
| | 2 | 4.32 | 27.74 |
| | 4 | 4.31 | 27.74 |
| | 8 | 4.32 | 27.74 |
| | 16 | 4.33 | 27.74 |
| | 32 | 4.30 | 27.74 |
| | 64 | 4.32 | 27.74 |
exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
adalora | qwen-7b-chat | ms-agent | 2.0 | adalora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 26.8389(0.3464%) | True | True | lr=5e-05/epoch=2 | 32.55GiB | 0.92(87543 samples/95338.71 seconds) | 17.33(2345 tokens/135.29 seconds) | 0.57 | 1.07 | 0.391 | 0.665 | 0.569 |
adapter | qwen-7b-chat | ms-agent | 2.0 | adapter | | 33.6896(0.4344%) | True | True | lr=5e-05/epoch=2 | 32.19GiB | 1.48(87543 samples/59067.71 seconds) | 26.63(4019 tokens/150.90 seconds) | 0.55 | 1.03 | 0.438 | 0.662 | 0.565
dora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=True | 19.2512(0.2487%) | True | True | lr=5e-05/epoch=2 | 32.46GiB | 0.51(87543 samples/171110.54 seconds) | 4.29(2413 tokens/562.32 seconds) | 0.53 | 1.01 | 0.466 | 0.683 | 0.577 |
full+galore128 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.02GiB | 1.10(87543 samples/79481.96 seconds) | 28.96(2400 tokens/82.88 seconds) | 0.55 | 1.00 | 0.358 | 0.688 | 0.577 |
full+galore32 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=32/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.05GiB | 1.11(87543 samples/78989.74 seconds) | 29.17(2431 tokens/83.35 seconds) | 0.56 | 1.01 | 0.386 | 0.667 | 0.539 |
full+galore64 | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=64/galore_per_parameter=false/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 46.91GiB | 1.11(87543 samples/79200.36 seconds) | 28.94(2448 tokens/84.60 seconds) | 0.56 | 1.01 | 0.397 | 0.674 | 0.544 |
full+galore_emb | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=false/galore_with_embedding=true | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 44.53GiB | 1.10(87543 samples/79775.02 seconds) | 29.45(2433 tokens/82.62 seconds) | 0.55 | 1.00 | 0.398 | 0.670 | 0.568 |
full+galore_perparam | qwen-7b-chat | ms-agent | 2.0 | full | galore_rank=128/galore_per_parameter=true/galore_with_embedding=false | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 47.02GiB | 1.25(87543 samples/69821.89 seconds) | 29.02(2478 tokens/85.39 seconds) | 0.54 | 1.00 | 0.372 | 0.669 | 0.524 |
full+no_mix | qwen-7b-chat | ms-agent | 0.0 | full | | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 72.56GiB | 1.27(29698 samples/23356.97 seconds) | 30.31(11738 tokens/387.29 seconds) | 0.57 | 0.44 | 0.174 | 0.652 | 0.553
full | qwen-7b-chat | ms-agent | 2.0 | full | | 7721.3245(100.0000%) | True | True | lr=5e-05/epoch=2 | 73.53GiB | 1.43(87543 samples/61022.97 seconds) | 29.51(3382 tokens/114.62 seconds) | 0.54 | 0.95 | 0.343 | 0.536 | 0.495
llamapro | qwen-7b-chat | ms-agent | 2.0 | llamapro | num_blocks=4 | 809.5826(9.4900%) | True | True | lr=5e-05/epoch=2 | 38.11GiB | 1.53(87543 samples/57294.42 seconds) | 25.80(2374 tokens/92.02 seconds) | 0.53 | 1.00 | 0.434 | 0.645 | 0.357 |
lora+ | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=16.0/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.95(87543 samples/91923.80 seconds) | 18.81(3329 tokens/176.94 seconds) | 0.53 | 0.98 | 0.432 | 0.647 | 0.344 |
lora+neftune | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False/neftune_alpha=15.0 | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.96(87543 samples/91525.50 seconds) | 19.84(161792 tokens/8156.02 seconds) | 0.53 | 1.02 | 0.456 | 0.671 | 0.401
lora+no_mix | qwen-7b-chat | ms-agent | 0.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 30.86GiB | 0.91(29698 samples/32570.15 seconds) | 19.89(36308 tokens/1825.26 seconds) | 0.53 | 0.53 | 0.470 | 0.666 | 0.574 |
lora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.95(87543 samples/91974.29 seconds) | 18.11(2415 tokens/133.32 seconds) | 0.53 | 1.01 | 0.462 | 0.676 | 0.304 |
qwen-7b-chat-eval | qwen-7b-chat | None | 0.0 | None | | None(None) | None | | | | | 30.81(13765 tokens/446.83 seconds) | | | 0.517 | 0.679 | 0.568
rslora | qwen-7b-chat | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=True/use_dora=False | 17.8913(0.2312%) | True | True | lr=5e-05/epoch=2 | 32.35GiB | 0.94(87543 samples/92758.63 seconds) | 18.87(2762 tokens/146.34 seconds) | 0.53 | 0.99 | 0.451 | 0.679 | 0.339 |
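No launch script is given for the tuner experiments above. As a minimal sketch, the `lora` row (rank=8 / target=ALL / alpha=32, lr=5e-05, 2 epochs) could be approximated as shown below; the flag names are assumptions based on the standard `swift sft` arguments, not a verbatim copy of the command used for the benchmark.

```bash
# Hypothetical reproduction sketch for the "lora" row.
# Flag names assumed from the swift sft CLI; the exact benchmark command may differ.
swift sft \
    --model_type qwen-7b-chat \
    --dataset ms-agent \
    --sft_type lora \
    --lora_rank 8 \
    --lora_target_modules ALL \
    --lora_alpha 32 \
    --learning_rate 5e-5 \
    --num_train_epochs 2 \
    --use_flash_attn true \
    --gradient_checkpointing true
```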
exp_name | model_type | calibration dataset | quantization method | quantization bits | infer speed(tokens/s) | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
---|---|---|---|---|---|---|---|---|
awq-ms-bench-mini | qwen-7b-chat | ms-bench-mini | awq | 4 | 27.25(16501 tokens/605.47 seconds) | 0.494 | 0.665 | 0.571 |
awq-pileval | qwen-7b-chat | pileval | awq | 4 | 26.92(12994 tokens/482.72 seconds) | 0.497 | 0.675 | 0.577 |
gptq-ms-bench-mini | qwen-7b-chat | ms-bench-mini | gptq | 4 | 31.16(15349 tokens/492.54 seconds) | 0.482 | 0.642 | 0.556 |
gptq-pileval | qwen-7b-chat | pileval | gptq | 4 | 31.67(15185 tokens/479.54 seconds) | 0.478 | 0.654 | 0.559 |
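The quantized checkpoints above are produced with `swift export`. A minimal sketch for the awq-pileval row, assuming the standard `--quant_method`, `--quant_bits`, and calibration `--dataset` arguments of `swift export`:

```bash
# Hypothetical sketch: AWQ 4-bit export of qwen-7b-chat calibrated on pileval.
# Argument names are assumed from the swift export CLI.
swift export \
    --model_type qwen-7b-chat \
    --quant_method awq \
    --quant_bits 4 \
    --dataset pileval
```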
exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
qwen1half-7b-chat-awq | qwen1half-7b-chat-awq | ms-agent | 2.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 19.9885(1.5802%) | True | True | lr=5e-05/epoch=2 | 24.26GiB | 0.45(87543 samples/194746.58 seconds) | 16.08(2469 tokens/153.58 seconds) | 0.55 | 1.19 | 0.505 | 0.737 | 0.656 |
exp_name | model_type | dataset | ms-bench mix ratio | tuner | tuner_params | trainable params(M) | flash_attn | gradient_checkpointing | hypers | memory | train speed(samples/s) | infer speed(tokens/s) | train_loss | eval_loss | gsm8k weighted acc | arc weighted acc | ceval weighted acc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
llama2-7b-aqlm-2bit-1x16 | llama2-7b-aqlm-2bit-1x16 | dureader-robust-zh | 0.0 | lora | rank=8/target=ALL/alpha=32/lr_ratio=None/use_rslora=False/use_dora=False | 19.9885(1.6510%) | True | True | lr=5e-05/epoch=2 | 4.04GiB | 0.17(14994 samples/86140.71 seconds) | | 0.48 | 0.74
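The AQLM row fine-tunes a checkpoint that is already quantized to 2 bits, using LoRA. A minimal sketch, assuming the same `swift sft` arguments used elsewhere in this document:

```bash
# Hypothetical sketch: LoRA fine-tuning of a pre-quantized AQLM 2-bit checkpoint.
# Flag names assumed from the swift sft CLI; not the exact benchmark command.
swift sft \
    --model_type llama2-7b-aqlm-2bit-1x16 \
    --dataset dureader-robust-zh \
    --sft_type lora \
    --lora_rank 8 \
    --lora_target_modules ALL \
    --lora_alpha 32 \
    --learning_rate 5e-5 \
    --num_train_epochs 2
```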