
ptv2: not enough GPU memory? #232

Open
sanwei111 opened this issue May 18, 2023 · 11 comments

Comments

@sanwei111

GPUs: two V100s, 24 GB each
max_seq_len=512
train_batchsize=2
Traceback (most recent call last):
File "/workspace/code/code/chatglm_finetuning-stable-vocab130528-v2/train.py", line 182, in
trainer.fit(pl_model, train_dataloaders=train_datasets)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
self.strategy.setup(self)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 344, in setup
self.init_deepspeed()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 448, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 484, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 413, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1408, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in init
self.initialize_optimizer_states()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 608, in initialize_optimizer_states
single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 1; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The second rank (GPU 0) failed at the same point with an identical traceback, ending in:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 0; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
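
A quick size check on the failing allocation (a rough sketch; the ~6.2e9 parameter figure for ChatGLM-6B is an assumption about the model being trained here) suggests the 11.50 GiB request is the fp32 gradient partition that ZeRO stage 1/2 builds for every trainable parameter, i.e. the optimizer appears to cover the full model rather than only the ptv2 prefix:

# Rough check of the failing allocation in initialize_optimizer_states
# (assumes ChatGLM-6B with roughly 6.2e9 parameters; that count is an assumption).
params = 6.2e9        # approximate trainable parameter count
bytes_fp32 = 4        # ZeRO keeps an fp32 gradient/optimizer partition
world_size = 2        # two V100 ranks, so the partition is split in half
partition_gib = params * bytes_fp32 / world_size / 2**30
print(f"{partition_gib:.2f} GiB per rank")  # ~11.5 GiB, close to the 11.50 GiB in the traceback

If that reading is right, lowering the batch size or max_seq_len cannot help at this point, because the allocation happens while the optimizer is being set up, before any batch is processed.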

@sanwei111 (Author)

The model is chatglm, with no quantization.

@ssbuild (Owner) commented May 18, 2023

OOM means there isn't enough memory. Setting the batch size to 1 should work.

@sanwei111 (Author)

Still doesn't work. The run command is CUDA_VISIBLE_DEVICES=0,1 python train.py,
with batchsize already down to 1 and maxseqlen=512.

@ssbuild (Owner) commented May 18, 2023

After changing the length, did you delete the cached data under output?

@sanwei111 (Author)

After changing the parameters, delete the data under output and run data_utils.py again? Yes, I did that.
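
For completeness, a minimal sketch of that cache reset, assuming the cached records live under ./output as the discussion implies (adjust the path to wherever your records were actually written; this also removes anything else stored there):

# Delete the cached tokenized records so data_utils.py rebuilds them with the new max_seq_len.
import shutil
shutil.rmtree("./output", ignore_errors=True)  # path assumed from the discussion above
# afterwards: python data_utils.py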

@ssbuild (Owner) commented May 18, 2023

After changing the parameters, delete the data under output and run data_utils.py again? Yes, I did that.

Try running with deepspeed turned off!
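
The opposite direction may also be worth a try: keep DeepSpeed but move the ZeRO optimizer state to CPU, which targets exactly the allocation that failed in initialize_optimizer_states above. A minimal sketch of the relevant config keys (illustrative only; this project reads its DeepSpeed settings from its own config, so only the keys matter here):

# Minimal ZeRO stage-2 settings with optimizer-state CPU offload (sketch only).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# With PyTorch Lightning the same effect is available via the built-in alias:
#   Trainer(strategy="deepspeed_stage_2_offload", ...)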

@sanwei111 (Author)

I turned it off and tried; it looks a little better, but memory use is still huge:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 30.54 GiB already allocated; 27.75 MiB free; 30.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
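
Note that in this run 30.54 of 31.75 GiB is genuinely allocated, so the max_split_size_mb hint in the message is unlikely to rescue it on its own; fragmentation tuning mainly helps when reserved memory far exceeds allocated memory. If you do want to try it, the option has to take effect before the first CUDA allocation (the 128 below is only an example value, not a recommendation from this thread):

# Set the allocator option mentioned in the error message; must happen before
# the first CUDA tensor is created, e.g. at the very top of train.py or in the shell.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")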

@ssbuild (Owner) commented May 18, 2023

I turned it off and tried; it looks a little better, but memory use is still huge: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 1; 31.75 GiB total capacity; 30.54 GiB already allocated; 27.75 MiB free; 30.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

What is your ptv2 pre-seq-len? Try reducing it; first find a set of parameters that actually runs!

@sanwei111 (Author)

The default value: 32

@sanwei111 (Author)

With pre-seq-len 16, batchsize 2, and maxseqlen 512, it still fails.
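
A rough estimate of the prefix size suggests why lowering pre-seq-len barely changes memory use (the 28-layer, 4096-hidden-size figures are ChatGLM-6B's published config and are assumed to apply here): the prefix itself is only a few million parameters, so activations, gradients, and optimizer buffers dominate, not the prefix length.

# P-Tuning v2 prefix size estimate (assumes ChatGLM-6B: 28 layers, hidden size 4096).
num_layers, hidden_size = 28, 4096
for pre_seq_len in (32, 16):
    prefix_params = pre_seq_len * num_layers * 2 * hidden_size  # key + value vectors per layer
    print(pre_seq_len, f"{prefix_params / 1e6:.1f}M params",
          f"~{prefix_params * 2 / 2**20:.0f} MiB in fp16")
# 32 -> ~7.3M params (~14 MiB); 16 -> ~3.7M (~7 MiB): negligible either way.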

@sanwei111 (Author)

I'm using the chatglm_finetuning-stable-vocab130528-v2 branch.
