ptv2 runs out of GPU memory? #232
The model is chatglm, without quantization.
OOM means memory really is insufficient. Changing the batch size to 1 should work.
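For reference, a minimal sketch of the change being suggested; the dict and key names below are assumptions about this repo's config layout, not its actual API, so match them to whatever train.py actually reads:

```python
# Hypothetical excerpt of the training config; key names are assumptions,
# check the args that train.py imports.
train_info_args = {
    'train_batch_size': 1,   # a per-GPU batch of 1 minimizes activation memory
    'max_seq_length': 512,   # shorter sequences also cut memory, roughly linearly
}
```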
Still fails. The launch command is CUDA_VISIBLE_DEVICES=0,1 python train.py
After changing the sequence length, did you delete the cached data under output?
After changing the parameters, delete the data under output and rerun data_utils.py? Already did that.
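A sketch of that "delete the cache, rebuild it" step, assuming the tokenized records live under ./output and that data_utils.py regenerates them; adjust the path to your checkout:

```python
# Drop stale records built with the old max_seq_length, then rebuild.
import shutil
import subprocess
from pathlib import Path

cache_dir = Path("output")
if cache_dir.exists():
    shutil.rmtree(cache_dir)  # remove the old tokenized dataset cache
subprocess.run(["python", "data_utils.py"], check=True)  # regenerate it
```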
Try turning deepspeed off and running again!
Turned it off and tried; it seems somewhat better, but memory usage is still very high.
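For context, in a Lightning 2.x trainer (which the traceback below shows this setup uses), DeepSpeed is toggled via the strategy argument; this repo may wire it through its own config file instead, so treat this only as an illustration of the switch:

```python
from lightning.pytorch import Trainer

# With DeepSpeed ZeRO stage 2 -- the path that OOMs in the traceback below:
# trainer = Trainer(accelerator="gpu", devices=2, strategy="deepspeed_stage_2")

# Without DeepSpeed: plain distributed data parallel, fp16 mixed precision.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", precision="16-mixed")
```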
What is your ptv2 pre-seq-len? Try reducing it; first find a set of parameters that can run at all!
The default: 32.
Tried pre-seq-len 16, batch size 2, max seq len 512; still fails.
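For a sense of scale (assuming ChatGLM-6B's shape: 28 layers, hidden size 4096), the prefix parameters in p-tuning v2 are tiny, which would explain why halving pre-seq-len barely moves the OOM below:

```python
# Rough prefix-parameter count for p-tuning-v2 on a ChatGLM-6B-shaped model.
# Shapes are assumptions (28 layers, hidden 4096); the 2x covers keys + values.
pre_seq_len, num_layers, hidden = 32, 28, 4096
prefix_params = pre_seq_len * num_layers * 2 * hidden
print(f"{prefix_params / 1e6:.1f}M prefix parameters")  # ~7.3M, negligible next to 6B
```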
Using the chatglm_finetuning-stable-vocab130528-v2 branch.
GPUs: two V100s, 24 GB each.
max_seq_len=512
train_batchsize=2
Traceback (most recent call last):
File "/workspace/code/code/chatglm_finetuning-stable-vocab130528-v2/train.py", line 182, in
trainer.fit(pl_model, train_dataloaders=train_datasets)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
self.strategy.setup(self)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 344, in setup
self.init_deepspeed()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 448, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 484, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 413, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1408, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in init
self.initialize_optimizer_states()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 608, in initialize_optimizer_states
single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 1; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "train.py", line 182, in
trainer.fit(pl_model, train_dataloaders=train_datasets)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 92, in launch
return function(*args, **kwargs)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 911, in _run
self.strategy.setup(self)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 344, in setup
self.init_deepspeed()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 448, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 484, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/lightning/pytorch/strategies/deepspeed.py", line 413, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 308, in init
self._configure_optimizer(optimizer, model_parameters)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1408, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 485, in init
self.initialize_optimizer_states()
File "/root/miniconda3/envs/ptunejw/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 608, in initialize_optimizer_states
single_grad_partition = torch.zeros(int(self.partition_size[i]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 0; 31.75 GiB total capacity; 23.00 GiB already allocated; 7.91 GiB free; 23.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
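Two notes on the error itself. First, the failing allocation is ZeRO's fp32 gradient partition (11.50 GiB per rank, roughly 2 × 11.5 GiB / 4 bytes ≈ 6B fp32 gradients across the two ranks), which is sized by the number of trainable parameters rather than by the batch size; that suggests the full ~6B-parameter model is being trained, not only the ptv2 prefix. Second, the message's own fragmentation hint can be tried via the allocator config, set before CUDA initializes:

```python
# The fragmentation workaround named in the error text. Must be set before
# torch initializes CUDA, e.g. at the very top of train.py, or in the shell:
#   export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# 128 is a starting point, not a tuned value.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```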