Question about an error when fine-tuning KoAlpaca polyglot 12.8b #107

puritysarah · 2023-11-06T04:28:03Z

Hello,

I am fine-tuning the 12.8b model with the code at https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/run_clm.py on eight A100 40G GPUs, and I get the following error. (For the training script I used https://github.com/Beomi/KoAlpaca/blob/main/train_v1.1b/train.sh.)

Traceback (most recent call last):
  File "run_clm_2.py", line 636, in <module>
    main()
  File "run_clm_2.py", line 412, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2172, in from_pretrained
    raise ValueError("Passing along a device_map requires low_cpu_mem_usage=True")
ValueError: Passing along a device_map requires low_cpu_mem_usage=True

So I added the low_cpu_mem_usage=True option when loading the model, and then got the error below.

Traceback (most recent call last):
  File "run_clm_2.py", line 636, in <module>
    main()
  File "run_clm_2.py", line 412, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2180, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with low_cpu_mem_usage=True or with passing a device_map.

I ran the code shared on GitHub as-is, changing only the number of GPUs, and still get this error. Could you help with this?
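Read together, the two error messages quoted above describe a pair of constraints: a device_map needs low_cpu_mem_usage=True, while DeepSpeed ZeRO-3 accepts neither. The sketch below only encodes those two rules as stated in the messages. It is not a fix confirmed in this thread, and `build_load_kwargs` and its `zero3_enabled` parameter are hypothetical names introduced for illustration.

```python
from typing import Any, Dict, Optional


def build_load_kwargs(
    zero3_enabled: bool, device_map: Optional[Any] = "auto"
) -> Dict[str, Any]:
    """Hypothetical helper: choose from_pretrained keyword arguments that
    satisfy the two constraints named in the error messages."""
    if zero3_enabled:
        # Second message: ZeRO-3 is incompatible with low_cpu_mem_usage=True
        # and with passing a device_map, so pass neither.
        return {}
    if device_map is None:
        # No device_map requested, so nothing extra is needed.
        return {}
    # First message: passing a device_map requires low_cpu_mem_usage=True.
    return {"device_map": device_map, "low_cpu_mem_usage": True}


if __name__ == "__main__":
    print(build_load_kwargs(zero3_enabled=True))
    print(build_load_kwargs(zero3_enabled=False))
```

The returned dictionary would be unpacked into the loading call, for example `AutoModelForCausalLM.from_pretrained(name, **kwargs)`. transformers provides an `is_deepspeed_zero3_enabled()` check that could supply the flag, though its import path differs between versions, so treat that wiring as an assumption to verify against the installed release.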


Beomi · 2023-11-06T04:42:23Z

Could you run

pip install -U transformers accelerate

to bring both packages up to the latest versions, then run the training again and check whether the same error occurs?
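To confirm that the upgrade took effect in the environment that actually launches training, the installed versions can be printed from Python. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Hypothetical helper: map each package name to its installed version,
    or to None when the package is not installed in this environment."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions(["transformers", "accelerate"]).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this with the same interpreter that the launcher uses helps rule out a mismatch between the environment where pip ran and the one where the script runs.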

puritysarah · 2023-11-06T14:29:17Z

First, thank you for the quick reply.

Even after updating both packages and running again, the error still occurs. I get the same error on other servers (with 16, 8, and 4 GPUs) as well.

Traceback (most recent call last):
  File "/workspace/train_v1.1b/run_clm.py", line 637, in <module>
    main()
  File "/workspace/train_v1.1b/run_clm.py", line 413, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2662, in from_pretrained
    raise ValueError("Passing along a `device_map` requires `low_cpu_mem_usage=True`")
ValueError: Passing along a `device_map` requires `low_cpu_mem_usage=True`
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191726 closing signal SIGTERM
[2023-11-06 14:22:09,790] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191727 closing signal SIGTERM
[2023-11-06 14:22:09,791] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191728 closing signal SIGTERM
[2023-11-06 14:22:10,456] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 191725) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_clm.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-11-06_14:22:09
host : gpu-a100x8-1.us-central1-c.c.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 191725)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
