Error when using ZeRO-1 + cpu_offload=true? #190

Open
SkrDrag opened this issue Jul 21, 2023 · 2 comments

Comments

SkrDrag commented Jul 21, 2023

No description provided.

SkrDrag commented Jul 21, 2023

Script used:
bash scripts/ds_finetune_superglue.sh \
     config_tasks/model_blocklm_2B.sh \
     config_tasks/task_copa.sh
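For context, ZeRO stage 1 with CPU offload corresponds to a zero_optimization section along these lines in the DeepSpeed config. This is only a minimal sketch written as a Python dict so the settings can be annotated; the batch-size and fp16 values are placeholders and not the actual contents of config_tasks/config_blocklm_10B.json:

# Sketch of the failing configuration (ZeRO-1 + cpu_offload); not the repo's real config file.
ds_config_zero1_offload = {
    "train_micro_batch_size_per_gpu": 4,         # placeholder value
    "gradient_accumulation_steps": 1,            # placeholder value
    "fp16": {"enabled": True, "loss_scale": 0},  # dynamic loss scaling, matching "using dynamic loss scaling" in the log
    "zero_optimization": {
        "stage": 1,           # ZeRO-1: partition optimizer state only
        "cpu_offload": True,  # the cpu_offload=true flag from the issue title
    },
}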

[2023-07-22 01:03:41,666] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-22 01:03:48,454] [INFO] [runner.py:358:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=61318 finetune_glm.py --deepspeed --deepspeed_config config_tasks/config_blocklm_10B.json --finetune --cloze-eval --experiment-name blocklm-2B-copa_07-22-01-03 --task COPA --data-dir /home/llw/workspace/dataset/COPA --save /home/llw/workspace/checkpoints --seq-length 256 --checkpoint-activations --eval-batch-size 16 --save-epoch 100000 --num-workers 1 --no-load-optim --no-load-lr-scheduler --block-lm --cloze-eval --task-mask --num-layers 36 --hidden-size 2048 --num-attention-heads 32 --max-position-embeddings 1024 --tokenizer-type GPT2BPETokenizer --load-pretrained /home/llw/workspace/checkpoints/blocklm-2b-512 --lr-decay-style linear --warmup 0.1 --weight-decay 1.0e-1 --pattern-id 0 --save-interval 10000 --log-interval 20 --eval-interval 1000 --eval-iters 100 --pattern-id 0 --fp16 --model-parallel-size 1 --epochs 100 --overwrite
[2023-07-22 01:03:49,366] [INFO] [launch.py:73:main] 0 NCCL_IB_DISABLE 0
[2023-07-22 01:03:49,367] [INFO] [launch.py:73:main] 0 NCCL_DEBUG info
[2023-07-22 01:03:49,367] [INFO] [launch.py:73:main] 0 NCCL_NET_GDR_LEVEL 2
[2023-07-22 01:03:49,367] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-07-22 01:03:49,367] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-07-22 01:03:49,367] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-07-22 01:03:49,367] [INFO] [launch.py:102:main] dist_world_size=4
[2023-07-22 01:03:49,367] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-07-22 01:03:50,965] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2023-07-22 01:03:50,982] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[2023-07-22 01:03:50,991] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
using world size: 4 and model-parallel size: 1

using dynamic loss scaling
[2023-07-22 01:03:50,999] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
initializing model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
padded vocab (size: 50265) with 39 dummy tokens (new size: 50304)
found end-of-document token: 50256
big-node0:864635:864635 [3] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864635:864635 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864635:864635 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864635:864635 [3] NCCL INFO NET/IB : No device found.
big-node0:864635:864635 [3] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864635:864635 [3] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864635:864924 [3] NCCL INFO Channel 00/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 01/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 02/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 03/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 04/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 05/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 06/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 07/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 08/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 09/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 10/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 11/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 12/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 13/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 14/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 15/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 16/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 17/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 18/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 19/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 20/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 21/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 22/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 23/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 24/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 25/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 26/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 27/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 28/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 29/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 30/32 : 0
big-node0:864635:864924 [3] NCCL INFO Channel 31/32 : 0
big-node0:864635:864924 [3] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864635:864924 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
big-node0:864635:864924 [3] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864635:864924 [3] NCCL INFO comm 0x7f7c5c002e10 rank 0 nranks 1 cudaDev 3 busId 57000 - Init COMPLETE
big-node0:864634:864634 [2] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864634:864634 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864634:864634 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864634:864634 [2] NCCL INFO NET/IB : No device found.
big-node0:864634:864634 [2] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864634:864634 [2] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864634:864931 [2] NCCL INFO Channel 00/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 01/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 02/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 03/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 04/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 05/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 06/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 07/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 08/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 09/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 10/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 11/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 12/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 13/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 14/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 15/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 16/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 17/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 18/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 19/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 20/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 21/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 22/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 23/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 24/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 25/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 26/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 27/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 28/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 29/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 30/32 : 0
big-node0:864634:864931 [2] NCCL INFO Channel 31/32 : 0
big-node0:864634:864931 [2] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864634:864931 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
big-node0:864634:864931 [2] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864634:864931 [2] NCCL INFO comm 0x7f7784002e10 rank 0 nranks 1 cudaDev 2 busId 56000 - Init COMPLETE
big-node0:864632:864632 [0] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864632:864632 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864632:864632 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864632:864632 [0] NCCL INFO NET/IB : No device found.
big-node0:864632:864632 [0] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864632:864632 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864633:864633 [1] NCCL INFO Bootstrap : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864633:864633 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
big-node0:864633:864633 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
big-node0:864633:864633 [1] NCCL INFO NET/IB : No device found.
big-node0:864633:864633 [1] NCCL INFO NET/Socket : Using [0]eno1:210.45.124.81<0> [1]usb0:169.254.3.1<0> [2]virbr0:192.168.122.1<0> [3]veth8ac8d00:fe80::64d7:e7ff:feed:6a02%veth8ac8d00<0>
big-node0:864633:864633 [1] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
big-node0:864633:864936 [1] NCCL INFO Channel 00/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 01/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 02/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 03/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 04/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 05/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 06/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 07/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 08/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 09/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 10/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 11/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 12/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 13/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 14/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 15/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 16/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 17/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 18/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 19/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 20/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 21/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 22/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 23/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 24/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 25/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 26/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 27/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 28/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 29/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 30/32 : 0
big-node0:864633:864936 [1] NCCL INFO Channel 31/32 : 0
big-node0:864633:864936 [1] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864633:864936 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
big-node0:864633:864936 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864633:864936 [1] NCCL INFO comm 0x7f9094002e10 rank 0 nranks 1 cudaDev 1 busId 52000 - Init COMPLETE
big-node0:864632:864934 [0] NCCL INFO Channel 00/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 01/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 02/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 03/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 04/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 05/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 06/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 07/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 08/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 09/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 10/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 11/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 12/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 13/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 14/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 15/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 16/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 17/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 18/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 19/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 20/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 21/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 22/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 23/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 24/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 25/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 26/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 27/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 28/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 29/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 30/32 : 0
big-node0:864632:864934 [0] NCCL INFO Channel 31/32 : 0
big-node0:864632:864934 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0-
big-node0:864632:864934 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
big-node0:864632:864934 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
big-node0:864632:864934 [0] NCCL INFO comm 0x7f130c002e10 rank 0 nranks 1 cudaDev 0 busId 4f000 - Init COMPLETE
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=train)
Added 400 mirror examples, total size is 800...
Returning 800 train examples with label dist.: [(0, 400), (1, 400)]
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=dev)
Returning 100 dev examples with label dist.: [(1, 45), (0, 55)]
building train and validation dataloaders ...
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=dev)
Returning 100 dev examples with label dist.: [(1, 45), (0, 55)]
Creating copa dataset from file at /home/llw/workspace/dataset/COPA (split=test)
Returning 500 test examples with label dist.: [(None, 500)]
building GLM model ...
number of parameters on model parallel rank 0: 1920122880
DeepSpeed is enabled.
[2023-07-22 01:04:11,745] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
big-node0:864633:865177 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864632:865174 [0] NCCL INFO Channel 00/02 : 0 1 2 3
big-node0:864635:865175 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864633:865177 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
big-node0:864632:865174 [0] NCCL INFO Channel 01/02 : 0 1 2 3
big-node0:864635:865175 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
big-node0:864635:865175 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff,ff000000,0fffffff
big-node0:864633:865177 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ff000000,0fffffff
big-node0:864634:865176 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864634:865176 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
big-node0:864634:865176 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ff000000,0fffffff
big-node0:864632:865174 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
big-node0:864632:865174 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
big-node0:864632:865174 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ff000000,0fffffff
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Channel 00 : 1[52000] -> 2[56000] via direct shared memory
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Channel 00 : 3[57000] -> 0[4f000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Channel 00 : 0[4f000] -> 1[52000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Channel 00 : 2[56000] -> 3[57000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864635:865175 [3] NCCL INFO Channel 00 : 3[57000] -> 2[56000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864633:865177 [1] NCCL INFO Channel 00 : 1[52000] -> 0[4f000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Channel 00 : 2[56000] -> 1[52000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 3(=57000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864632:865174 [0] NCCL INFO Channel 01 : 0[4f000] -> 1[52000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864633:865177 [1] NCCL INFO Channel 01 : 1[52000] -> 2[56000] via direct shared memory
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864634:865176 [2] NCCL INFO Channel 01 : 2[56000] -> 3[57000] via direct shared memory
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 0(=4f000)
big-node0:864635:865175 [3] NCCL INFO Channel 01 : 3[57000] -> 0[4f000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 2(=56000)
big-node0:864632:865174 [0] NCCL INFO Could not enable P2P between dev 0(=4f000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 3(=57000)
big-node0:864635:865175 [3] NCCL INFO Could not enable P2P between dev 3(=57000) and dev 2(=56000)
big-node0:864635:865175 [3] NCCL INFO Channel 01 : 3[57000] -> 2[56000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO Could not enable P2P between dev 1(=52000) and dev 0(=4f000)
big-node0:864633:865177 [1] NCCL INFO Channel 01 : 1[52000] -> 0[4f000] via direct shared memory
big-node0:864632:865174 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864632:865174 [0] NCCL INFO comm 0x7f11f0002e10 rank 0 nranks 4 cudaDev 0 busId 4f000 - Init COMPLETE
big-node0:864632:864632 [0] NCCL INFO Launch mode Parallel
big-node0:864634:865176 [2] NCCL INFO Could not enable P2P between dev 2(=56000) and dev 1(=52000)
big-node0:864634:865176 [2] NCCL INFO Channel 01 : 2[56000] -> 1[52000] via direct shared memory
big-node0:864633:865177 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864633:865177 [1] NCCL INFO comm 0x7f8f80002e10 rank 1 nranks 4 cudaDev 1 busId 52000 - Init COMPLETE
big-node0:864635:865175 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864634:865176 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
big-node0:864635:865175 [3] NCCL INFO comm 0x7f79bc002e10 rank 3 nranks 4 cudaDev 3 busId 57000 - Init COMPLETE
big-node0:864634:865176 [2] NCCL INFO comm 0x7f7670002e10 rank 2 nranks 4 cudaDev 2 busId 56000 - Init COMPLETE
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Using /home/llw/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/llw/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6447687149047852 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.5554640293121338 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.6554622650146484 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7334985733032227 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,175] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,187] [INFO] [engine.py:600:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-07-22 01:04:17,187] [INFO] [engine.py:605:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Creating fp16 ZeRO stage 1 optimizer
[2023-07-22 01:04:17,187] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Updating max_elements_per_comm from 50000000.0 -> 62626055.0
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 1919160320, max elements per com: 62626055.0
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 31, sub_partition_size: 15656513, padding: 22247292
[2023-07-22 01:04:17,187] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 1919160320 + 22247292 = 1941407612
[2023-07-22 01:04:17,193] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2023-07-22 01:04:17,193] [INFO] [stage1.py:368:get_data_parallel_sub_partitions] total_num_elements=1941407612
[2023-07-22 01:04:17,193] [INFO] [stage1.py:369:get_data_parallel_sub_partitions] world_size=4
[2023-07-22 01:04:17,193] [INFO] [stage1.py:370:get_data_parallel_sub_partitions] max_elements_per_comm=62626055.0
[2023-07-22 01:04:17,193] [INFO] [stage1.py:371:get_data_parallel_sub_partitions] sub_partition_size=15656513
[2023-07-22 01:04:17,193] [INFO] [stage1.py:372:get_data_parallel_sub_partitions] num_sub_partitions=124
[2023-07-22 01:04:17,193] [INFO] [stage1.py:373:get_data_parallel_sub_partitions] num_comm_intervals=31
[2023-07-22 01:04:17,193] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,205] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
[2023-07-22 01:04:17,210] [INFO] [logging.py:60:log_dist] [Rank 0] Using default max_elements_per_comm 50000000.0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] Total number of elements in model: 962560, max elements per com: 50000000.0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] sub_partition_count: 1, sub_partition_size: 240640, padding: 0
[2023-07-22 01:04:17,211] [INFO] [logging.py:60:log_dist] [Rank 0] number of elements with padding: 962560 + 0 = 962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:367:get_data_parallel_sub_partitions] **** partition info:
[2023-07-22 01:04:17,215] [INFO] [stage1.py:368:get_data_parallel_sub_partitions] total_num_elements=962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:369:get_data_parallel_sub_partitions] world_size=4
[2023-07-22 01:04:17,215] [INFO] [stage1.py:370:get_data_parallel_sub_partitions] max_elements_per_comm=962560
[2023-07-22 01:04:17,215] [INFO] [stage1.py:371:get_data_parallel_sub_partitions] sub_partition_size=240640
[2023-07-22 01:04:17,215] [INFO] [stage1.py:372:get_data_parallel_sub_partitions] num_sub_partitions=4
[2023-07-22 01:04:17,215] [INFO] [stage1.py:373:get_data_parallel_sub_partitions] num_comm_intervals=1
[2023-07-22 01:04:17,215] [INFO] [stage1.py:374:get_data_parallel_sub_partitions] ****
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000005, betas=(0.900000, 0.950000), weight_decay=0.010000, adam_w=1
[2023-07-22 01:04:17,236] [INFO] [stage1.py:152:__init__] ZeRO Elastic Checkpoint = True
Killing subprocess 864632
Killing subprocess 864633
Killing subprocess 864634
Killing subprocess 864635
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'finetune_glm.py', '--local_rank=3', '--deepspeed', '--deepspeed_config', 'config_tasks/config_blocklm_10B.json', '--finetune', '--cloze-eval', '--experiment-name', 'blocklm-2B-copa_07-22-01-03', '--task', 'COPA', '--data-dir', '/home/llw/workspace/dataset/COPA', '--save', '/home/llw/workspace/checkpoints', '--seq-length', '256', '--checkpoint-activations', '--eval-batch-size', '16', '--save-epoch', '100000', '--num-workers', '1', '--no-load-optim', '--no-load-lr-scheduler', '--block-lm', '--cloze-eval', '--task-mask', '--num-layers', '36', '--hidden-size', '2048', '--num-attention-heads', '32', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--load-pretrained', '/home/llw/workspace/checkpoints/blocklm-2b-512', '--lr-decay-style', 'linear', '--warmup', '0.1', '--weight-decay', '1.0e-1', '--pattern-id', '0', '--save-interval', '10000', '--log-interval', '20', '--eval-interval', '1000', '--eval-iters', '100', '--pattern-id', '0', '--fp16', '--model-parallel-size', '1', '--epochs', '100', '--overwrite']' died with <Signals.SIGSEGV: 11>.

With ZeRO-2 + cpu_offload=true there is no error and training runs normally.
Why is that?
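For comparison, the working ZeRO-2 setup presumably differs only in the stage value (again just a sketch, assuming the rest of the config is unchanged):

# Sketch of the ZeRO-2 variant that runs without the SIGSEGV (assumed; other keys unchanged).
zero_optimization_zero2 = {
    "stage": 2,           # ZeRO-2: partition gradients and optimizer state
    "cpu_offload": True,  # same offload flag as before
}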

SkrDrag commented Jul 22, 2023
