ZeroDivisionError: integer division or modulo by zero #11
Comments
Hi, are you getting the error where that's an empty array currently? Could you share your setup details? I have typically only had problems very transiently, and they are fixed by increasing the number of workers; it can also happen if the code to compute the gradient takes a really long time. To be honest, this code was written 3 years ago when multiprocessing libraries were in a very different state. At this point, if I were going to write the code again, or even use it for another paper, I think I would use libraries that don't expose the user to as much churn from lower-level processes. |
Sorry to take so long to reply to you. My parameter setting details are as follows:
"mode == sketch"
"num_clients == 20"
"num_workers == 20"
"num_devices == 1"
"share_ps_gpu , action= "store_true" "
The problem is described as follows:
File "CommEfficient\fed_aggregator.py", line 232, in _call_train
per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero
I encountered this problem when running cv_train.py. I think these parameters may be the cause; if other parameters need to be set, please let me know. Thank you very much!
|
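For anyone hitting the same traceback: the division in _call_train fails whenever update_forward_grad_ps is empty. Below is a minimal, self-contained sketch of that failure mode; the variable names mirror the traceback above but are stand-ins, not the actual fed_aggregator.py objects.

```python
# Stand-ins for the aggregator state referenced in the traceback above.
worker_batches = [f"batch_{i}" for i in range(20)]  # one batch per selected client
update_forward_grad_ps = []                         # empty: no worker processes were started

try:
    # Mirrors the failing line: per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
    per_proc = len(worker_batches) // len(update_forward_grad_ps)
except ZeroDivisionError as err:
    print(f"ZeroDivisionError: {err}")  # integer division or modulo by zero
```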
Alright, so I think the issue is that you've got 20 clients and 20 workers. This means you're trying to get through the entire dataset at each iteration. Can you try, say, 100 clients and 20 workers? Also, you can try increasing the timeout; try 900s. |
Hello, my device only has one GPU, and this problem also occurs when executing the code. Have you solved it? |
Hi, this error occurs when the worker processes do not enqueue to update_forward_grad_ps in time. Can you try increasing the timeout or increasing the number of clients? If you try, for example, clients=1 and workers=1, then you're trying to do the entire dataset at each iteration, and the default timeout is (perhaps) not long enough to process the entire dataset with only 1 DataLoader worker. |
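To illustrate the timing issue described in the comment above (a sketch, not the repo's actual queue handling): if the consumer only waits a fixed timeout for each worker to enqueue its result, a slow gradient computation means nothing arrives in time and the collected list stays empty.

```python
import multiprocessing as mp
import time
from queue import Empty

def slow_worker(q, seconds):
    # Simulates a worker whose gradient computation takes longer than the timeout.
    time.sleep(seconds)
    q.put("gradient")

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=slow_worker, args=(q, 5.0))
    p.start()

    results = []
    try:
        # With a 1 s timeout and a 5 s worker, nothing is collected.
        results.append(q.get(timeout=1.0))
    except Empty:
        pass

    print(len(results))  # 0 -> a later division by len(results) would raise ZeroDivisionError
    p.join()
```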
Thanks for your reply. I've tried increasing the number of clients and workers and I still get this problem. I think it's the number of devices that causes it, as shown below. My device has only one GPU, so I set num_devices to 1. During code execution, if num_devices=1 and share_ps_gpu=False, then n_worker_gpus=0. This means the following for loop is never executed, so the update_forward_grad_ps list is empty. |
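A rough sketch of the launch logic described in the previous comment (the formula for n_worker_gpus is assumed from that description, not copied from fed_aggregator.py): with one GPU and share_ps_gpu=False, zero worker GPUs are available, the spawn loop never runs, and update_forward_grad_ps stays empty.

```python
def plan_worker_processes(num_devices, share_ps_gpu):
    # Assumption: the parameter server reserves one GPU unless it is shared with the workers.
    n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1

    update_forward_grad_ps = []
    for gpu_id in range(n_worker_gpus):  # never entered when n_worker_gpus == 0
        update_forward_grad_ps.append(f"worker_on_gpu_{gpu_id}")
    return update_forward_grad_ps

print(plan_worker_processes(num_devices=1, share_ps_gpu=False))  # [] -> ZeroDivisionError later
print(plan_worker_processes(num_devices=1, share_ps_gpu=True))   # ['worker_on_gpu_0']
```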
Oh, I see! Yeah, so you need to set share_ps_gpu=True when you run the code. That way, the workers can share a GPU with the parameter server. This will limit the size of the model you're able to run, since you have to hold 2 copies in memory at the same time, but it's necessary if you are running on 1 GPU. |
Could you revert the change to torch.distributed.reduce and add these lines: |
The export commands are just adding some environment variables to make the error message more useful. The "NCCL error invalid usage" message you were originally getting is not descriptive because it could be a versioning error. |
per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
How can I set the number of processes and clients so that update_forward_grad_ps does not become an empty array?
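A hypothetical guard (not a patch from the repo) that would turn the bare ZeroDivisionError into a clearer message while the right configuration is worked out; the names are assumed from the traceback above.

```python
def batches_per_process(worker_batches, update_forward_grad_ps):
    # Hypothetical guard around the division from _call_train (names assumed from the traceback).
    if not update_forward_grad_ps:
        raise RuntimeError(
            "update_forward_grad_ps is empty: no worker processes were launched. "
            "On a single GPU, check num_devices and enable share_ps_gpu."
        )
    return len(worker_batches) // len(update_forward_grad_ps)

# Example: 20 client batches spread across 4 worker processes -> 5 batches each.
print(batches_per_process(list(range(20)), ["p0", "p1", "p2", "p3"]))  # 5
```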