Stylegan2 Training GPU usage maximization #16

Open

datafireball opened this issue Apr 5, 2020 · 0 comments
I came to post this question here because the NVlabs stylegan and stylegan2 projects provide minimal instruction about training and don't allow creating issues. In the README, the author lists the expected run time for a given number of GPUs (1, 2, 4 or 8) and number of training images at each resolution (a DGX-1 box has 8 Tesla V100 GPUs with 32 GB each).

| Configuration | Resolution | Total kimg | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | GPU mem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| config-f | 1024×1024 | 25000 | 69d 23h | 36d 4h | 18d 14h | 9d 18h | 13.3 GB |
| config-f | 1024×1024 | 10000 | 27d 23h | 14d 11h | 7d 10h | 3d 22h | 13.3 GB |
| config-e | 1024×1024 | 25000 | 35d 11h | 18d 15h | 9d 15h | 5d 6h | 8.6 GB |
| config-e | 1024×1024 | 10000 | 14d 4h | 7d 11h | 3d 20h | 2d 3h | 8.6 GB |
| config-f | 256×256 | 25000 | 32d 13h | 16d 23h | 8d 21h | 4d 18h | 6.4 GB |
| config-f | 256×256 | 10000 | 13d 0h | 6d 19h | 3d 13h | 1d 22h | 6.4 GB |

Question: Is there a way to tune the parameters so that GPU usage is fully maximized on a given host? If there isn't a magic flag like that, what are the key parameters I should dial up or down for my training host's hardware?

On one extreme:
Each Tesla V100 has 32 GB of memory but training at 256×256 only uses 6.4 GB of it, and a DGX-2 doubles the GPU count from 8 to 16, so the memory usage would be only 8 * 6.4 / (16 * 32) = 10%. If we can tweak something like the minibatch size, does that mean we could cut the training time from 13 days down to about 2 days?
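A back-of-envelope version of that claim (my own arithmetic, not anything from the repo, and it assumes wall-clock time scales inversely with how large a minibatch you can fit, which real scaling won't match exactly):

```python
# Back-of-envelope arithmetic for the "one extreme" above. Assumption (mine):
# time per kimg scales inversely with total minibatch; in practice the data
# pipeline and inter-GPU communication will eat into this.
mem_used_per_gpu  = 6.4    # GB, config-f @ 256x256 from the table above
mem_total_per_gpu = 32.0   # GB on a 32 GB V100
gpus_benchmark    = 8      # the DGX-1 columns in the table
gpus_available    = 16     # DGX-2

utilization = (gpus_benchmark * mem_used_per_gpu) / (gpus_available * mem_total_per_gpu)
print(f"memory utilization: {utilization:.0%}")        # -> 10%

# If memory were the only limit, there is ~1/utilization = 10x headroom,
# which is what makes the "13 days -> ~2 days" hope look plausible on paper.
print(f"theoretical headroom: {1 / utilization:.0f}x")
```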

On the other extreme:
I might only have two small gaming GPUs with 6 GB of memory each. All of the benchmarks above need more than 6 GB per GPU, so presumably a different (smaller) batch size would be required.

Looking at stylegan2's run_training.py, the closest parameters I found are --total-kimg and --num-gpus, and maybe --config too.

parser.add_argument('--num-gpus', help='Number of GPUs (default: %(default)s)', default=1, type=int, metavar='N')
parser.add_argument('--total-kimg', help='Training length in thousands of images (default: %(default)s)', metavar='KIMG', default=25000, type=int)

But --total-kimg feels like the total number of (thousands of) training images shown to the network, i.e. the length of the run rather than its width (how much work each step does).

Looking into training_loop.py, there are another ~50 parameters, such as minibatch_size_base=32 and minibatch_gpu_base=4, which I believe directly impact training throughput, but I don't fully understand which knob I should turn.
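For what it's worth, those two look like exactly the knobs I'm after. Here is a minimal sketch of how I would try dialing them, under my (possibly wrong) reading that minibatch_size_base is the global batch summed over GPUs and minibatch_gpu_base is the slice each GPU processes per pass, with gradient accumulation covering any difference; the values are illustrative, not recommendations:

```python
# Sketch only: the defaults 32 and 4 are quoted from training_loop.py; in my
# checkout the same names are also set on a "sched" EasyDict inside
# run_training.py's run(), so that seems like the natural place to override them.
from dnnlib import EasyDict   # same EasyDict the repo uses

sched = EasyDict()            # in run_training.py this dict already exists

sched.minibatch_size_base = 32    # global minibatch across all GPUs (default)
sched.minibatch_gpu_base  = 4     # images per GPU per forward/backward pass (default)

# "One extreme" (32 GB V100s): raise the per-GPU batch until nvidia-smi shows
# memory close to full, so each step needs fewer accumulation passes.
sched.minibatch_gpu_base  = 16

# "Other extreme" (two 6 GB gaming GPUs): shrink both; a smaller global batch
# presumably also means retuning the learning rates
# (sched.G_lrate_base / sched.D_lrate_base).
# sched.minibatch_gpu_base  = 2
# sched.minibatch_size_base = 16
```

If someone can confirm whether this is the right pair of knobs, and whether the learning rates need to scale with the batch size, that would basically answer my question.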

Thoughts?
