I came to post this question here because the NVlabs stylegan and stylegan2 projects provide minimal instruction about training and don't allow creating issues. In the README, the authors list the expected training time for a given number of GPUs (1, 2, 4, or 8) and total number of training images at a certain resolution (measured on a DGX-1 box with 8 Tesla V100 GPUs, 32 GB each).
| Configuration | Resolution | Total kimg | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs | GPU mem |
| --- | --- | --- | --- | --- | --- | --- | --- |
| config-f | 1024×1024 | 25000 | 69d 23h | 36d 4h | 18d 14h | 9d 18h | 13.3 GB |
| config-f | 1024×1024 | 10000 | 27d 23h | 14d 11h | 7d 10h | 3d 22h | 13.3 GB |
| config-e | 1024×1024 | 25000 | 35d 11h | 18d 15h | 9d 15h | 5d 6h | 8.6 GB |
| config-e | 1024×1024 | 10000 | 14d 4h | 7d 11h | 3d 20h | 2d 3h | 8.6 GB |
| config-f | 256×256 | 25000 | 32d 13h | 16d 23h | 8d 21h | 4d 18h | 6.4 GB |
| config-f | 256×256 | 10000 | 13d 0h | 6d 19h | 3d 13h | 1d 22h | 6.4 GB |
Question: Is there a way to tune the parameters so that GPU usage is fully maximized for the host it runs on? If there isn't a magic flag like that, what are the key parameters I should dial up or down given my training host's technical specification?
On one extreme:
Each Tesla V100 has 32 GB of memory, yet the training only uses 6.4 GB of it, and on a DGX-2 the GPU count doubles from 8 to 16, so utilization would be only 8 * 6.4 / (16 * 32) = 10%. If we can tweak something like the minibatch size, does that mean we could cut the training time from 13 days to about 2 days?
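To sanity-check that figure, here is the back-of-the-envelope calculation I'm doing (the 8 × 6.4 GB comes from the 256×256 config-f row above; the 16 × 32 GB is the DGX-2 total I'm assuming):

```python
# Rough GPU-memory utilization: the 256x256 config-f benchmark touches
# ~6.4 GB on each of 8 GPUs, while a DGX-2 offers 16 GPUs x 32 GB each.
used_gb = 8 * 6.4         # memory actually used in the benchmark
avail_gb = 16 * 32        # total memory available on a DGX-2
print(f"utilization ~ {used_gb / avail_gb:.0%}")  # -> 10%
```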
On the other extreme:
I might only have two small gaming GPUs with 6 GB of memory each; that would require a different batch size, since every benchmark above uses more than 6 GB of GPU memory.
Looking at the stylegan2 run_training.py, the closest parameters I found are --total-kimg and --num-gpus, and maybe --config too:
```python
parser.add_argument('--num-gpus', help='Number of GPUs (default: %(default)s)', default=1, type=int, metavar='N')
parser.add_argument('--total-kimg', help='Training length in thousands of images (default: %(default)s)', metavar='KIMG', default=25000, type=int)
```
But --total-kimg feels like it just sets how many thousands of training images to run through, i.e. the length of training rather than its width (throughput). Looking into training_loop.py, there are another ~50 parameters such as minibatch_size_base=32 and minibatch_gpu_base=4 that I believe directly impact training throughput, but I don't fully understand which knob to turn.
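For what it's worth, this is the kind of tweak I have in mind, written as a sketch rather than a recipe. I'm assuming the `sched` EasyDict that run_training.py builds is forwarded to `training_schedule()` in training_loop.py, and the concrete values (16 and 128) are guesses that would need to be checked against actual memory usage:

```python
from dnnlib import EasyDict  # EasyDict ships with the stylegan2 repo

# Sketch of the override I have in mind (assumed wiring: run_training.py's
# `sched` dict reaches training_schedule() in training_loop.py).
sched = EasyDict()

# Samples processed at a time by one GPU. The 256x256 benchmark uses ~6.4 GB
# with the default of 4, so a 32 GB V100 can plausibly hold more -- and a
# 6 GB gaming card would need less.
sched.minibatch_gpu_base = 16     # default: 4 (guess; measure before trusting)

# Global minibatch summed over all GPUs; keep it a multiple of
# minibatch_gpu_base * num_gpus.
sched.minibatch_size_base = 128   # default: 32
```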
Thoughts?