
accuracy drop #15

Open
dhimanpd opened this issue May 25, 2017 · 4 comments
@dhimanpd

Hi @kuza55,
When I use a single GPU for training, the model reaches a training accuracy of 99.99%. But when I use make_parallel, the training accuracy gets stuck at 96%.

Minimum loss:
Single GPU: 0.0063
Multi GPU: 0.1213
The loss also does not drop much.

I am training a multi-label classifier based on ResNet-50, with sigmoid activations at the output and binary crossentropy loss.
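
Roughly, the setup looks like this (a minimal sketch; the label count, input size, and optimizer are placeholders, not my exact configuration):

```python
# Rough sketch of the setup (label count, input size, and optimizer are placeholders).
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from multi_gpu import make_parallel  # from this repo

n_labels = 20  # placeholder for the actual number of labels

base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(n_labels, activation='sigmoid')(x)  # multi-label head
model = Model(inputs=base.input, outputs=out)

# Single-GPU run: compile and fit `model` directly (reaches ~99.99% training accuracy).
# Multi-GPU run: wrap with make_parallel first (gets stuck around 96%).
parallel_model = make_parallel(model, gpu_count=2)
parallel_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```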

kuza55 commented Jun 8, 2017

Did you divide your batch size by the number of GPUs you're using? Not really sure what else could be causing problems.

kuza55 commented Jun 8, 2017

Sorry, I misspoke; what I meant to say was:

If you followed the instructions, you probably multiplied your batch size by the number of GPUs you're using. This increases throughput, but it also makes your batches bigger, which can result in worse accuracy.

If you kept the same batch size (and it is divisible by the number of GPUs), I would expect you to get the same performance.

Assuming this is the issue, you could try playing with other hyperparameters like the learning rate or dropout, etc.
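
To make that concrete, something like this (the numbers are just for illustration, not tuned values):

```python
# Illustration only; the numbers here are made up.
n_gpus = 2
single_gpu_batch = 32          # batch size that worked on one GPU
base_lr = 1e-3                 # learning rate tuned for that batch size

# Option A: keep the same global batch size as the single-GPU run.
# It must be divisible by n_gpus; each GPU then sees 32 // 2 = 16 samples per step,
# and I would expect roughly the same accuracy as before.
batch_size = single_gpu_batch

# Option B: scale the batch up for throughput, but then retune hyperparameters,
# e.g. raise the learning rate along with the batch size.
batch_size = single_gpu_batch * n_gpus
lr = base_lr * n_gpus
```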

dhimanpd commented Jul 9, 2017

I tried the same batch size, but the accuracy still drops. Did you try to reproduce a model trained on a single GPU?

@DarkForte

Here is my take:
When you train a multi-GPU model with the same batch size as on a single GPU, each GPU sees fewer training samples at a time, so the gradient estimated from each slice is noisier. As a result, accuracy can drop as well.

When you enlarge the batch size to n_gpu * batch_size, you get fewer opportunities to update your model within one epoch. For example, if you have 1000 training samples and you enlarge your batch size from 50 to 100 on 2 GPUs, each GPU still sees 50 samples per iteration, but you apply gradients to the model only half as often.

This is in fact a real problem for parallel training. Take a look at this paper: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. The first trick to try is to increase your learning rate by a factor of n_gpu as well. The simplified explanation: with larger batches your estimated gradient is more accurate, so you can trust it more and take larger steps.
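
A sketch of that linear learning-rate scaling trick (the numbers are hypothetical, and the paper also uses a gradual warmup, which I leave out here):

```python
# Linear learning-rate scaling sketch (hypothetical numbers).
from keras.optimizers import SGD

n_gpus = 2
base_lr = 0.01                 # learning rate tuned for the single-GPU batch size
scaled_lr = base_lr * n_gpus   # scale linearly with the effective batch size

optimizer = SGD(lr=scaled_lr, momentum=0.9)
# parallel_model.compile(optimizer=optimizer,
#                        loss='binary_crossentropy',
#                        metrics=['accuracy'])
```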
