The first term of “grad” seems to be wrong #2

PhyscalX opened this issue Aug 17, 2017 · 8 comments

PhyscalX commented Aug 17, 2017

In the code, the first term of the gradient is
`gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]`.
However, when i == j, the factor `(prob_data[ind_i] - 1)` should turn it into
`-gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]`;
otherwise it becomes a gradient ascent optimization.
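
For reference, here is the derivation I am assuming (a sketch in my own notation: write p_t for `prob_data[ind_i]` of the target class, so `power_prob_data[ind_i]` = (1 - p_t)^gamma and `log_prob_data[ind_i]` = log p_t; alpha is omitted):

```latex
% Focal loss for the target class t:
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\log p_t
% Derivative w.r.t. the probability:
\frac{\partial \mathrm{FL}}{\partial p_t}
  = \gamma (1 - p_t)^{\gamma-1}\log p_t - \frac{(1 - p_t)^{\gamma}}{p_t}
% Chain rule with the softmax Jacobian \partial p_t / \partial z_t = p_t (1 - p_t):
\frac{\partial \mathrm{FL}}{\partial z_t}
  = \gamma\, p_t (1 - p_t)^{\gamma}\log p_t - (1 - p_t)^{\gamma+1}
```

If I am not mistaken, whichever way the softmax factor is folded in, the i == j branch should reduce to the last line above.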

@PhyscalX changed the title from "The first term 哦分" to "The first term of “grad” seems to be wrong" on Aug 17, 2017
@zimenglan-sysu-512 (Owner) commented:

hi @PhyscalX,

Here I simply ignore the sign, because it is cancelled when multiplying by `-1 * p_i * (p_i - 1)` or `-1 * p_j * p_i`, which come from the derivative of the normal cross-entropy loss with a softmax function. You can see lines 141 and 144 in the .cu file.
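
For readers following the thread: the factors referred to here are, I assume, the entries of the softmax Jacobian, which is where the sign cancellation comes from:

```latex
\frac{\partial p_i}{\partial z_j} =
\begin{cases}
  p_i (1 - p_i) \;=\; -1 \cdot p_i (p_i - 1), & i = j,\\
  -\,p_i\, p_j  \;=\; -1 \cdot p_j \cdot p_i, & i \neq j.
\end{cases}
```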

PhyscalX commented Aug 17, 2017

Your mathematical reasoning is right, the sign does cancel.
But `(power_prob_data[ind_i] / (1 - prob_data[ind_i]))` multiplied by `(prob_data[ind_i] - 1)`
also gives `-power_prob_data[ind_i]`, when it should actually be `power_prob_data[ind_i]`.

Besides, taking `log(p)` directly is dangerous, because the softmax outputs can get extremely close to zero.
Prefer to add an eps (e.g. 1e-10) first.

Also, your loss is wrong as well.
In the code it is
`loss[index] = -log(max(power_prob_data[ind] * log_prob_data[ind], Dtype(FLT_MIN)));`
however, it should be
`loss[index] = -power_prob_data[ind] * log_prob_data[ind];`

@zimenglan-sysu-512 (Owner) commented:

hi @PhyscalX,

You are right, the loss is computed incorrectly, and thanks for the reminder about the log operation. I will update my code tomorrow and run some tests.

For the first term you mention, I need to double-check.

Thanks again.

@PhyscalX (Author) commented:

I have verified my fix on cifar10-quick: it works and reaches validation accuracy similar to the original loss at (alpha = 1.0/0.75/0.5/0.25, gamma = 2.0).

eps is very important in focal loss; all of the divisions in your code are dangerous. With alpha > 0.25, training tends to hit NaN on cifar10-quick towards the end of convergence.

I set eps to 1e-10 in my framework Dragon.
If you are interested, check the following code:

(op_kernel.h, line 336), the declaration of kernel::SparseSoftmaxFocalLoss
(op_kernel.cc, line 777), the CPU implementation of kernel::SparseSoftmaxFocalLoss
(op_kernel.cu, line 1417), the CUDA implementation of kernel::SparseSoftmaxFocalLoss
(sparse_softmax_focal_loss_op.h), the declaration of SparseSoftmaxFocalLossOp
(sparse_softmax_focal_loss_op.cc), the implementation of SparseSoftmaxFocalLossOp
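
As an illustration of the same point about the divisions, one simple guard is to clamp every denominator by eps before dividing (my own sketch, not code from Dragon or from this repository):

```cpp
#include <algorithm>

using Dtype = float;

// Clamp the denominator away from zero so that nearly saturated
// probabilities (p -> 0 or p -> 1) cannot turn a division into inf/NaN.
inline Dtype safe_div(Dtype numerator, Dtype denominator,
                      Dtype eps = Dtype(1e-10)) {
  return numerator / std::max(denominator, eps);
}

// e.g. the quoted first term would become
//   gamma * safe_div(power_prob_data[ind_i],
//                    Dtype(1) - prob_data[ind_i]) * log_prob_data[ind_i];
```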

zimenglan-sysu-512 commented Aug 18, 2017

hi @PhyscalX,

Thanks a lot. I have fixed the problems you pointed out. For the gradient, I had forgotten to differentiate the (1 - p_t) term when I ignored the sign; I have added it back now.

Thanks again.

zimenglan-sysu-512 commented Aug 18, 2017

hi @PhyscalX,

You're right, eps is very important; I added it to solve the NaN problem. It now runs normally, you can have a look.

Thanks for pointing out my errors and giving such useful suggestions.

PhyscalX commented Aug 18, 2017

And lastly, I recommend multiplying `grad` by `prob_data[ind_i]`.
Dividing directly may still lead to numerical issues.
Formulate in ONE way, and implement in ANOTHER.
There are plenty of tricks in turning mathematical formulations into code.
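
If I read the suggestion correctly, the point is to simplify the expression algebraically before coding it, so the division disappears entirely. A sketch under the derivation earlier in the thread (hypothetical names, alpha omitted):

```cpp
#include <cmath>

using Dtype = float;

// Gradient w.r.t. the target-class logit, written without any division:
//   dFL/dz_t = gamma * p * (1 - p)^gamma * log(p) - (1 - p)^(gamma + 1)
// The (1 - p)^gamma / p term of dFL/dp cancels against the p in the softmax
// Jacobian p * (1 - p), so no 1/p or 1/(1 - p) is ever evaluated.
Dtype focal_grad_target(Dtype prob, Dtype gamma,
                        Dtype eps = Dtype(1e-10)) {
  const Dtype one_minus_p = Dtype(1) - prob;
  const Dtype power_prob  = std::pow(one_minus_p, gamma);  // (1 - p)^gamma
  const Dtype log_prob    = std::log(prob + eps);          // guarded log(p)
  return gamma * prob * power_prob * log_prob - power_prob * one_minus_p;
}
```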

@zimenglan-sysu-512 (Owner) commented:

hi @PhyscalX,

I have updated the code.
Thanks a lot.
