R1 Regularization #671
Looking at the error message more carefully, it seems to be trying to find the gradient of uncat with respect to its 4th argument. The signature for uncat is: … Now I don't quite understand why the second-order code calls uncat's back method for the fourth argument. But assuming it does so for legitimate reasons, the fix is simple. Just define:
… as a catch-all for any derivative request for any argument other than the first, and see if the code works with this. If it does, I will add this definition to core.jl. You can try the following version of AutoGrad, which includes the above fix: …
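A purely hypothetical sketch of what such a catch-all definition could look like, assuming AutoGrad's back(::typeof(f), ::Type{Arg{i}}, dy, y, x...) convention; this is my illustration and not necessarily the exact definition that went into core.jl:
using AutoGrad
# Hypothetical catch-all: report no gradient (nothing) for any derivative request
# on uncat. A more specific Arg{1} method, if one exists, still takes precedence
# under Julia's dispatch rules, so this effectively covers the other arguments.
AutoGrad.back(::typeof(AutoGrad.uncat), ::Type{AutoGrad.Arg{N}}, dy, y, x...) where {N} = nothing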
I wrote a minimal working example to test the issue: …
With AutoGrad 1.2.4, differentiating …
With …
However, the gradients are the same even if we don't include the following block: …
Moreover, the following outputs …
Hence, it runs without any compile-time issues; however, I don't think it outputs any second-order gradients. Is it possible that the newly defined back functions are too generic and are always used as the gradient of uncat?
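As a side note on the dispatch question above, a small standalone illustration (my own toy functions, unrelated to AutoGrad internals): a parametric catch-all method is less specific than a method for a concrete index, so the concrete method still wins and the catch-all only answers the remaining indices.
g(::Type{Val{1}}, x) = 2x                    # specific method for index 1
g(::Type{Val{N}}, x) where {N} = nothing     # generic catch-all for every other index
g(Val{1}, 3.0)                               # returns 6.0: the specific method still dispatches
g(Val{4}, 3.0)                               # returns nothing: handled by the catch-all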
First, mixing the old grad interface (i.e. grad(f)) with the new grad interface (i.e. @diff and grad(result, param)) is not well tested, and part of the problem seems to come from mixing the two. So if you can find a way to express the computation using only the new interface, that could solve the problem. Nevertheless, I am also trying to figure out what goes wrong when we do mix the two interfaces. I found two problems and pushed a new update to the dy/fix671 branch: …
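For reference, a minimal sketch of the two interfaces being contrasted, using a toy function of my own rather than code from the issue:
using AutoGrad
f(w) = sum(abs2.(w))                 # toy scalar-valued function
# Old interface: grad(f) returns a function computing the gradient w.r.t. the first argument.
gradf = grad(f)
g_old = gradf([1.0, 2.0, 3.0])
# New interface: wrap the input in a Param, record with @diff, query with grad(result, param).
w = Param([1.0, 2.0, 3.0])
result = @diff f(w)
g_new = grad(result, w)              # same gradient, 2 .* w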
It now works while using only the new interface:
using Knet
using Statistics: mean
atype = Knet.atype()
# A simple model for the example
struct Linear; w; b; end
Linear(in_dim::Int, out_dim::Int) = Linear(param(out_dim,in_dim,atype=atype), param0(out_dim,atype=atype))
(l::Linear)(x) = l.w * x .+ l.b
struct Model; lin1; lin2; lin3; end
Model(in_dim1::Int,in_dim2::Int) = Model(Linear(in_dim1, 1), Linear(in_dim2, 1), Linear(2, 1))
function (m::Model)(x, y)
out1 = m.lin1(x)
out2 = m.lin2(y)
outc = vcat(out1, out2)
return m.lin3(outc)
end
# Loss1: Only first order, Loss2: first+second order, test: only second order
function loss1(model, x, y)
out = model(x, y)
return mean(out)
end
function loss2(model, x, y)
out = model(x, y)
loss = mean(out)
xp = isa(x, Param) ? x : Param(x)
g = @diff sum(model(xp, y))
grad_out = grad(g, xp)
loss += sum(abs2.(grad_out)) / size(x)[end]
return loss
end
function test(model, x, y)
xp = Param(x)
g = @diff sum(model(xp, y))
grad_out = grad(g, xp)
return sum(abs2.(grad_out)) / size(x)[end]
end
x = convert(atype, randn(10, 8))
y = convert(atype, randn(5, 8))
model = Model(10, 5)
L = @diff loss1(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing
L = @diff loss2(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing
L = @diff test(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing
grad_result = @Knet.gcheck loss2(model, Param(x), y) (verbose=1,)
println("gcheck result: $grad_result") and it works without an error using the
Moreover, the gradients are no longer nothing, and gcheck also reports correct gradients. In addition, the fix for the mixed interface also seems to work for this test case. By adding the following code:
function loss_mixed_interface(model, x, y)
out = model(x, y)
loss = mean(out)
gradfn = grad(t -> sum(model(t, y)))
grad_out = gradfn(x)
loss += sum(abs2.(grad_out)) / size(x)[end]
return loss
end
function test_mixed_interface(model, x, y)
gradfn = grad(t -> sum(model(t, y)))
grad_out = gradfn(x)
return sum(abs2.(grad_out)) / size(x)[end]
end
L = @diff loss_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing
L = @diff test_mixed_interface(model, x, y)
@show value(L), grad(L, model.lin1.w)
L = nothing
we get additional output … which agrees with the results from using only the new interface.
Although it now works for higher-order gradients, I think this reintroduces the bug from denizyuret/AutoGrad.jl#75, as I get true for both of the following statements:
grad(x -> x*grad(y -> x+y)(x))(5.0) == 2
grad(x -> x*grad(y -> x+y)(1x))(5.0) == 1
I am trying to understand why the fixes for denizyuret/AutoGrad.jl#75 break the higher-order gradients with the mixed interface, and I will update if I can find a solution. I will also re-check whether the …
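For reference, my own reading of the expected values (not part of the original comment): the inner derivative d/dy (x + y) is 1 regardless of where it is evaluated, so the outer function reduces to f(x) = x and its derivative is 1. With a correct implementation both expressions should therefore evaluate to 1.0; getting 2 for the first one is the perturbation-confusion symptom discussed in denizyuret/AutoGrad.jl#75.
using AutoGrad
@show grad(x -> x*grad(y -> x+y)(x))(5.0)    # expected 1.0 with a correct implementation
@show grad(x -> x*grad(y -> x+y)(1x))(5.0)   # expected 1.0 as well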
I came across a very similar error while implementing an Implicit-GON (Gradient Origin Network) model for an implicit learning task. As I mentioned in the previous issue #670, I was trying to obtain the derivative of a loss function after two forward passes, which leads to a second-order derivative. In the following MWE, I want to take the derivative of …
using Knet
using Statistics: mean
atype = Knet.atype()
Knet.seed!(0)
function batched_linear(theta, x_in; atype = KnetArray{Float32})
# """
# multiply a weight matrix of size (O, I) with a batch of matrices
# of size (I, W, B) to have an output of size (O, W, B),
# where B is the batch size.
# size(theta) = (O, I)
# size(x_in) = (I, W, B)
# """
o = size(theta,1)
w = size(x_in, 2)
b = size(x_in, 3)
x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
return out
end
function get_mgrid(sidelen) # Create a grid
iterator = (range(-1,stop=1,length = sidelen))
return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end
function model_forw(theta, z) #Forward implementation of the model
# It is kind of a decoder model where we try to reconstruct a
# target by using z_in
z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
z_in = cat(c, z_rep, dims = 3)
z_in = (permutedims(z_in, (3,2,1)))
z = batched_linear(theta, z_in) .+ 0.001
end
function loss(theta, z, x) # Compute mean squared error loss
x_hat = model_forw(theta, z)
L = mean(sum((x_hat- x).^2, dims = 2))
end
function loss_train(theta,x)
z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
z = -grad(derivative_origin, z) # New latent point as negative gradient
x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
L = mean((x_hat- x).^2) # Compute mean squared error loss
end
num_latent = 2
i = 4
o = 1
w = 4
batch_size = 3
x = atype(randn(o,w,batch_size)) # Target
theta = Param(atype(randn(o,i))) # Model Weight
mgrid = get_mgrid(2) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
# z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_model = @diff loss_train(theta,x) # Differentiate the loss_train.
# It is working in this example
However, if I use higher-dimensional data and more layers in the model, as in the following modification of the above MWE, I wait too long and cannot obtain an output even after 10 minutes, so I stop the execution of the code. My implementation includes a lot of cat operations, reshaping, and permuting of dims, but I am not sure whether these operations slow down taking the derivative.
using Knet
using Statistics: mean
atype = Knet.atype()
Knet.seed!(0)
function batched_linear(theta, x_in; atype = KnetArray{Float32})
# """
# multiply a weight matrix of size (O, I) with a batch of matrices
# of size (I, W, B) to have an output of size (O, W, B),
# where B is the batch size.
# size(theta) = (O, I)
# size(x_in) = (I, W, B)
# """
o = size(theta,1)
w = size(x_in, 2)
b = size(x_in, 3)
x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
return out
end
function get_mgrid(sidelen) # Create a grid
iterator = (range(-1,stop=1,length = sidelen))
return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end
function model_forw(theta, z) #Forward implementation of the model
# It is kind of a decoder model where we try to reconstruct a
# target by using z_in
z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
z_in = cat(c, z_rep, dims = 3)
z_in = (permutedims(z_in, (3,2,1)))
z = batched_linear(theta[1], z_in) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[2], z) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[3], z) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[4], z)
end
function loss(theta, z, x) # Compute mean squared error loss
x_hat = model_forw(theta, z)
L = mean(sum((x_hat- x).^2, dims = 2))
end
function loss_train(theta,x)
z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_origin = @diff loss(theta, z, x) # Feed zero latent to model and take the gradient w.r.t. it
z = -grad(derivative_origin, z) # New latent point as negative gradient
x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
L = mean((x_hat- x).^2) # Compute mean squared error loss
end
num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64
x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)
mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = Param(atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector as Param type
derivative_origin = @diff loss(theta, z, x) # This works fine
println(derivative_origin)
derivative_model = @diff loss_train(theta,x) # This might work but takes too much time (I waited for 10 min and did not obtain an output)
Is there any implementation detail that I have missed and that makes my code run extremely slowly? Even taking the derivative of …
@BariscanBozkurt, can you try it with a smaller batch size for at least 2 iterations? In my case, the first pass through the model takes significantly longer. Currently, the first 10 iterations complete in approximately 100 seconds, whereas the next 10 iterations take approximately 15 seconds. If we assume only the first iteration is slow (which I suspect is due to precompilation), then it would imply that the first iteration takes approximately 135 seconds whereas the other iterations take 1.5 seconds each; in other words, the first iteration is approximately 90 times slower. Maybe that is the case for your model too, and it would speed up significantly after the first iteration.
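A quick way to separate compilation time from steady-state time is to time the same call twice; this sketch reuses loss_train, theta, and x from the example above (my addition, not from the thread):
@time derivative_model = @diff loss_train(theta, x)   # first call: includes JIT compilation
@time derivative_model = @diff loss_train(theta, x)   # second call: the per-iteration cost that matters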
Hi @Kausta. Thank you for your quick reply. I think I understand the problem now. The problem is not about precompilation, since the other iterations also take a lot of time. It is most probably due to the custom function …
Disregard my previous comment. In my second example code, the slowdown comes from the hcat function inside model_forw:
using Knet
using Statistics: mean
atype = Knet.atype()
Knet.seed!(0)
function batched_linear(theta, x_in; atype = KnetArray{Float32})
# """
# multiply a weight matrix of size (O, I) with a batch of matrices
# of size (I, W, B) to have an output of size (O, W, B),
# where B is the batch size.
# size(theta) = (O, I)
# size(x_in) = (I, W, B)
# """
o = size(theta,1)
w = size(x_in, 2)
b = size(x_in, 3)
x_in_reshaped = reshape(x_in, size(x_in,1), w*b)
out = reshape(theta * x_in_reshaped, size(theta,1), w, b)
return out
end
function get_mgrid(sidelen) # Create a grid
iterator = (range(-1,stop=1,length = sidelen))
return Array{Float64}(hcat([[i,j] for i = iterator, j = iterator]...)');
end
function model_forw(theta, z_rep) #Forward implementation of the model
# It is kind of a decoder model where we try to reconstruct a
# target by using z_in
# z_rep = hcat([z for _ = 1:size(c,2)]...) # c is image coordinate matrix defined globally below
z_in = cat(c, z_rep, dims = 3)
z_in = (permutedims(z_in, (3,2,1)))
z = batched_linear(theta[1], z_in) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[2], z) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[3], z) .+ 0.001
z = sin.(30 * z)
z = batched_linear(theta[4], z)
end
function loss(theta, z, x) # Compute mean squared error loss
x_hat = model_forw(theta, z)
L = mean(sum((x_hat- x).^2, dims = 2))
end
function loss_train(theta,x)
z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector (not a Param this time; z_rep below is the Param)
z_rep = Param(atype(hcat([z for _ = 1:size(c,2)]...)))
derivative_origin = @diff loss(theta, z_rep, x) # Feed zero latent to model and take the gradient w.r.t. it
z = -grad(derivative_origin, z_rep) # New latent point as negative gradient
x_hat = model_forw(theta, z) # Reconstruct the target w.r.t. new latent
L = mean((x_hat- x).^2) # Compute mean squared error loss
end
num_latent = 32
i = 34
o1 = 256
o2 = 256
o3 = 256
o4 = 1
w = 784
batch_size = 64
x = atype(randn(o4,w,batch_size)) # Target
# Model Weights : theta1, ..., theta4
theta1 = Param(atype(randn(o1,i)))
theta2 = Param(atype(randn(o2,o1)))
theta3 = Param(atype(randn(o3,o2)))
theta4 = Param(atype(randn(o4,o3)))
# Model Weight List
theta = []
push!(theta, theta1)
push!(theta, theta2)
push!(theta, theta3)
push!(theta, theta4)
mgrid = get_mgrid(28) # Create grid for generating image coordinate matrix c as below
c = atype(permutedims(repeat(mgrid,1,1,batch_size),(3,1,2))); # Image Coordinates
z = (atype(zeros(batch_size, 1, num_latent))) # Zero initial latent vector
z_rep = Param(hcat([z for _ = 1:size(c,2)]...)) # Make z_rep Param type this time
# The following line (derivative_origin ) works fine again. However, I do not want to obtain the gradient
# w.r.t. z_rep actually. I need the gradient w.r.t z !!!
derivative_origin = @diff loss(theta, z_rep, x)
# The following line to take the derivative w.r.t. model weights is fast now.
derivative_model = @diff loss_train(theta,x)
Here, instead of defining …
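A note on the gradient that is actually wanted (my own suggestion, not from the thread): since z_rep merely repeats z along its second dimension, the gradient with respect to z can be recovered from the gradient with respect to z_rep by summing over that dimension:
g_rep = grad(derivative_origin, z_rep)   # size (batch_size, 784, num_latent)
g_z   = sum(g_rep, dims = 2)             # size (batch_size, 1, num_latent), i.e. the gradient w.r.t. z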
I found my workaround solution. Instead of using hcat, I repeat z via a 1x1 convolution (see below):
using Knet
using Statistics: mean
atype = Knet.atype()
one_conv_weight = atype(ones(1,1,1,784)) #Globally define convolution weights of all ones
num_latent = 32
batch_size = 64
z = Param(atype(zeros(batch_size, 1, num_latent))) #size : (64,1,32)
# We won't use the following line which includes hcat function to repeat z
# 784 times. Instead, we utilize 1x1 convolution.
# z_rep = hcat([z for _ = 1:784]...) # size : (64,784,32)
z_ = copy(z) # Create a copy of z, so that z_ is not param type
z_ = permutedims(reshape(z_,64,1,1,32),(4,3,2,1)) # size : (32,1,1,64)
z_ = conv4(one_conv_weight, z_)[:,1,:,:] # size : (32, 784, 64)
z_rep = permutedims(z_, (3,2,1)) # size : (64, 784, 32)
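The same trick can be packaged into a small helper so it can replace the hcat line inside model_forw; this is a sketch with a hypothetical name conv_repeat, and it only wraps the operations shown above:
function conv_repeat(z, n; atype = Knet.atype())
    b, _, l = size(z)                                        # (batch, 1, latent)
    w = atype(ones(1, 1, 1, n))                              # all-ones 1x1 convolution weights, n output channels
    zp = permutedims(reshape(z, b, 1, 1, l), (4, 3, 2, 1))   # (latent, 1, 1, batch)
    zc = conv4(w, zp)[:, 1, :, :]                            # (latent, n, batch)
    return permutedims(zc, (3, 2, 1))                        # (batch, n, latent), same as the hcat result
end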
@Kausta found a bug in the cat/uncat higher-order gradients while implementing R1 regularization. I am moving from email to this GitHub issue to follow up. Here is his error description: …
Here are his references on PyTorch/TF implementations: …
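For context on the issue title, a minimal sketch of an R1 penalty in Knet, which is the pattern the loss2/test functions earlier in the thread compute; the names d, x, and the coefficient γ are my own assumptions, not code from the report:
using Knet
function r1_penalty(d, x; γ = 10.0)
    xp = isa(x, Param) ? x : Param(x)                # track the real batch
    t  = @diff sum(d(xp))                            # scalar output so grad(t, xp) is defined
    gx = grad(t, xp)                                 # gradient of the discriminator w.r.t. its input
    return (γ / 2) * sum(abs2.(gx)) / size(x)[end]   # squared gradient norm averaged over the batch
end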