Freeze layers for transfer learning #3247
base: master
Conversation
No Taskcluster jobs started for this pull request. The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the default branch) does not allow starting tasks for this pull request.
@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Josh's thesis -- Chapter 8) we found that tuning all layers works quite a bit better than freezing any of them, see e.g. this figure:
@ftyers I'm working on it right now :) But the approach I'm suggesting is a bit different from yours; it uses both steps. My transfer-learning workflow would look like this: first freeze all transferred layers and train only the last layer, then run a normal training over the complete network to fine-tune all layers.
Just a side note on the topic: I'm not reinitializing the last layer if possible, because in my experiments with German I got better results than with random initialization of the last layer.
@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers? I'd be interested in seeing the results when you have them!
Exactly. The first test was not as good as planned:
Not sure why; maybe some training randomness, because I don't think this approach should lead to worse results.
I ran another test; this time I tried your approach with dropping and reinitializing the last layer.
Here you can see an improvement when using the frozen transfer-learning approach. So I would say: if the network architecture did not change, it's faster to train with the English weights directly (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen network. Of course we would need to do some more tests before we can say this for sure.
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
@DanBmh -- even though my previous research with DeepSpeech seems to point to frozen transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP that shows frozen transfer works very well. So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.
@DanBmh -- I don't see why you need a new flag for `load_frozen_graph` here.
Problem is that when loading the checkpoint for the second training, the frozen layers have no variables for the Adam optimizer in the checkpoint.
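For context: Adam keeps per-variable slot tensors ("m" and "v"), and those slots are only created for variables that are actually optimized, so a frozen-layer run writes a checkpoint without them. A minimal sketch of the effect, with illustrative variable names rather than the real DeepSpeech graph:

import tensorflow.compat.v1 as tfv1
tfv1.disable_eager_execution()

frozen = tfv1.get_variable("layer_1/w", initializer=1.0)
trained = tfv1.get_variable("layer_6/w", initializer=1.0)
loss = (frozen + trained) ** 2

opt = tfv1.train.AdamOptimizer(0.001)
opt.minimize(loss, var_list=[trained])  # Adam slots are created only for layer_6/w

print([v.name for v in tfv1.global_variables() if "Adam" in v.name])
# e.g. ['layer_6/w/Adam:0', 'layer_6/w/Adam_1:0'] -- nothing for the frozen layer_1/w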
@DanBmh -- Firstly, this is a great addition and I'd like to see it merged!
Here are some issues I have with the current implementation:
(1) I don't foresee any scenario where the frozen layers are not identical to the transferred layers. For example, if I'm transferring four layers from model A and re-initializing the last 2 layers for model B, then if I freeze any layers from model A, I would freeze all four of them, not just a subset.
In this case, we can change the `freeze_source_layers` flag to a boolean, and just use the `drop_source_layers` flag to deduce which layers to freeze. Otherwise, we are basically duplicating the `drop_source_layers` flag when we don't have to.
(2) I also really don't like the `load_frozen_graph` flag, because it requires you to know the training history of some checkpoints. Is there a way to get around this? Can we maybe re-initialize the Adam tensors before the model is saved to disk, after a training run? I think this flag will trip a lot of people up in use. If someone gives me a checkpoint to use as a starting point for further transfer learning, I don't expect to have to know anything about how those checkpoints were trained.
@DanBmh Please don't let that sink, if you have some time to rebase :)
We might change our opinion here, right @reuben?
Force-pushed from 459af0f to d2b0d9b
@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users that they were reinitialized because they are not in the checkpoint? I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving at the moment, because we would need to reinitialize them every time before we save a checkpoint (some might want to load intermediate checkpoints for some reason).
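A rough sketch of that load-time behaviour (a sketch only, assuming a plain tfv1 session; the helper name and logging are made up and not the actual checkpoints.py code):

import tensorflow.compat.v1 as tfv1

def load_allowing_missing_adam(session, checkpoint_path):
    # Restore whatever exists in the checkpoint and initialize the rest,
    # e.g. Adam slots of layers that were frozen when the checkpoint was written.
    reader = tfv1.train.load_checkpoint(checkpoint_path)
    vars_in_ckpt = set(reader.get_variable_to_shape_map().keys())

    load_vars = [v for v in tfv1.global_variables() if v.op.name in vars_in_ckpt]
    init_vars = [v for v in tfv1.global_variables() if v.op.name not in vars_in_ckpt]

    if init_vars:
        tfv1.logging.info("Not found in checkpoint, initializing: %s",
                          [v.op.name for v in init_vars])
    tfv1.train.Saver(load_vars).restore(session, checkpoint_path)
    session.run(tfv1.variables_initializer(init_vars))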
Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine grained instead of the huge train.py which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code. |
Force-pushed from 67082eb to 4b20d71
@JRMeyer ping
I've gone over it and asked some questions. Some of these might be things that really need to be changed, and others are just suggestions. Some of them might also not be valid, e.g. there might be some reasons why you've done it the way you have that I am not aware of.
 import tensorflow.compat.v1 as tfv1

 from .flags import FLAGS
-from .logging import log_info, log_error, log_warn
+from .logging import log_error, log_info, log_warn
Why change the order here?
Autosort?
I don't know what the DS policy is for that, I'd have to ask.
print_msg = True
if print_msg:
    msg = "Some Adam tensors are missing, they will be initialized automatically."
    log_info(msg)
Can you not just do log_info("...")?
Maybe something like:

missing = []
if ...:
    missing.append(v)

if missing:
    for v in missing:
        log_info("Missing... {}".format(v))
The split into two parts was an autoformatting issue; black would have split this into three lines.
I didn't print every missing layer, because there are messages later that state exactly which layers were reinitialized.
@@ -93,6 +93,7 @@ def create_flags():
     # Transfer Learning

     f.DEFINE_integer('drop_source_layers', 0, 'single integer for how many layers to drop from source model (to drop just output == 1, drop penultimate and output ==2, etc)')
+    f.DEFINE_integer('freeze_source_layers', 0, 'use same value as above to freeze the other layers')
This description is a bit vague, what is it trying to say?
Intended use is together with the 'drop_source_layers' flag: if we want to drop the last layer, we can freeze the rest of the net by setting 'freeze_source_layers' to the same value of 1.
Perhaps you could be more explicit in the help text, e.g. say that you have to use the --drop_source_layers flag. Maybe say something like "freeze N remaining layers after using --drop_source_layers"? But even that is a bit unclear. Try and describe what the flag does, and then maybe give an example of how it can be used together with --drop_source_layers.
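One possible wording along those lines (purely illustrative, not the text that ended up in the PR):

f.DEFINE_integer('freeze_source_layers', 0,
                 'freeze the weights of all transferred layers except the N uppermost ones; '
                 'use the same value as --drop_source_layers to train only the re-initialized layers '
                 '(0 disables freezing)')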
Better now?
# Filter out layers if we want to freeze some
if FLAGS.freeze_source_layers > 0:
    filter_vars = drop_freeze_number_to_layers(FLAGS.freeze_source_layers, "freeze")
    new_train_vars = list(train_vars)
Is there a reason not to build up new_train_vars from empty, something like:

new_train_vars = []
for tv in train_vars:
    if tv.name not in filter_vars:
        new_train_vars.append(tv)
train_vars = new_train_vars
Seemed more intuitive: we want to train all except the filtered layers.
Your example doesn't work by the way, because the `filter_vars` contain names like `layer_1` and `train_vars` have the full layer name, e.g. `layer_1:dense:0`.
Hence the "something like" :) -- You could of course use `find()` or something like that. I have no particularly strong feeling about it, but err on the side of simplicity.
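For reference, a compact sketch of that substring-style check (just an illustration of the idea discussed above, not the code that ended up in the PR):

# Keep only variables whose names do not match one of the frozen layer prefixes
# (filter_vars holds short names like "layer_1", tv.name is the full variable name).
train_vars = [
    tv for tv in train_vars
    if not any(fv in tv.name for fv in filter_vars)
]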
 # Compute gradients for model parameters using tower's mini-batch
-gradients = optimizer.compute_gradients(avg_loss)
+gradients = optimizer.compute_gradients(avg_loss, var_list=train_vars)
I'd like someone else to take a look at this.
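For context, compute_gradients only returns gradient/variable pairs for the variables passed via var_list, so apply_gradients never touches the frozen layers. A tiny self-contained illustration with toy variables (not the actual train.py graph):

import tensorflow.compat.v1 as tfv1
tfv1.disable_eager_execution()

a = tfv1.get_variable("layer_1/w", initializer=1.0)  # frozen
b = tfv1.get_variable("layer_6/w", initializer=1.0)  # trainable
loss = a * b

opt = tfv1.train.AdamOptimizer(0.001)
grads = opt.compute_gradients(loss, var_list=[b])
print([v.op.name for _, v in grads])  # ['layer_6/w'] -- layer_1/w gets no update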
@@ -19,32 +19,33 @@ def _load_checkpoint(session, checkpoint_path, allow_drop_layers, allow_lr_init=
     # compatibility with older checkpoints.
     lr_var = set(v for v in load_vars if v.op.name == 'learning_rate')
     if lr_var and ('learning_rate' not in vars_in_ckpt or
-                        (FLAGS.force_initialize_learning_rate and allow_lr_init)):
+                   (FLAGS.force_initialize_learning_rate and allow_lr_init)):
Spacing only change...
Fixing: PEP 8: E127 continuation line over-indented for visual indent
Force-pushed from 4b20d71 to a0d5597
Are we replacing the TensorFlow ?
Currently when doing transfer learning we reinitialize the uppermost layers randomly, and afterwards training continues normally. But this has the problem that gradients are also propagated through the lower layers we would like to keep, and change them too.
These changes allow training only the reinitialized uppermost layers. After this you can start a new training, further optimizing the complete network.
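A minimal, self-contained sketch of that two-step idea in TF1 terms (the toy variables, layer names and learning rates are illustrative only, not the actual DeepSpeech training code):

import tensorflow.compat.v1 as tfv1
tfv1.disable_eager_execution()

# Toy stand-ins for a transferred layer and a re-initialized output layer.
transferred = tfv1.get_variable("layer_1/w", initializer=1.0)
new_output = tfv1.get_variable("layer_6/w", initializer=0.0)
loss = (transferred * 3.0 + new_output - 5.0) ** 2

# Step 1: freeze the transferred layers and train only the new output layer.
step1 = tfv1.train.AdamOptimizer(0.1).minimize(loss, var_list=[new_output])

# Step 2 (in the real workflow a separate run that loads the step-1 checkpoint):
# fine-tune the whole network with the default var_list (all trainable variables).
step2 = tfv1.train.AdamOptimizer(0.01).minimize(loss)

with tfv1.Session() as sess:
    sess.run(tfv1.global_variables_initializer())
    for _ in range(200):
        sess.run(step1)  # only layer_6/w moves, layer_1/w stays at 1.0
    for _ in range(200):
        sess.run(step2)  # now both variables are updated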